xkrogen commented on a change in pull request #31133:
URL: https://github.com/apache/spark/pull/31133#discussion_r570355910
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
It seems that the examples of `avro.serde.type` and `avro.serde.skip.bytes` provide a good reason why we shouldn't pick up `AvroTableProperties.*` or `startsWith("avro.")`. These properties describe the specific files: for example, some partitions could be serialized with the Confluent Avro serializer (thus requiring the `avro.serde` properties) while other partitions use normal Avro behavior.

Schemas seem to be a special case. Looking at the full list of properties in Hive 2.3:
```
SCHEMA_LITERAL("avro.schema.literal"),
SCHEMA_URL("avro.schema.url"),
SCHEMA_NAMESPACE("avro.schema.namespace"),
SCHEMA_NAME("avro.schema.name"),
SCHEMA_DOC("avro.schema.doc"),
AVRO_SERDE_SCHEMA("avro.serde.schema"),
SCHEMA_RETRIEVER("avro.schema.retriever");
```
It seems that we would definitely want the first 5 at the table level, and probably the last one (?). I don't understand `avro.serde.schema` and how it relates to the others, but it might be reasonable to say that we should take everything matching `startsWith("avro.schema.")`.

Separate from the question of _which_ properties to pick up, there is the question of whether to hard-code the strings or use the enum values. @dongjoon-hyun, can you explain more fully why you don't want to depend on `AvroTableProperties`? We already have another reference to it in the Spark codebase [within HiveShim](https://github.com/apache/spark/blob/6a194f197819e2970f619ed92321e0a0b716bf37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala#L92).
I don't understand the argument that some older branches may not have this class available. If an older branch has a Hive dependency that is too old and we need to cherry-pick this change there, then we can hard-code the strings in the cherry pick, but I don't see a reason why we can't use the enum in newer versions.

##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
@attilapiros, would you mind looking into what `avro.serde.schema` is used for? Understanding it better would give me (at least) more confidence that `avro.schema` is the correct prefix.

##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
Based on the new discussion, it sounds like we can leave that as a follow-on for later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
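For illustration, the `startsWith("avro.schema.")` alternative discussed in the thread above could be sketched as follows. The object and method names here are hypothetical, not from the PR; this is only a sketch of the proposed prefix rule, under the assumption that table properties arrive as a plain `Map[String, String]`.

```scala
// Hypothetical sketch: keep only table-level Avro *schema* properties,
// per the proposed startsWith("avro.schema.") rule, so per-partition
// serde settings such as avro.serde.type and avro.serde.skip.bytes
// are deliberately left out.
object AvroPropFilter {
  private val AvroSchemaPrefix = "avro.schema."

  def tableLevelAvroProps(props: Map[String, String]): Map[String, String] =
    props.filter { case (key, _) => key.startsWith(AvroSchemaPrefix) }
}
```

Note that this picks up `avro.schema.retriever` as well, which matches the "probably the last one" reading above, but would still exclude `avro.serde.schema`.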