xkrogen commented on a change in pull request #31133:
URL: https://github.com/apache/spark/pull/31133#discussion_r570355910
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
It seems that the examples of `avro.serde.type` and `avro.serde.skip.bytes` provide a good reason why we shouldn't pick up `AvroTableProperties.*` or `startsWith("avro.")`. These properties describe the specific files: for example, some partitions could be serialized with the Confluent Avro serializer (thus requiring the `avro.serde` properties) while other partitions use normal Avro behavior.

Schemas seem to be a special case. Looking at the full list of properties in Hive 2.3:
```
SCHEMA_LITERAL("avro.schema.literal"),
SCHEMA_URL("avro.schema.url"),
SCHEMA_NAMESPACE("avro.schema.namespace"),
SCHEMA_NAME("avro.schema.name"),
SCHEMA_DOC("avro.schema.doc"),
AVRO_SERDE_SCHEMA("avro.serde.schema"),
SCHEMA_RETRIEVER("avro.schema.retriever");
```
It seems that we would definitely want the first 5 at the table level, and probably the last one (?). I don't understand `avro.serde.schema` and how it relates to the others, but it might be reasonable to say that we should take everything matching `startsWith("avro.schema.")`.

Separate from the question of _which_ properties to pick up, there is the question of whether to hard-code the strings or use the enum values. @dongjoon-hyun, can you explain more fully why you don't want to depend on `AvroTableProperties`? We already have another reference to it in the Spark codebase [within HiveShim](https://github.com/apache/spark/blob/6a194f197819e2970f619ed92321e0a0b716bf37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala#L92).
I don't understand the argument that some older branches may not have this class available. If an older branch has a Hive dependency that is too old and we need to cherry-pick this change there, then we can hard-code the strings in the cherry pick, but I don't see a reason why we can't use the enum in newer versions.

##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
@attilapiros, would you mind looking into what `avro.serde.schema` is used for? Understanding it better would give me (at least) more confidence that `avro.schema` is the correct prefix.

##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil { private[hive] object DeserializerLock private[hive] object HadoopTableReader extends HiveInspectors with Logging { + + val avroTableProperties = AvroTableProperties.values().map(_.getPropName()).toSet

Review comment:
Based on the new discussion, it sounds like we can leave that as a follow-on for later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
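For illustration, the `startsWith("avro.schema.")` alternative discussed in the thread above could be sketched as follows. The object and method names here are hypothetical, not from the PR; this is only a sketch of the proposed prefix rule, under the assumption that table properties arrive as a plain `Map[String, String]`.

```scala
// Hypothetical sketch: keep only table-level Avro *schema* properties,
// per the proposed startsWith("avro.schema.") rule, so per-partition
// serde settings such as avro.serde.type and avro.serde.skip.bytes
// are deliberately left out.
object AvroPropFilter {
  private val AvroSchemaPrefix = "avro.schema."

  def tableLevelAvroProps(props: Map[String, String]): Map[String, String] =
    props.filter { case (key, _) => key.startsWith(AvroSchemaPrefix) }
}
```

Note that this picks up `avro.schema.retriever` as well, which matches the "probably the last one" reading above, but would still exclude `avro.serde.schema`.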