[GitHub] [hudi] sathyaprakashg commented on a change in pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

GitBox Sat, 07 Nov 2020 15:04:15 -0800


sathyaprakashg commented on a change in pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#discussion_r519230640




##########
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##########
@@ -364,4 +366,40 @@ object AvroConversionHelper {
         }
     }
   }
+
+  /**
+   * Remove namespace from fixed field.
+   * org.apache.spark.sql.avro.SchemaConverters.toAvroType method adds namespace to fixed avro field
+   * https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L177
+   * So, we need to remove that namespace so that a reader schema without namespace does not throw an error like this one
+   * org.apache.avro.AvroTypeException: Found hoodie.source.hoodie_source.height.fixed, expecting fixed
+   *
+   * @param schema Schema from which namespace needs to be removed for fixed fields
+   * @return input schema with namespace removed for fixed fields, if any
+   */
+  def removeNamespaceFromFixedFields(schema: Schema): Schema  ={

Review comment:
   @n3nash @bvaradar I checked the three steps you mentioned, and they work
   fine when the reader and writer schemas have the same set of fields (and the
   writer schema has a namespace in the fixed field).
   
   If the reader schema has an extra field, this approach does not work. Here is an
   [example](https://gist.github.com/sathyaprakashg/f423291be7be6f9d96b9cb850fc72edf)
   that has an extra field in the reader schema and fails with an error. When the
   schema evolves, the table schema (reader schema) may have more or fewer fields
   than the writer schema (MOR log file schema). So if we implement this approach,
   it would work only when the two schemas are the same (apart from the extra
   namespace information in the writer schema). Please let me know how to handle
   this, or correct me if the approach I took is wrong.
   
   Just to recap, the issue we are trying to solve is this: in the existing code,
   when we write a fixed Avro field to a MOR log file, it is written with extra
   namespace information in one of the flows (Transformation without
   userProvidedSchema) but not in the other two. With this PR, the extra
   namespace information will no longer be written.
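
   For illustration, stripping the namespace from fixed types could look roughly like the sketch below. This is not the code in this PR: `FixedNamespaceSketch` and `stripFixedNamespace` are hypothetical names, and the sketch assumes an Avro version that provides `Schema.Field.defaultVal()`. It drops field aliases and custom properties and does not handle recursive record types.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical helper for illustration only; not the PR's implementation.
object FixedNamespaceSketch {

  // Returns a copy of `schema` in which every fixed type has no namespace.
  // Assumption: the schema is not recursive (a record never contains itself).
  def stripFixedNamespace(schema: Schema): Schema = schema.getType match {
    case Schema.Type.FIXED =>
      // A null namespace makes the full name equal to the simple name.
      Schema.createFixed(schema.getName, schema.getDoc, null, schema.getFixedSize)
    case Schema.Type.UNION =>
      Schema.createUnion(schema.getTypes.asScala.map(stripFixedNamespace).asJava)
    case Schema.Type.ARRAY =>
      Schema.createArray(stripFixedNamespace(schema.getElementType))
    case Schema.Type.MAP =>
      Schema.createMap(stripFixedNamespace(schema.getValueType))
    case Schema.Type.RECORD =>
      // Fields cannot be shared between records, so build new Field objects.
      // Aliases and custom properties are not copied in this sketch.
      val fields = schema.getFields.asScala.map { f =>
        new Schema.Field(f.name, stripFixedNamespace(f.schema), f.doc, f.defaultVal())
      }
      Schema.createRecord(
        schema.getName, schema.getDoc, schema.getNamespace, schema.isError, fields.asJava)
    case _ =>
      schema // primitives and enums are returned unchanged
  }
}
```

   Only the fixed type's namespace is dropped; the enclosing record keeps its own name and namespace.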
   
   Since this extra namespace information is written only to MOR log files and
   not to parquet files, one possible solution is for users to run compaction
   before running the job with the upgraded version of Hudi. Compaction is not
   mandatory for upgrading to this version; it is needed only if the schema has
   a fixed field and the Transformation without userProvidedSchema flow was
   used.
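
   To decide whether a table is affected, a user could check whether its Avro schema contains any fixed type. The sketch below is one way to do that; `FixedFieldCheck` and `containsFixed` are hypothetical names, not part of Hudi, and the sketch assumes a non-recursive schema.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical helper for illustration only; not part of Hudi.
object FixedFieldCheck {

  // True if `schema` is, or contains at any depth, a fixed type.
  // Assumption: the schema is not recursive.
  def containsFixed(schema: Schema): Boolean = schema.getType match {
    case Schema.Type.FIXED  => true
    case Schema.Type.RECORD => schema.getFields.asScala.exists(f => containsFixed(f.schema))
    case Schema.Type.UNION  => schema.getTypes.asScala.exists(containsFixed)
    case Schema.Type.ARRAY  => containsFixed(schema.getElementType)
    case Schema.Type.MAP    => containsFixed(schema.getValueType)
    case _                  => false
  }
}
```

   If this returns false for the table schema, the namespace issue described above should not apply, since it only concerns fixed fields.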
   
   




