sathyaprakashg commented on pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#issuecomment-835222588


   > > @sathyaprakashg and others: trying to understand the use-case here. I 
understand its related to deltastreamer receiving events in old schema after 
Hudi's dataset schema got evolved. what's the schema from schema provider when 
source is producing events in old schema (after schema got evolved w/ hudi 
dataset)? if the schema provider's schema is updated, I guess there is no need 
to store the writer schema w/ payload.
   > > AvroConversionUtils.createDataFrame() will ensure to convert the JavaRDD 
w/ old schema to Dataset w/ new schema if schemaProvider.SourceSchema() has the 
evolved schema.
   
   The issue with schema evolution happens in `HoodieAvroUtils.avroToBytes` and 
`HoodieAvroUtils.bytesToAvro`. Consider a scenario where there are two 
versions of a schema in the schema registry. The 2nd (latest) version adds a 
new field, but data is still arriving with schema version 1. 
   
   `HoodieAvroUtils.avroToBytes` uses the schema embedded in the data (i.e. the 
version 1 schema) to serialize the avro record to bytes. `HoodieAvroUtils.bytesToAvro` 
uses the latest schema from the schema registry (version 2) to deserialize the bytes 
back to avro. This fails because the v1 schema was used to write the bytes, but the 
v2 schema is being used to read them. To solve this, we need both the v1 (writer) 
schema and the v2 (reader) schema when converting the bytes back to avro. We can get 
the v2 schema from the schema registry, but to get the v1 schema, we were trying to 
store the writer schema as part of the payload itself.
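   For reference, this is Avro's standard schema-resolution mechanism: the datum 
reader must be constructed with both the writer schema and the reader schema. A 
minimal sketch (the class and schema names here are illustrative, not Hudi code), 
assuming the field added in v2 has a default value so resolution can fill it in:

   ```java
   import java.io.ByteArrayOutputStream;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.BinaryEncoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.EncoderFactory;

   public class SchemaResolutionSketch {
     public static void main(String[] args) throws Exception {
       // v1: the writer schema; v2 adds "city" with a default so resolution works.
       Schema v1 = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
           + "{\"name\":\"name\",\"type\":\"string\"}]}");
       Schema v2 = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
           + "{\"name\":\"name\",\"type\":\"string\"},"
           + "{\"name\":\"city\",\"type\":\"string\",\"default\":\"unknown\"}]}");

       // Serialize with the writer (v1) schema, as avroToBytes does.
       GenericRecord rec = new GenericData.Record(v1);
       rec.put("name", "alice");
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
       new GenericDatumWriter<GenericRecord>(v1).write(rec, encoder);
       encoder.flush();

       // Deserialize: a reader built with ONLY v2 (what bytesToAvro effectively
       // does today) fails on v1 bytes. Passing writer=v1 and reader=v2 lets
       // Avro resolve the two schemas and fill "city" from its default.
       BinaryDecoder decoder =
           DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
       GenericRecord decoded =
           new GenericDatumReader<GenericRecord>(v1, v2).read(null, decoder);
       System.out.println(decoded); // name from the bytes, city from the default
     }
   }
   ```

   This is why we need the v1 schema at read time: the registry only hands us the 
latest version, so the writer schema has to travel with the payload.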
   
   Please let me know if this is still unclear.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
