guanziyue edited a comment on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977


Hi tandonraghav,
I did some similar work before; hope my experience can help you.
First, as nanash mentioned earlier, the preCombine method may be called in two cases: deduplication during ingestion, and compaction.
In the compaction process, we first read the log file, use the schema stored in the log block to construct GenericRecords, and then turn each GenericRecord into a payload. These payloads are put into a map. When we find a duplicate key (yes, the records were ingested in different commits), we call preCombine to combine the two records with the same key. This process is similar to a hash join in Spark. Finally, we get a map of payloads in which every key is unique. After that, we read records from the parquet file, use the schema the user provided in the config to construct IndexedRecords, and call combineAndGetUpdateValue to merge the payload in the map with the data from parquet.
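
A rough sketch of that two-step merge, just to illustrate the order of calls (this is not Hudi's actual compaction code; `SimplePayload` and the class below are hypothetical stand-ins for the real `HoodieRecordPayload` contract):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

// Hypothetical, simplified payload contract mirroring the two calls described above.
interface SimplePayload {
  SimplePayload preCombine(SimplePayload other);
  Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord baseRecord, Schema writerSchema);
}

class CompactionMergeSketch {

  // Step 1: collect log records into a map keyed by record key; when the same key
  // shows up again (ingested in a different commit), resolve it with preCombine.
  static Map<String, SimplePayload> collectLogRecords(List<Map.Entry<String, SimplePayload>> logRecords) {
    Map<String, SimplePayload> keyed = new HashMap<>();
    for (Map.Entry<String, SimplePayload> entry : logRecords) {
      keyed.merge(entry.getKey(), entry.getValue(),
          (existing, incoming) -> incoming.preCombine(existing));
    }
    return keyed;
  }

  // Step 2: stream base-file (parquet) records and probe the map, like the probe
  // side of a hash join; matched keys are merged via combineAndGetUpdateValue.
  static Optional<IndexedRecord> mergeWithBase(Map<String, SimplePayload> keyed,
                                               String recordKey,
                                               IndexedRecord baseRecord,
                                               Schema writerSchema) {
    SimplePayload payload = keyed.get(recordKey);
    if (payload == null) {
      return Optional.of(baseRecord); // no pending update for this key, keep the base record as-is
    }
    return payload.combineAndGetUpdateValue(baseRecord, writerSchema);
  }
}
```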
As you mentioned, the schema may not be available in preCombine. Could you hold a reference to the GenericRecord's schema, captured when the payload is constructed, as an attribute of the MongoHudiCDCPayload class? Then you can use that schema in the preCombine method. You may find that Schema in Avro 1.8.2 is not serializable, so marking this attribute as transient may be a good idea. However, that can cause the schema to be lost during ingestion, because payloads are shuffled during ingestion. In that case you can recreate the schema from the properties argument passed to preCombine; these props are actually the Hoodie write config.
Note that you may not always be able to get the schema from the config, so only trying this when the schema is null may be a good idea.
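
A minimal sketch of that idea, assuming a payload shaped roughly like yours (the class name, the `recordBytes` field, and the `"hoodie.avro.schema"` property key are assumptions; check the exact config key your Hudi version writes, and the merge logic itself is elided):

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class MongoHudiCDCPayloadSketch implements java.io.Serializable {

  private final byte[] recordBytes;   // however the payload already serializes its record
  private transient Schema schema;    // Avro 1.8.2 Schema is not Serializable, so keep it transient

  public MongoHudiCDCPayloadSketch(GenericRecord record, byte[] recordBytes) {
    this.recordBytes = recordBytes;
    this.schema = record.getSchema(); // hold the schema captured at construction time
  }

  // Called during dedup/compaction; 'other' is another payload with the same key.
  public MongoHudiCDCPayloadSketch preCombine(MongoHudiCDCPayloadSketch other, Properties props) {
    Schema writerSchema = resolveSchema(props);
    // ... decode both payloads with writerSchema, merge them, return the winner ...
    return this; // placeholder
  }

  // Only rebuild the schema from the write config when the transient field is null,
  // e.g. after the payload was shuffled during ingestion.
  private Schema resolveSchema(Properties props) {
    if (schema == null && props != null) {
      String schemaStr = props.getProperty("hoodie.avro.schema"); // assumed config key
      if (schemaStr != null) {
        schema = new Schema.Parser().parse(schemaStr);
      }
    }
    return schema;
  }
}
```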

