hudi-bot opened a new issue, #14758: URL: https://github.com/apache/hudi/issues/14758
Today the upsert partitioner does the file sizing/bin-packing for inserts and then sends some inserts over to existing file groups to maintain file size. We can abstract all of this into strategies and some kind of pipeline abstractions, and have it also consider "affinity" to an existing file group based on, say, information stored in the metadata table. See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser for more details.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1628
- Type: Epic

---

## Comments

**10/Jun/21 04:50 — thirumalai.raj**

Hi [~vinoth] / [~satishkotha], is anyone working on this feature? When we tried to insert data into a Hudi COW table with drop-duplicates enabled using Spark Streaming (DStreams), the pipeline wasn't scaling: the min/max pruning in HoodieBloomIndex wasn't efficient, and the exploded RDD size was >5x, which caused a bottleneck in the shuffle stage. If no one has started working on this, I would like to understand the requirements better and contribute to it.

---

**23/Jun/21 22:06 — vinoth**

[~thirumalai.raj] apologies for the delay. This is up for grabs if you are interested. This will be a pretty popular addition, I can imagine.

---

**26/Jun/21 15:11 — thirumalai.raj**

[~vinoth], I am interested in taking up this task and will start working on it.

---

**06/Jan/22 16:45 — vinoth**

[~guoyihua] assigning to you to drive this forward. cc [~thirumalai.raj], please let us know if you are still interested in pursuing this.

---

**09/Jan/22 16:21 — thirumalai.raj**

[~vinoth], sorry, I am a bit busy with my startup. [~guoyihua] can take this forward.

---

**18/Jan/22 16:50 — guoyihua**

My approach to this:

- For new file writes (insert, upsert, etc.), the sorting is handled at the write handle level:
  - For upsert, HoodieMergeHandle does the file write and all records are known before writing, so we can sort the records on a single column or multiple columns (space-filling curve) before the actual write. This will add memory pressure.
  - For insert, HoodieCreateHandle does the file write and performs dynamic file sizing, so it is not known in advance when a file will be closed. In this case, we need to sort the records within each Spark RDD partition beforehand.
- For data-column sorting, serde of the record payload is required, which adds overhead to ingestion as well.
- For the partitioner, we need to abstract a better Partitioner interface so that sorting and bucketing logic does not leak into the core write path.
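The "space curve" sorting mentioned in the last comment can be made concrete with a small sketch. The following is a minimal, self-contained illustration (not actual Hudi code; `Record`, `colA`, and `colB` are hypothetical names) of computing a Z-order value by bit-interleaving two non-negative integer columns and sorting an in-memory partition of records by it before they would be handed to a write handle:

```java
import java.util.*;

// Hedged sketch of the "space curve" idea: interleave the bits of two column
// values into a Z-order value, then sort one partition's records by that value.
// Assumes non-negative column values; names here are illustrative, not Hudi APIs.
public class ZOrderSortDemo {

    /** Interleave the low 32 bits of x and y into a 64-bit Z-value. */
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    /** A record with two sort columns (stand-in for a Hudi record payload). */
    static final class Record {
        final int colA, colB;
        Record(int a, int b) { colA = a; colB = b; }
        @Override public String toString() { return "(" + colA + "," + colB + ")"; }
    }

    /** Sort an in-memory partition of records by their Z-value. */
    static List<Record> sortByZ(List<Record> partition) {
        List<Record> out = new ArrayList<>(partition);
        out.sort(Comparator.comparingLong(r -> zValue(r.colA, r.colB)));
        return out;
    }

    public static void main(String[] args) {
        List<Record> part = Arrays.asList(
            new Record(3, 0), new Record(0, 1), new Record(1, 1), new Record(0, 0));
        System.out.println(sortByZ(part)); // prints [(0,0), (0,1), (1,1), (3,0)]
    }
}
```

In a Spark pipeline this sort would run per RDD partition (e.g. via `mapPartitions`) before the create/merge handle writes the file, which is what avoids knowing the file-close point in advance for inserts.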
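Going back to the issue description, the file sizing/bin-packing behavior could be captured behind a pluggable strategy so it stops leaking into the core write path. The sketch below is hypothetical (the `InsertPartitionStrategy` interface, `FileGroup` class, and `FillSmallFilesFirst` strategy are invented names, not Hudi APIs) and shows one strategy that tops up under-sized file groups to a target size before spilling inserts into new file groups:

```java
import java.util.*;

// Hypothetical sketch (not actual Hudi APIs): a pluggable strategy that decides,
// for each insert batch, how bytes are bin-packed into existing vs. new file
// groups, so the upsert partitioner no longer hard-codes the sizing logic.
public class BinPackingStrategyDemo {

    /** A file group with its current size in bytes (simplified). */
    static final class FileGroup {
        final String id;
        final long sizeBytes;
        FileGroup(String id, long sizeBytes) { this.id = id; this.sizeBytes = sizeBytes; }
    }

    /** Strategy contract: map an estimated insert payload onto file groups. */
    interface InsertPartitionStrategy {
        /** Returns fileGroupId -> bytes assigned; "NEW-n" ids denote new file groups. */
        Map<String, Long> assign(long insertBytes, List<FileGroup> existing, long targetFileSize);
    }

    /** Fill the smallest existing file groups up to the target size, then spill to new groups. */
    static final class FillSmallFilesFirst implements InsertPartitionStrategy {
        @Override
        public Map<String, Long> assign(long insertBytes, List<FileGroup> existing, long targetFileSize) {
            Map<String, Long> plan = new LinkedHashMap<>();
            long remaining = insertBytes;
            // Smallest files first, so under-sized file groups get topped up.
            List<FileGroup> sorted = new ArrayList<>(existing);
            sorted.sort(Comparator.comparingLong(f -> f.sizeBytes));
            for (FileGroup fg : sorted) {
                if (remaining <= 0) break;
                long room = targetFileSize - fg.sizeBytes;
                if (room <= 0) continue;
                long take = Math.min(room, remaining);
                plan.put(fg.id, take);
                remaining -= take;
            }
            int newIdx = 0;
            while (remaining > 0) {
                long take = Math.min(targetFileSize, remaining);
                plan.put("NEW-" + newIdx++, take);
                remaining -= take;
            }
            return plan;
        }
    }

    public static void main(String[] args) {
        List<FileGroup> existing = Arrays.asList(
            new FileGroup("fg-1", 40L), new FileGroup("fg-2", 90L));
        Map<String, Long> plan = new FillSmallFilesFirst().assign(150L, existing, 100L);
        System.out.println(plan); // prints {fg-1=60, fg-2=10, NEW-0=80}
    }
}
```

An "affinity"-aware strategy as proposed in the description would be another implementation of the same interface, consulting metadata-table information to prefer certain file groups instead of simply filling the smallest ones first.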
