hudi-bot opened a new issue, #14758: URL: https://github.com/apache/hudi/issues/14758
Today the upsert partitioner does the file sizing/bin-packing for inserts and then sends some inserts over to existing file groups to maintain file size. We can abstract all of this into strategies and some kind of pipeline abstractions, and have it also consider "affinity" to an existing file group based on, say, information stored in the metadata table. See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser for more details.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1628
- Type: Epic

---

## Comments

**10/Jun/21 04:50 — thirumalai.raj**

Hi [~vinoth] / [~satishkotha], is anyone working on this feature? When we tried to insert data into a Hudi COW table with drop-duplicates enabled using Spark Streaming (DStreams), the pipeline wasn't scaling: the min/max pruning in HoodieBloomIndex wasn't efficient, and the exploded RDD size was >5x, which caused a bottleneck in the shuffle stage. If no one has started working on this, I would like to understand the requirements better and contribute to it.

---

**23/Jun/21 22:06 — vinoth**

[~thirumalai.raj] apologies for the delay. This is up for grabs if you are interested. This will be a pretty popular addition, I can imagine.

---

**26/Jun/21 15:11 — thirumalai.raj**

[~vinoth], I am interested in taking up this task and will start working on it.

---

**06/Jan/22 16:45 — vinoth**

[~guoyihua] assigning to you to drive this forward. cc [~thirumalai.raj], please let us know if you are still interested in pursuing this.

---

**09/Jan/22 16:21 — thirumalai.raj**

[~vinoth], sorry, I am a bit busy with my startup. [~guoyihua] can take this forward.

---

**18/Jan/22 16:50 — guoyihua**

My approach to this:

- For new file writes (insert, upsert, etc.), the sorting is handled at the write handle level:
  - For upsert, HoodieMergeHandle does the file write and all records are known before writing, so we can sort the records on a single column or multiple columns (space-filling curve) before the actual write. This will add memory pressure.
  - For insert, HoodieCreateHandle does the file write and performs dynamic file sizing, so it is not known in advance when a file will be closed. In this case, we need to sort the records within each Spark RDD partition beforehand.
- For data-column sorting, serde of the record payload is required, which adds overhead to ingestion as well.
- For the partitioner, we need to abstract a better Partitioner interface so that sorting and bucketing logic does not leak into the core write path.
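The "space curve" sorting mentioned in the last comment can be made concrete with a small sketch. The following is a minimal, self-contained illustration (not actual Hudi code; `Record`, `colA`, and `colB` are hypothetical names) of computing a Z-order value by bit-interleaving two non-negative integer columns and sorting an in-memory partition of records by it before they would be handed to a write handle:

```java
import java.util.*;

// Hedged sketch of the "space curve" idea: interleave the bits of two column
// values into a Z-order value, then sort one partition's records by that value.
// Assumes non-negative column values; names here are illustrative, not Hudi APIs.
public class ZOrderSortDemo {

    /** Interleave the low 32 bits of x and y into a 64-bit Z-value. */
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    /** A record with two sort columns (stand-in for a Hudi record payload). */
    static final class Record {
        final int colA, colB;
        Record(int a, int b) { colA = a; colB = b; }
        @Override public String toString() { return "(" + colA + "," + colB + ")"; }
    }

    /** Sort an in-memory partition of records by their Z-value. */
    static List<Record> sortByZ(List<Record> partition) {
        List<Record> out = new ArrayList<>(partition);
        out.sort(Comparator.comparingLong(r -> zValue(r.colA, r.colB)));
        return out;
    }

    public static void main(String[] args) {
        List<Record> part = Arrays.asList(
            new Record(3, 0), new Record(0, 1), new Record(1, 1), new Record(0, 0));
        System.out.println(sortByZ(part)); // prints [(0,0), (0,1), (1,1), (3,0)]
    }
}
```

In a Spark pipeline this sort would run per RDD partition (e.g. via `mapPartitions`) before the create/merge handle writes the file, which is what avoids knowing the file-close point in advance for inserts.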
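Going back to the issue description, the file sizing/bin-packing behavior could be captured behind a pluggable strategy so it stops leaking into the core write path. The sketch below is hypothetical (the `InsertPartitionStrategy` interface, `FileGroup` class, and `FillSmallFilesFirst` strategy are invented names, not Hudi APIs) and shows one strategy that tops up under-sized file groups to a target size before spilling inserts into new file groups:

```java
import java.util.*;

// Hypothetical sketch (not actual Hudi APIs): a pluggable strategy that decides,
// for each insert batch, how bytes are bin-packed into existing vs. new file
// groups, so the upsert partitioner no longer hard-codes the sizing logic.
public class BinPackingStrategyDemo {

    /** A file group with its current size in bytes (simplified). */
    static final class FileGroup {
        final String id;
        final long sizeBytes;
        FileGroup(String id, long sizeBytes) { this.id = id; this.sizeBytes = sizeBytes; }
    }

    /** Strategy contract: map an estimated insert payload onto file groups. */
    interface InsertPartitionStrategy {
        /** Returns fileGroupId -> bytes assigned; "NEW-n" ids denote new file groups. */
        Map<String, Long> assign(long insertBytes, List<FileGroup> existing, long targetFileSize);
    }

    /** Fill the smallest existing file groups up to the target size, then spill to new groups. */
    static final class FillSmallFilesFirst implements InsertPartitionStrategy {
        @Override
        public Map<String, Long> assign(long insertBytes, List<FileGroup> existing, long targetFileSize) {
            Map<String, Long> plan = new LinkedHashMap<>();
            long remaining = insertBytes;
            // Smallest files first, so under-sized file groups get topped up.
            List<FileGroup> sorted = new ArrayList<>(existing);
            sorted.sort(Comparator.comparingLong(f -> f.sizeBytes));
            for (FileGroup fg : sorted) {
                if (remaining <= 0) break;
                long room = targetFileSize - fg.sizeBytes;
                if (room <= 0) continue;
                long take = Math.min(room, remaining);
                plan.put(fg.id, take);
                remaining -= take;
            }
            int newIdx = 0;
            while (remaining > 0) {
                long take = Math.min(targetFileSize, remaining);
                plan.put("NEW-" + newIdx++, take);
                remaining -= take;
            }
            return plan;
        }
    }

    public static void main(String[] args) {
        List<FileGroup> existing = Arrays.asList(
            new FileGroup("fg-1", 40L), new FileGroup("fg-2", 90L));
        Map<String, Long> plan = new FillSmallFilesFirst().assign(150L, existing, 100L);
        System.out.println(plan); // prints {fg-1=60, fg-2=10, NEW-0=80}
    }
}
```

An "affinity"-aware strategy as proposed in the description would be another implementation of the same interface, consulting metadata-table information to prefer certain file groups instead of simply filling the smallest ones first.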
