hudi-bot opened a new issue, #14611:
URL: https://github.com/apache/hudi/issues/14611
h4. We need to allow a writer W writing to file groups f1, f2, f3
concurrently, while a clustering service C reclusters them into f4, f5.
h4. Goals
* Writes can be updates, deletes, or inserts.
* Either the clustering service C or the writer W can finish first.
* Both W and C need to be able to complete their actions without much
redoing of work.
* The number of output file groups for C can be higher or lower than the
number of input file groups.
* Needs to work regardless of whether the writers are operating in OCC or
NBCC mode.
* Needs to interplay well with the cleaning and compaction services.
h4. Non-goals
* Strictly preserving the sort order achieved by clustering in the face of
updates (e.g. updates that change clustering field values can leave the
output clustering file groups not fully sorted by those fields).
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1045
- Type: Task
- Epic: https://issues.apache.org/jira/browse/HUDI-6640
- Fix version(s):
- 1.2.0
---
## Comments
26/Apr/24 22:00;vinoth;At first it may seem trivial to have clustering fail
all the time, ceding preference to the incoming writes. But, aside from wasting
resources, clustering can finish before the writes, and we cannot atomically
both roll back clustering (note that restoring a completed action is
considered/recommended as offline maintenance) and finish the write.;;;
---
30/Apr/24 17:55;vinoth;h2. [WIP] Approach 1 : Redistribute records from
the conflicting file groups
Within the finalize section (done under a table-level distributed lock), we
could have either W or C perform the following.
{code:java}
W {
- identify the file groups that have been clustered concurrently by C
- read out all records written by W into these conflicting file groups
- redistribute those records based on the new record distribution produced by C
- finalize W
}
{code}
{code:java}
C {
- identify the file groups that have been written to concurrently by W
- read out all records written by W into the conflicting file groups
- redistribute those records based on the new record distribution produced by C
- finalize C
}
{code}
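The "identify the conflicting file groups" step in both sketches amounts to a
set intersection between the file groups W wrote to and the input file groups
of C's clustering plan. A minimal sketch, assuming the two sets of file group
IDs are already known (hypothetical class and method names, not Hudi APIs):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the conflict set between a writer W and a
// clustering plan C is the intersection of the file groups W touched
// and the file groups C took as input.
public class ConflictDetector {
    public static Set<String> conflictingFileGroups(Set<String> writtenByW,
                                                    Set<String> clusteredByC) {
        Set<String> conflicts = new HashSet<>(writtenByW);
        conflicts.retainAll(clusteredByC); // keep only file groups touched by both
        return conflicts;
    }
}
```

When the intersection is empty, neither side needs to redo any work, which is
why this approach is attractive when the overlap between C and W is small.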
h3. Pros:
# Simple to understand/debug; no storage format changes.
# Could work well for cases where the overlap between C and W is rather
small.
# No extra read amplification for queries; W/C absorbs the cost.
h3. Cons:
# Can be pretty wasteful with continuous writers or with high overlap between
C and W, effectively forcing the entire write to be redone (the same as the
writer failing and retrying today).
# Particularly expensive for CoW, where W has already paid the cost of
merging columnar base files with incoming records. ;;;
---
30/Apr/24 17:55;vinoth;h2. [WIP] Approach 2 : Introduce pointer data blocks
into storage format
If we truly wish to achieve independent operation of C and W, while
minimizing the amount of work redone on the writer side, we need to introduce
a notion of "pointer data blocks" (name TBD) and a new "retain" command block
in Hudi's log format.
*Pointer data blocks*
A pointer data block just keeps pointers to other blocks in different file
groups.
{code:java}
pointer data block {
[
{fg1, logfile X, ..},
{fg2, logfile Y, ..},
{fg3, logfile Z, ..}
]
}
{code}
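The layout above could be modeled as a block that carries only references,
never records. A minimal sketch with hypothetical class names (the actual
Hudi log block format would differ):

```java
import java.util.List;

// Hypothetical sketch of a "pointer data block": instead of carrying
// records, the block carries references to log files that were written
// into other file groups (e.g. f4's block pointing back at f1/f2/f3).
public class PointerDataBlock {
    // One pointer: which file group the referenced log file lives in,
    // and which log file it is.
    public record LogFilePointer(String fileGroupId, String logFileName) {}

    private final List<LogFilePointer> pointers;

    public PointerDataBlock(List<LogFilePointer> pointers) {
        this.pointers = List.copyOf(pointers); // immutable once written
    }

    public List<LogFilePointer> getPointers() {
        return pointers;
    }
}
```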
In this approach, instead of redistributing the records from file groups
f1, f2, f3 to f4, f5 - we will
* log pointer data blocks to f4, f5, pointing back to the new log files W
wrote to file groups f1, f2, f3 (X, Y, Z in the example above).
* log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3
(with logfile Z) to indicate to the cleaner service that these log files are
supposed to be retained for later, so they can be skipped.
* When f4, f5 are compacted, these log files will be reconciled into new
file slices in f4, f5.
* When the file slices in f4 and f5 are cleaned, the pointers are followed
and the retained log files in f1, f2, f3 can also be cleaned. (note: this
needs some reference counting in case f4 is cleaned while f5 is not)
Note that there is a 1:N relationship between pointer data blocks and log
files. E.g. both f4's and f5's pointer data blocks will point to all three
files X, Y, Z.
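The reference counting hinted at above could work as follows: each pointer
data block referencing a retained log file bumps its count, and cleaning a
referencing file slice decrements it; the retained log file is deletable only
at zero. A minimal sketch under that assumption (hypothetical names, not the
actual Hudi cleaner API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reference counting for retained log files:
// every pointer data block that references a log file increments its
// count; cleaning the file slice holding the pointer block decrements
// it. The retained log file is safe to delete only at count zero.
public class RetainedLogFileRefCounter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a pointer data block referencing this log file is logged.
    public void addReference(String logFile) {
        refCounts.merge(logFile, 1, Integer::sum);
    }

    // Called when a referencing file slice is cleaned; returns true when
    // the retained log file itself can now be cleaned.
    public boolean dropReference(String logFile) {
        int remaining = refCounts.merge(logFile, -1, Integer::sum);
        return remaining <= 0;
    }
}
```

This handles the 1:N case directly: since both f4 and f5 reference X, Y, Z,
each of those log files starts at count 2 and survives until both output
file groups have been cleaned.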
A snapshot read of the file groups f4, f5 needs to carefully filter records
to avoid exposing duplicate records to the reader.
* It merges the log files pointed to by pointer data blocks based on keys,
i.e. only deletes/updates records in the base file for keys that match
records in the pointed-to log files.
* Inserts need to be handled with care and need to be distinguishable from
updates; e.g. a record in the pointed-to log files for f4 can be either an
insert or an update to a record that is now clustered into f5.
* The storage format also needs to store a bitmap indicating which records
in the pointed-to log files are inserts vs. updates. Once such a list is
available, the readers of f4/f5 can split the inserts amongst themselves
using a hash mod of the insert key, e.g. f4 gets inserts 0, 2, 4, 6, ... in
log files X, Y, Z, while f5 gets inserts 1, 3, 5, 7, ... in log files X, Y, Z.
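The hash-mod split of inserts could be sketched as follows: the reader of
output file group i keeps only the inserts whose key hashes to i, so each
insert is exposed by exactly one output file group. A minimal sketch,
assuming a plain hash of the record key (the real routing would need to
match whatever key hashing Hudi settles on):

```java
// Hypothetical sketch of splitting inserts among N output file groups:
// the reader of output file group i keeps only the inserts whose record
// key hashes to i, so every insert is exposed by exactly one file group.
public class InsertSplitter {
    public static int ownerIndex(String recordKey, int numOutputFileGroups) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(recordKey.hashCode(), numOutputFileGroups);
    }

    public static boolean ownedBy(String recordKey, int readerIndex,
                                  int numOutputFileGroups) {
        return ownerIndex(recordKey, numOutputFileGroups) == readerIndex;
    }
}
```

The split is deterministic and needs no coordination: every reader applies
the same function to the same keys, so no insert is dropped or duplicated.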
h3. Pros:
* Truly non-blocking: C and W can go at their own pace and complete without
much additional overhead, since the pointer blocks are pretty light to add.
* Works even for high-throughput streaming scenarios where there is not
much time for the writer to be reconciling (e.g. Flink).
h3. Cons:
* More merge cost on the read/query side (although it can be very tolerable
for the same cases approach 1 works well for).
* More complexity and new storage format changes. ;;;
---
21/Nov/24 21:09;yihua;Related ticket: HUDI-8464.
When we support NBCC for concurrent updates and clustering (HUDI-1045), we
also need to keep this in mind for clustering, since there is a time gap
between generating the clustering request time and adding the clustering
requested file to the timeline after clustering planning, i.e., clustering
planning needs to consider the requested time of the clustering too when
generating the list of files to cluster.;;;
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]