[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465 ]
Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---------------------------------------------------------------

h2. [WIP] Approach 1 : Redistribute records from the conflicting file groups

Within the finalize section (done within a table level distributed lock), we could have either W or C perform the following.

{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - read out all records written by W into these conflicting file groups
  - redistribute those records according to the new record distribution produced by C
  - finalize W
}
{code}

{code:java}
C {
  - identify the file groups that have been written to concurrently by W
  - read out all records written by such W into the conflicting file groups
  - redistribute those records according to the new record distribution produced by C
  - finalize C
}
{code}

h3. Pros:
# Simple to understand/debug, no storage format changes.
# Could work well for cases where the overlap between C and W is rather small.
# No extra read amplification for queries; W/C absorbs the cost.

h3. Cons:
# Can be pretty wasteful with continuous writers or with high overlap between C and W, effectively forcing the entire write to be redone (same as the writer failing and retrying, as it does today).
# Particularly expensive for CoW, where W has already paid the cost of merging columnar base files with incoming records.

was (Author: vc):

h2. [WIP] Approach 1 : Redistribute records from the conflicting file groups

Within the finalize section (done within a table level distributed lock), we could either have W or C perform the following.

{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
}
{code}

{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}

h3. Pros:
# Simple to understand/debug, no storage format changes.
# Could work well for cases where
# Absorbs any read amplification.

h3. Cons:
# Sort order may be disturbed by the redistribution of keys.
# Can be pretty wasteful, if

> Support updates during clustering
> ---------------------------------
>
>                 Key: HUDI-1045
>                 URL: https://issues.apache.org/jira/browse/HUDI-1045
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: clustering, table-service
>            Reporter: leesf
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> We need to allow a writer W to write to file groups f1, f2, f3 concurrently
> while a clustering service C reclusters them into f4, f5.
> * Writes can be either updates, deletes or inserts.
> * Either clustering C or the writer W can finish first.
> * Both W and C need to be able to complete their actions without much
> redoing of work.
> * The number of output file groups for C can be higher or lower than the number
> of input file groups.
> * Needs to work regardless of whether the writers are operating
> in OCC or NBCC modes.
> * Needs to interplay well with cleaning and compaction services.
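
For illustration only, the redistribution step of Approach 1 could be sketched in plain Java as below. This is a minimal sketch under stated assumptions, not Hudi code: the names {{RedistributeSketch}}, {{WrittenRecord}}, {{redistribute}} and {{layoutFromC}} are hypothetical, records are reduced to their keys, and the table level distributed lock is not modeled (the method is assumed to run while it is held). The example layout, which sends keys below "n" to f4 and all others to f5, is likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of Approach 1; none of these names come from Hudi.
final class RedistributeSketch {

    // A record written by W, reduced to its key and the file group W wrote it to.
    static final class WrittenRecord {
        final String key;
        final String fileGroupId;

        WrittenRecord(String key, String fileGroupId) {
            this.key = key;
            this.fileGroupId = fileGroupId;
        }
    }

    /**
     * Assumed to run inside the finalize section, with the table level lock held.
     *
     * @param writtenByW   records W wrote, tagged with their original file group
     * @param clusteredByC file groups that C replaced concurrently
     * @param layoutFromC  assumed mapping from record key to C's new file group
     * @return record keys grouped by the file group they should end up in
     */
    static Map<String, List<String>> redistribute(
            List<WrittenRecord> writtenByW,
            Set<String> clusteredByC,
            Function<String, String> layoutFromC) {
        Map<String, List<String>> keysByFileGroup = new TreeMap<>();
        for (WrittenRecord r : writtenByW) {
            // Only records in conflicting file groups are moved; the rest stay put.
            String target = clusteredByC.contains(r.fileGroupId)
                    ? layoutFromC.apply(r.key)
                    : r.fileGroupId;
            keysByFileGroup.computeIfAbsent(target, k -> new ArrayList<>()).add(r.key);
        }
        return keysByFileGroup;
    }

    public static void main(String[] args) {
        // C reclustered f1, f2, f3 into f4, f5 (made-up layout: keys < "n" go to f4).
        Set<String> clusteredByC = Set.of("f1", "f2", "f3");
        Function<String, String> layoutFromC =
                key -> key.compareTo("n") < 0 ? "f4" : "f5";

        List<WrittenRecord> writtenByW = List.of(
                new WrittenRecord("a", "f1"),
                new WrittenRecord("m", "f2"),
                new WrittenRecord("q", "f3"),
                new WrittenRecord("z", "f9")); // f9 was not clustered by C

        // Prints {f4=[a, m], f5=[q], f9=[z]}
        System.out.println(redistribute(writtenByW, clusteredByC, layoutFromC));
    }
}
```

The sketch also makes the main con visible: every record W wrote into a conflicting file group is read and placed again, so with high overlap between C and W nearly the whole write is repeated.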