hudi-bot opened a new issue, #16643: URL: https://github.com/apache/hudi/issues/16643
As of now, when an ingestion writer conflicts with a pending clustering, the ingestion write is aborted while the clustering eventually succeeds. But in many cases the ingestion writer has stronger SLAs and may want to make progress over clustering, since clustering is considered more of an optimization. So it would be good to add this support to Hudi.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8291
- Type: Improvement
- Fix version(s): 1.1.0

---

## Comments

02/Oct/24 22:53, shivnarayan:

h2. Design

h3. Design considerations

We have two approaches to take: either we introduce a new state to the timeline, or we avoid making any changes to the timeline but take a hit on ensuring that all readers of pending clustering plans account for concurrent deletion of the plan, or deserialize it eagerly within a lock, which might bring in some complexity. We dive into the specifics below.

h3. Design Option 1: Introducing an "abort" state to actions in the timeline

As of now, an action typically starts in the "requested" state, then moves to "inflight", and eventually goes to "complete". If not, a rollback is triggered, in which the commit files are deleted from the timeline. We are introducing a new state called "abort". With this, an action can move through either:

"requested" ➝ "inflight" ➝ "complete"
"requested" ➝ "inflight" ➝ "abort"

"Abort" means that the action was aborted or cancelled. Cancellation requests for cancellable clustering plans will be added by any ingestion writer to the ".hoodie/cancel_requests/" folder within ".hoodie". Format: {clustering_instant}_{writer_commit_time}.cancel

h3. Writer

Whenever the writer detects a conflict with a clustering and wants to proceed ahead, it will check whether there are any pending requests for cancellation.
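As a rough illustration of this writer-side check, here is a minimal sketch. This is not Hudi code: the class and enum names are hypothetical, and in-memory sets stand in for the cancel-request folder and the timeline.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the ingestion writer's cancellation request,
// guarded by a lock as described in the design notes above.
public class CancellationRequester {
  public enum Outcome { REQUESTED_CANCEL, ALREADY_REQUESTED, CLUSTERING_COMPLETE }

  private final ReentrantLock lock = new ReentrantLock();
  // Stand-in for the contents of the cancel-requests folder under ".hoodie".
  private final Set<String> cancelRequests = new HashSet<>();
  // Stand-in for completed clustering instants on the timeline.
  private final Set<String> completedClustering = new HashSet<>();

  // Called by an ingestion writer that conflicts with clusteringInstant
  // and wants to proceed ahead of it.
  public Outcome requestCancellation(String clusteringInstant, String writerCommitTime) {
    lock.lock();
    try {
      if (completedClustering.contains(clusteringInstant)) {
        return Outcome.CLUSTERING_COMPLETE; // too late: clustering already finished
      }
      boolean alreadyRequested = cancelRequests.stream()
          .anyMatch(r -> r.startsWith(clusteringInstant + "_"));
      if (alreadyRequested) {
        return Outcome.ALREADY_REQUESTED; // another writer already asked; safe to proceed
      }
      // File-name format from the design: {clustering_instant}_{writer_commit_time}.cancel
      cancelRequests.add(clusteringInstant + "_" + writerCommitTime + ".cancel");
      return Outcome.REQUESTED_CANCEL;
    } finally {
      lock.unlock();
    }
  }

  public void markClusteringComplete(String clusteringInstant) {
    completedClustering.add(clusteringInstant);
  }
}
```

A real implementation would list files under the cancel-requests folder and refresh the timeline inside the table lock rather than consult in-memory sets.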
{code:java}
acquire lock
if there are no pending cancellation requests and clustering is not yet complete:
    add a cancellation request to ".hoodie/clustering_cancel_requests/"
else if there is already a pending cancellation request:
    return
else if clustering is already complete:
    return
release lock
{code}

In all cases, the writer assumes that it can proceed without any hiccups. Todo: if clustering completes, should the writer bail out?

Eventually, during the conflict resolution step:

{code:java}
if there is a conflict with an ongoing clustering:
    if there is a cancellation request, continue and complete the commit
    else if the clustering moved to the abort state, continue and complete the commit
    else if the clustering is completed, or in progress without any request for
        cancellation, the current writer should abort itself
{code}

h4. Clustering execution runnable

Just before wrapping up, the clustering does the following (we could optimize this down the line to do the check eagerly at different points in time; for now, let's say we do it at the end):

{code:java}
take a lock
check for any cancellation request
if yes, roll back the clustering and move its state to abort
else, continue as usual and do conflict resolution as usual
{code}

The conflict resolution step should not change for the clustering writer, because if the ingestion writer does not want to cancel the clustering, there could otherwise be data loss.

h4. Consideration

Instead of adding the cancel requests to an ad hoc folder in ".hoodie", we could also introduce a new action sequence in the timeline:

t5.rc.requested
t5.rc.inflight
t5.rc.abort.requested
t5.rc.abort

Up until now, for the past 8+ years, we were able to confine ourselves to just 3 states (requested, inflight, complete), whereas this introduces a 4th state. So we need to think hard about whether we want to go this route. The above is just an alternative way of conveying the cancellation request to the clustering worker from another ingestion writer; irrespective of it, the final abort state is required.
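To make the proposed lifecycle concrete, here is a minimal sketch of the four-state transition model described above. It is illustrative only; `InstantLifecycle` is a hypothetical name, not a Hudi class.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed instant lifecycle, adding ABORTED
// as a second terminal state alongside COMPLETED.
public class InstantLifecycle {
  public enum State { REQUESTED, INFLIGHT, COMPLETED, ABORTED }

  // Legal transitions: requested -> inflight -> (completed | aborted).
  private static final Map<State, Set<State>> LEGAL = Map.of(
      State.REQUESTED, EnumSet.of(State.INFLIGHT),
      State.INFLIGHT, EnumSet.of(State.COMPLETED, State.ABORTED),
      State.COMPLETED, EnumSet.noneOf(State.class),
      State.ABORTED, EnumSet.noneOf(State.class));

  public static boolean canTransition(State from, State to) {
    return LEGAL.get(from).contains(to);
  }
}
```

Whether the abort transition also needs an explicit intermediate step, as in the t5.rc.abort.requested alternative above, is left open by the design.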
h4. Fixes to other entities accessing pending clustering

Let's try to understand the usages of pending clustering:

1. Delete partition: when planning a delete partition, ignores file groups to be replaced that already have a pending clustering.
2. Future clustering planning will ignore file groups from pending clustering plans.
2.a. Consistent bucket index: will ignore file groups that are part of an already pending clustering.
3. Future compaction planning will ignore file groups from pending clustering plans.
4. CommitActionExecutor: updateStrategy.handleUpdate() resolves conflicts with pending clustering for incoming writes. We should be able to deprecate this once we have concurrent update support with pending clustering.
5. Upsert partitioner: to avoid directing small-file handling to file groups from pending clustering.
6. Presumably, with the consistent bucket index, incoming records are added to both the replaced and the new file groups of a pending clustering, so the pending clustering plan is required there.

In most cases, callers want to ignore file groups from a pending clustering, so if the plan is eventually deleted, it should not be an issue. Delete partition and the consistent bucket index might need some fixes to redo or re-plan if they detect that a pending clustering was eventually deleted or is up for cancellation.

h4. Example illustration of the above design

Let's walk through a simple scenario and see how it might pan out:

t5.rc.requested
t5.rc.inflight
crashed; on restart, we detect that there is a cancellation request and hence trigger a rollback:
t6.rb.requested
t6.rb.inflight
t6.rb.completed

Note: t6 fully rolls back t5's previous attempt, and then proceeds to move the action to abort:

t5.rc.requested
t5.rc.inflight (empty file)
t5.rc.aborted (empty)

h2. Design choice 2

All we are trying to achieve with approach 1 is to 1) inform the clustering worker that a particular clustering has to be cancelled.
And 2) the abort state is introduced so that no other concurrent reader runs into a deserialization issue; without that concern, we could simply delete the clustering plan. In this design, we take a stab at avoiding both.

1) should be doable by adding some additional metadata to commits. Every writer is given a priority; for simplicity, say the ingestion writer gets priority 1 and clustering gets priority 2. The ingestion writer will not worry about any conflicts with a pending clustering and will proceed to complete its commits. The clustering worker, on the other hand, will find all conflicting commits during the conflict resolution step and look for higher-priority writers (via commit metadata); based on that, it will cancel itself.

2) is what might be tricky. Before we go further, let's try to understand the usages of pending clustering; please refer to the section above where I have compiled all users/callers of pending clustering. It is going to be tough to fix all callers so that they account for the scenario where a pending clustering plan could be deleted or revoked eventually. At a bare minimum, we should account for the scenario where the timeline contains a pending replace commit, but by the time a caller is about to deserialize the replace commit metadata, it has been deleted or rolled back by the concurrent clustering worker. If we really want to achieve this, every caller needs to take a lock, refresh the timeline, eagerly deserialize the replace commit metadata, and then release the lock, so that while the caller holds the lock, deserializing the replace commit cannot fail. But after releasing the lock, a concurrent clustering worker could still delete the replace commit, and the caller would have to account for that.

h2. Conclusion

I feel approach 1 is elegant and does not involve a lot of complexity, but it does introduce a new state to the timeline. Will brainstorm with the team further.

---

08/Oct/24 01:03, shivnarayan:

Here is the RFC with
the first approach: https://github.com/apache/hudi/pull/11555/files?short_path=0de0b99#diff-0de0b9940a382f28dbf83c9007047ea84e0a61c553dfb6b55b279a327cc7f159
