hudi-bot opened a new issue, #15017:
URL: https://github.com/apache/hudi/issues/15017

   Currently, the RDD `collect()` operation is treated quite loosely: there are 
multiple flows that dereference RDDs (for example, through `collect`, 
`count`, etc.), triggering the same operations to be carried out multiple 
times and occasionally duplicating output already persisted on the FS.
   
   Check out HUDI-3370 for a recent example.
   
   NOTE: Even though Spark caching is supposed to make sure that we aren't 
writing to the FS multiple times, we can't rely on caching alone to guarantee 
exactly-once execution.
   
   Instead, we should make sure that RDDs are dereferenced only **once**, 
within the "commit" operation, and that all other operations rely only on 
_derivative_ data.
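   The hazard described above can be sketched outside of Spark with any lazily 
evaluated pipeline: every terminal operation replays the whole chain, 
including any side effects embedded in it. Below is a minimal Python analogy 
(not Hudi code; the `write_to_fs` side effect is hypothetical, standing in 
for a write already persisted to the FS):

   ```python
   # Minimal analogy for lazy RDD re-execution: each terminal
   # operation ("dereference") replays the whole pipeline,
   # including any side effects embedded in it.

   writes = []  # records every simulated FS write

   def write_to_fs(record):
       # Hypothetical side effect standing in for a write to the FS.
       writes.append(record)
       return record

   def lazy_pipeline(data):
       # A generator is lazy, like an uncached RDD: nothing runs
       # until a terminal operation consumes it.
       return (write_to_fs(x) for x in data)

   data = [1, 2, 3]

   # Two dereferences -> the side effect runs twice per record.
   count = sum(1 for _ in lazy_pipeline(data))   # first dereference ("count")
   result = list(lazy_pipeline(data))            # second dereference ("collect")
   assert len(writes) == 6  # duplicated writes: 3 records x 2 dereferences

   # The fix proposed above: dereference once, then have all other
   # operations rely on the materialized ("derivative") data.
   writes.clear()
   materialized = list(lazy_pipeline(data))      # single dereference ("commit")
   count2 = len(materialized)                    # derived, no re-execution
   assert len(writes) == 3
   ```

   The same reasoning applies to an uncached (or evicted-from-cache) RDD: 
each action re-runs the lineage, so only the single-dereference pattern 
guarantees the side effects happen exactly once.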
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3397
   - Type: Bug
   - Epic: https://issues.apache.org/jira/browse/HUDI-3249


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
