hudi-bot opened a new issue, #15017:
URL: https://github.com/apache/hudi/issues/15017

   Currently, the RDD `collect()` operation is treated quite loosely: there are 
multiple flows that dereference RDDs (for example, through `collect`, 
`count`, etc.), triggering the same operations to be carried out multiple 
times and occasionally duplicating output already persisted on the FS.
   
   Check out HUDI-3370 for a recent example.
   
   NOTE: Even though Spark caching is supposed to make sure that we aren't 
writing to the FS multiple times, we can't rely on caching alone to guarantee 
exactly-once execution.
   
   Instead, we should make sure that RDDs are dereferenced only **once**, 
within the "commit" operation, and that all other operations rely only on 
_derivative_ data.
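   The hazard described above can be sketched outside of Spark with any lazily 
evaluated pipeline: every terminal operation replays the whole chain, 
including any side effects embedded in it. Below is a minimal Python analogy 
(not Hudi code; the `write_to_fs` side effect is hypothetical, standing in 
for a write already persisted to the FS):

   ```python
   # Minimal analogy for lazy RDD re-execution: each terminal
   # operation ("dereference") replays the whole pipeline,
   # including any side effects embedded in it.

   writes = []  # records every simulated FS write

   def write_to_fs(record):
       # Hypothetical side effect standing in for a write to the FS.
       writes.append(record)
       return record

   def lazy_pipeline(data):
       # A generator is lazy, like an uncached RDD: nothing runs
       # until a terminal operation consumes it.
       return (write_to_fs(x) for x in data)

   data = [1, 2, 3]

   # Two dereferences -> the side effect runs twice per record.
   count = sum(1 for _ in lazy_pipeline(data))   # first dereference ("count")
   result = list(lazy_pipeline(data))            # second dereference ("collect")
   assert len(writes) == 6  # duplicated writes: 3 records x 2 dereferences

   # The fix proposed above: dereference once, then have all other
   # operations rely on the materialized ("derivative") data.
   writes.clear()
   materialized = list(lazy_pipeline(data))      # single dereference ("commit")
   count2 = len(materialized)                    # derived, no re-execution
   assert len(writes) == 3
   ```

   The same reasoning applies to an uncached (or evicted-from-cache) RDD: 
each action re-runs the lineage, so only the single-dereference pattern 
guarantees the side effects happen exactly once.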
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3397
   - Type: Bug
   - Epic: https://issues.apache.org/jira/browse/HUDI-3249


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
