[ https://issues.apache.org/jira/browse/HUDI-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3397:
--------------------------------------
    Fix Version/s: 0.12.1
                       (was: 0.13.0)

> Make sure Spark RDDs triggering actual FS activity are only dereferenced once
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-3397
>                 URL: https://issues.apache.org/jira/browse/HUDI-3397
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: spark
>             Fix For: 0.12.1
>
>
> Currently, the RDD `collect()` operation is treated quite loosely: multiple
> flows dereference RDDs (for example, through `collect`, `count`, etc.),
> triggering the same operations to be carried out multiple times and
> occasionally duplicating output already persisted on the FS.
> See HUDI-3370 for a recent example.
> NOTE: Even though Spark caching is supposed to ensure that we aren't
> writing to the FS multiple times, we can't rely on caching alone to
> guarantee exactly-once execution.
> Instead, we should make sure that RDDs are dereferenced only *once*, within
> the "commit" operation, with all other operations relying solely on
> _derivative_ data.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
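The pattern the issue describes can be illustrated with a minimal, Spark-free sketch (all names here are hypothetical, standing in for an RDD whose action triggers FS writes): each dereference of an uncached lazy pipeline re-runs its lineage, including the side-effecting write, whereas dereferencing once and deriving counts/stats from the materialized result avoids the duplication.

```python
class LazyWriteRDD:
    """Hypothetical stand-in for an uncached RDD whose action writes to the FS."""

    def __init__(self, records):
        self._records = records
        self.fs_writes = 0  # counts how many times the "FS activity" ran

    def _execute(self):
        # Each action re-runs the whole lineage, including the write.
        self.fs_writes += 1
        return list(self._records)

    # Anti-pattern: every action is a separate dereference.
    def collect(self):
        return self._execute()

    def count(self):
        return len(self._execute())


# Anti-pattern: two actions on the same uncached pipeline => two FS writes.
rdd = LazyWriteRDD(["a", "b", "c"])
rdd.collect()
rdd.count()
assert rdd.fs_writes == 2  # duplicated FS activity

# Fix proposed in the issue: dereference exactly once, inside the "commit"
# path, then derive everything else from that single materialized result.
rdd2 = LazyWriteRDD(["a", "b", "c"])
write_statuses = rdd2.collect()    # the single dereference
num_records = len(write_statuses)  # derivative data, no re-execution
assert rdd2.fs_writes == 1
```

In the real writer-core code path this corresponds to collecting the write-status RDD once during commit and passing the materialized statuses to any consumer that previously re-triggered an action.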