Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112

> The FileCommitProtocol is an internal API, and our current implementation does store task-level data temporarily in a staging directory (see HadoopMapReduceCommitProtocol). That said, we can fix the FileCommitProtocol to be able to roll back a committed task, as long as the job is not committed.
>
> As an example, we can change the semantics of FileCommitProtocol.commitTask and say that it may be called multiple times for the same task, with the later call winning. And change HadoopMapReduceCommitProtocol to support it.

Hmm, I wasn't aware that option was added. It looks like it only applies if you use dynamicPartitionOverwrite, which seems to be for very specific output committers for SQL. I don't see how that works with all output committers: I can write an output committer that never writes anything to disk and instead writes to a DB, HBase, or any custom sink. For some of these, staging-directory moves don't make sense.

Note that staging is also a performance impact; that is why the MapReduce output committers stopped doing this, see the v2 algorithm set via mapreduce.fileoutputcommitter.algorithm.version (https://issues.apache.org/jira/browse/SPARK-20107). I realize performance might need to be impacted in certain cases, and I wanted to point this out at least, but my main concern is really the statement above: I don't see how this works for all output committers.
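To illustrate the concern, here is a minimal sketch (not Spark's actual FileCommitProtocol API; the trait and class names are hypothetical) contrasting a staging-style committer, where a repeated commitTask for the same task can simply overwrite the earlier attempt, with an external-sink committer (e.g. writing to a DB or HBase), where the side effect of the first commit is already visible and cannot be rolled back or overwritten by a later attempt:

```scala
// Hypothetical committer interface, loosely modeled on the idea of
// FileCommitProtocol.commitTask; names here are illustrative only.
trait TaskCommitter {
  def commitTask(taskId: Int, data: String): Unit
  def visibleRows: Seq[(Int, String)]
}

// Staging-style committer: output is held per task until job commit,
// so a later commitTask for the same task id can replace the earlier one
// ("later attempt wins").
class StagingCommitter extends TaskCommitter {
  private var staged = Map.empty[Int, String]
  def commitTask(taskId: Int, data: String): Unit =
    staged += taskId -> data                 // overwrite is trivial
  def visibleRows: Seq[(Int, String)] = staged.toSeq
}

// External-sink committer: commitTask pushes rows straight to an external
// system (stand-in for a DB/HBase write). There is no staged copy to move
// or roll back, so re-committing the same task duplicates the output.
class ExternalSinkCommitter extends TaskCommitter {
  private val sink = scala.collection.mutable.ListBuffer.empty[(Int, String)]
  def commitTask(taskId: Int, data: String): Unit =
    sink += taskId -> data                   // side effect already visible
  def visibleRows: Seq[(Int, String)] = sink.toList
}

object CommitDemo {
  def main(args: Array[String]): Unit = {
    val staging = new StagingCommitter
    staging.commitTask(1, "attempt-1")
    staging.commitTask(1, "attempt-2")
    println(staging.visibleRows)             // only the later attempt survives

    val db = new ExternalSinkCommitter
    db.commitTask(1, "attempt-1")
    db.commitTask(1, "attempt-2")
    println(db.visibleRows)                  // both attempts are visible
  }
}
```

The staging committer ends up with a single row per task, while the external-sink committer retains both attempts, which is the scenario where "call commitTask again and let the later win" breaks down.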