Jungtaek Lim created SPARK-27210:
------------------------------------

             Summary: Cleanup incomplete output files in 
ManifestFileCommitProtocol if task is aborted
                 Key: SPARK-27210
                 URL: https://issues.apache.org/jira/browse/SPARK-27210
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


Unlike HadoopMapReduceCommitProtocol, ManifestFileCommitProtocol doesn't clean 
up incomplete output files in either case: when a task is aborted or when a 
job is aborted.

HadoopMapReduceCommitProtocol writes intermediate files to a staging 
directory, so once a job is aborted it can simply delete the staging directory 
to clean up everything. HadoopMapReduceCommitProtocol even puts additional 
effort into cleaning up intermediate files on the task side when a task is 
aborted.
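
To illustrate the staging-directory pattern described above, here is a minimal 
sketch in Python. It is not Spark code: the function name, the "_staging" 
directory name and the fail_on parameter are all hypothetical and exist only 
to show why deleting one directory is enough to clean up an aborted job.

```python
import shutil
from pathlib import Path


def run_job_with_staging(output_dir: Path, tasks, fail_on=None):
    """Write every task file under a staging directory, then publish.

    `tasks` is a list of (file_name, contents) pairs. `fail_on` is a
    hypothetical knob that simulates a task failure for the given file.
    """
    staging = output_dir / "_staging"  # hypothetical directory name
    staging.mkdir(parents=True, exist_ok=True)
    try:
        for name, contents in tasks:
            if name == fail_on:
                raise RuntimeError(f"simulated failure while writing {name}")
            (staging / name).write_text(contents)
    except Exception:
        # Job abort: everything written so far lives under `staging`,
        # so one recursive delete removes all incomplete output.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # Job commit: move the finished files into the final location.
    for path in list(staging.iterdir()):
        path.rename(output_dir / path.name)
    staging.rmdir()
```

Because nothing is visible in the output directory until the final rename 
step, an aborted job leaves no partial files behind in this sketch.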

ManifestFileCommitProtocol does no cleanup at all; it only maintains metadata 
listing the complete output files that were written. It would be better if 
ManifestFileCommitProtocol made a best effort to clean up. It's unclear 
whether it can do job-level cleanup, since it doesn't use a staging directory, 
but it can clearly still make a best effort at task-level cleanup.
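
One possible shape of such task-level cleanup is sketched below in Python. 
This is only an illustration under stated assumptions, not the actual 
ManifestFileCommitProtocol API: the class name, its methods and the 
per-task bookkeeping are hypothetical. The idea is to remember every file a 
task opens, record the files in the manifest on commit, and delete them on 
abort while ignoring deletion errors.

```python
import os


class ManifestCommitSketch:
    """Hypothetical manifest-style committer with best-effort task cleanup."""

    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.pending = {}   # task_id -> paths handed out to that task
        self.manifest = []  # paths of files from committed tasks

    def new_task_file(self, task_id, name):
        # Files are written directly to the final location (no staging
        # directory), so the path must be tracked for a later abort.
        path = os.path.join(self.output_dir, name)
        self.pending.setdefault(task_id, []).append(path)
        return path

    def commit_task(self, task_id):
        # On success, the task's files become part of the manifest.
        self.manifest.extend(self.pending.pop(task_id, []))

    def abort_task(self, task_id):
        # On abort, try to delete whatever the task wrote. Failures are
        # swallowed because this is best effort only.
        for path in self.pending.pop(task_id, []):
            try:
                os.remove(path)
            except OSError:
                pass
```

In this sketch a file left behind by a failed delete is still harmless to 
readers that consult the manifest, since only committed files are listed 
there; the cleanup only reduces leftover garbage in the output directory.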


