steveloughran commented on PR #36564:
URL: https://github.com/apache/spark/pull/36564#issuecomment-2225964952

   Task commits MUST be atomic and a second attempt MUST be able to supersede the first.
   
   Hadoop FileOutputCommitter v1 (on HDFS, local FS and ABFS, but *not* GCS) uses atomic rename for this, deleting any previously committed task data at the destination first. A retry will delete that destination path and rename its own work directory to it.
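   
   To make that concrete, here is a minimal sketch of that v1-style task commit using the Hadoop FileSystem API; the class, method and path names are illustrative, not the actual FileOutputCommitter code:
   
   ```java
   // Minimal sketch of a v1-style task commit on a filesystem with atomic
   // rename (HDFS, local FS, ABFS). Class, method and path names are
   // illustrative, not the actual FileOutputCommitter code.
   import java.io.IOException;
   
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class TaskCommitSketch {
   
     public static void commitTask(Path taskAttemptDir, Path committedTaskDir)
         throws IOException {
       FileSystem fs = taskAttemptDir.getFileSystem(new Configuration());
       // A retry first deletes whatever an earlier attempt may have committed...
       if (fs.exists(committedTaskDir)) {
         fs.delete(committedTaskDir, true);
       }
       // ...then atomically renames its own work directory into place.
       if (!fs.rename(taskAttemptDir, committedTaskDir)) {
         throw new IOException("failed to rename " + taskAttemptDir
             + " to " + committedTaskDir);
       }
     }
   }
   ```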
   
   The S3A committers PUT a manifest to the task path; this relies on PUT being atomic.
   The MapReduce manifest committer writes the manifest and then renames it; single-file rename is atomic on GCS, so it is safe there (directory rename is not atomic).
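   
   A rough sketch of that write-manifest-then-rename pattern, again with illustrative names rather than the real manifest committer code:
   
   ```java
   // Rough sketch of "write the manifest, then rename it into place",
   // assuming single-file rename is atomic on the target store (as on GCS).
   // Names are illustrative, not the real manifest committer code.
   import java.io.IOException;
   
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class ManifestCommitSketch {
   
     public static void saveManifest(byte[] manifestBytes,
         Path tempManifest, Path finalManifest) throws IOException {
       FileSystem fs = finalManifest.getFileSystem(new Configuration());
       // Write the serialized manifest to a temporary path first.
       try (FSDataOutputStream out = fs.create(tempManifest, true)) {
         out.write(manifestBytes);
       }
       // The single-file rename is the atomic commit: a later attempt that
       // repeats this sequence simply replaces the earlier manifest.
       fs.delete(finalManifest, false);
       if (!fs.rename(tempManifest, finalManifest)) {
         throw new IOException("failed to rename " + tempManifest
             + " to " + finalManifest);
       }
     }
   }
   ```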
   
   The application MUST NOT constrain which of the two attempts told to commit actually succeeds, only that the second one MUST report success.
   
   Why so? Because if task attempt TA1 stops responding after being told to commit, then TA2 will be told to commit, and will report success. But if TA1 was merely suspended and performs its atomic commit after TA2, then it will be TA1's manifest which is processed.
   
   If you are seeing jobs where task failures are unrecoverable, that means there is something wrong with the task commit algorithm.
   
   What were you seeing it with?

