In https://issues.apache.org/jira/browse/SPARK-39195, OutputCommitCoordinator was modified to fail a stage if an authorized committer task fails.
We run our Spark jobs on a k8s cluster managed by Karpenter and built mostly from spot instances, so our executors are frequently killed. With the above change, that leads to expensive stage failures at the final write stage.

I think I understand why the change is needed when using FileOutputCommitter, but it seems like committers such as the magic S3A committer could be handled differently. For those, we could instead abort the failed task attempt, which would discard the data files that are awaiting their final PUT operation and remove them from the list of files to be completed during the job commit phase.

Does this seem reasonable? I think the change could go in OutputCommitCoordinator (as a case in the taskCompleted block), but there are other options as well. Any other ideas on how to stop individual failures of authorized committer tasks from failing the whole job?
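For concreteness, here is a minimal, self-contained sketch of the decision I have in mind. It is modeled on my reading of OutputCommitCoordinator.taskCompleted, but StageState, canAbortTaskOutput, abortTaskAttempt, and failStage below are stand-in names for illustration, not the actual Spark internals:

    import scala.collection.mutable

    // Toy model of the proposed change; names are placeholders, not Spark's real ones.
    case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)

    class StageState(numPartitions: Int) {
      // Which task attempt currently holds the commit authorization for each partition.
      val authorizedCommitters = new Array[TaskIdentifier](numPartitions)
      val failures = mutable.Map[Int, mutable.Set[TaskIdentifier]]()
    }

    class CoordinatorSketch(
        canAbortTaskOutput: Boolean,                     // e.g. true for the magic S3A committer
        abortTaskAttempt: (Int, TaskIdentifier) => Unit, // discards the attempt's pending uploads
        failStage: String => Unit) {

      // Roughly the "task ended unsuccessfully" branch of taskCompleted.
      def onTaskFailed(state: StageState, partition: Int, attempt: TaskIdentifier): Unit = {
        state.failures.getOrElseUpdate(partition, mutable.Set()) += attempt
        if (state.authorizedCommitters(partition) == attempt) {
          if (canAbortTaskOutput) {
            // Proposed path: abort only this attempt so its staged data never reaches
            // the job-commit manifest, then clear the authorization so a retry can commit.
            abortTaskAttempt(partition, attempt)
            state.authorizedCommitters(partition) = null
          } else {
            // Current SPARK-39195 behavior: we cannot prove the attempt's output is gone,
            // so the whole stage is failed and retried.
            failStage(s"Authorized committer for partition $partition failed")
          }
        }
      }
    }

The abort path only makes sense for committers whose task output is withheld until job commit (for the magic committer, as uncompleted multipart uploads), so aborting the attempt should be enough to guarantee none of its data leaks into the final output.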