In https://issues.apache.org/jira/browse/SPARK-39195,
OutputCommitCoordinator was modified to fail a stage if an authorized
committer task fails.

We run our Spark jobs on a k8s cluster managed by Karpenter and mostly
built from spot instances. As a result, our executors are frequently
killed. With the above change, that leads to expensive stage failures at
the final write stage.

I think I understand why the above is needed when using
FileOutputCommitter, but it seems like we can handle things like the magic
s3a committer differently. For those, we could instead abort the task
attempt, which will discard the data files that are awaiting the final PUT
operation and remove them from the list of files to be completed during
the job commit phase.

Does this seem reasonable? I think the change could go in
OutputCommitCoordinator (as a case in the taskCompleted block), but there
are other options as well.
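
To make the idea concrete, here is a rough toy model of the decision I have
in mind. It is plain Python, not Spark code: ToyCoordinator, Outcome, and the
committer_supports_task_abort flag are all hypothetical names, and how Spark
would actually learn that a committer can abort safely is left open.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, Optional


class Outcome(Enum):
    FAIL_STAGE = auto()         # what happens today, per SPARK-39195
    RELEASE_AND_RETRY = auto()  # proposed for committers that can abort a task


@dataclass
class ToyCoordinator:
    # Hypothetical flag: True for committers such as the magic s3a committer,
    # False for FileOutputCommitter.
    committer_supports_task_abort: bool
    # partition id -> attempt number currently authorized to commit
    authorized: Dict[int, int] = field(default_factory=dict)

    def can_commit(self, partition: int, attempt: int) -> bool:
        """Authorize the first attempt that asks; deny any other attempt."""
        holder = self.authorized.setdefault(partition, attempt)
        return holder == attempt

    def task_failed(self, partition: int, attempt: int) -> Optional[Outcome]:
        """Decide what to do when a task attempt fails."""
        if self.authorized.get(partition) != attempt:
            return None  # not the authorized committer, nothing special to do
        if self.committer_supports_task_abort:
            # Proposed: the task abort discards the pending uploads, so the
            # authorization can be released and a retry allowed to commit.
            del self.authorized[partition]
            return Outcome.RELEASE_AND_RETRY
        return Outcome.FAIL_STAGE
```

With the flag set, a failed authorized attempt releases the partition so a
later attempt can be authorized; without it, the model keeps the current
fail-the-stage behavior.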

Any other ideas on how to stop individual failures of authorized committer
tasks from failing the whole job?
