[
https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-19790:
---------------------------------
Labels: bulk-closed (was: )
> OutputCommitCoordinator should not allow another task to commit after an
> ExecutorFailure
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-19790
> URL: https://issues.apache.org/jira/browse/SPARK-19790
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Imran Rashid
> Priority: Major
> Labels: bulk-closed
>
> The OutputCommitCoordinator resets the allowed committer when the task fails.
>
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no
> idea what the status of the task is. The task may actually still be running,
> and may even successfully commit its output. By allowing another task to
> commit its output, there is a chance that multiple tasks commit, which can
> result in corrupt output. This would be particularly problematic when
> commit() is an expensive operation, e.g. moving files on S3.
> For other task failures, we can allow other tasks to commit. But with an
> ExecutorFailure, it's not clear what the right thing to do is. The only safe
> thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that
> PR https://github.com/apache/spark/pull/16959#discussion_r103549134
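The hazard described above can be sketched with a small, hypothetical model of commit coordination (this is not Spark's actual OutputCommitCoordinator API; the class and method names here are illustrative only). The key point is that an ordinary task failure can safely free the commit slot, but an executor loss leaves the authorized attempt's status unknown, so re-authorizing another attempt risks a double commit:

```scala
// Hypothetical, simplified model of per-partition commit authorization.
// Not Spark's real API: names and structure are assumptions for illustration.

sealed trait TaskEndReason
case object OrdinaryFailure extends TaskEndReason // task definitely dead
case object ExecutorLost    extends TaskEndReason // task status unknown

class CommitCoordinator {
  // partition -> attempt currently authorized to commit (absent = slot free)
  private var authorized = Map.empty[Int, Int]
  private var jobFailed = false

  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    if (jobFailed) false
    else authorized.get(partition) match {
      case None             => authorized += partition -> attempt; true
      case Some(`attempt`)  => true  // same attempt asking again
      case Some(_)          => false // another attempt already holds the slot
    }
  }

  def taskFailed(partition: Int, attempt: Int, reason: TaskEndReason): Unit =
    synchronized {
      if (authorized.get(partition).contains(attempt)) reason match {
        case OrdinaryFailure =>
          // The task is known to be dead, so it can never commit;
          // freeing the slot for another attempt is safe.
          authorized -= partition
        case ExecutorLost =>
          // The authorized attempt may still be running and may still
          // commit its output. Re-authorizing a second attempt could
          // produce two commits, so the only safe option modeled here
          // is to fail the job, as the description above suggests.
          jobFailed = true
      }
    }
}
```

Under this model, `canCommit(0, 2)` returns false after attempt 1 on partition 0 was authorized and then lost with its executor, whereas an `OrdinaryFailure` of attempt 1 would have let attempt 2 commit.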
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]