[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

Kay Ousterhout (JIRA) Thu, 23 Feb 2017 13:55:42 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881358#comment-15881358
 ]


Kay Ousterhout commented on SPARK-19698:
----------------------------------------

I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19698
>                 URL: https://issues.apache.org/jira/browse/SPARK-19698
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

Reply via email to