[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603913#comment-14603913 ]
Matt Cheah commented on SPARK-8167:
-----------------------------------

One thought is to add YARN-specific logic so that, whenever a task fails with an executor-lost failure, the scheduler asks the YarnAllocator (in the ApplicationMaster) whether the executor that was just lost had been preempted. There might be some nasty race conditions here, though, and it would require invoking a blocking RPC call inside TaskSetManager.executorLost (or something similar), which runs on the message loop of an RpcEndpoint. Invoking a blocking RPC call in the message loop is probably not desirable. (A rough sketch of a non-blocking variant appears below the quoted issue.)

> Tasks that fail due to YARN preemption can cause job failure
> ------------------------------------------------------------
>
>                 Key: SPARK-8167
>                 URL: https://issues.apache.org/jira/browse/SPARK-8167
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, YARN
>    Affects Versions: 1.3.1
>            Reporter: Patrick Woody
>            Assignee: Matt Cheah
>            Priority: Blocker
>
> Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring and the tasks get rescheduled onto executors that are immediately preempted as well.
> The current workaround is to increase spark.task.maxFailures very high, but that can delay detecting true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit.
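To make the comment's idea concrete, here is a minimal Scala sketch. Everything in it is hypothetical, not Spark's actual API: PreemptionTracker, recordLoss, handleExecutorLost, and countTowardMaxFailures are invented names, and the sketch swaps the blocking RPC discussed above for an asynchronous query, since blocking on the RpcEndpoint message loop is exactly what the comment wants to avoid.

{code}
import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical sketch only -- none of these names are Spark's real API.
// Idea: the YARN side records which executors were lost to preemption, so
// the scheduler can find out instead of guessing from the failure type.
object PreemptionTracker {
  private val preempted = mutable.Set.empty[String] // preempted executor IDs

  // Would be invoked from the AM's container-completed callback when YARN
  // reports the container exited due to preemption (simplified to a flag).
  def recordLoss(executorId: String, wasPreempted: Boolean): Unit =
    synchronized { if (wasPreempted) preempted += executorId }

  def wasPreempted(executorId: String): Boolean =
    synchronized(preempted.contains(executorId))
}

object SchedulerSide {
  // Rather than a blocking RPC inside TaskSetManager.executorLost (which
  // runs on the RpcEndpoint message loop), query asynchronously and decide
  // whether the failure counts toward spark.task.maxFailures once the
  // answer arrives.
  def handleExecutorLost(executorId: String)
                        (countTowardMaxFailures: Boolean => Unit)
                        (implicit ec: ExecutionContext): Unit =
    Future(PreemptionTracker.wasPreempted(executorId)).foreach { preempted =>
      countTowardMaxFailures(!preempted) // preemption shouldn't count
    }
}
{code}

Note that this does not remove the race the comment worries about: the executor-lost event can reach the scheduler before the AM has processed YARN's preemption report, in which case the query would answer "not preempted" for an executor that actually was.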