[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647313#comment-14647313 ]

Jeff Zhang commented on SPARK-8167:
-----------------------------------

[~mcheah] What's the status of this ticket? I don't think a blocking RPC call 
is a good idea. We could instead send an executor-preempted message to the 
driver when a container is preempted, and let the driver decrement 
numTaskAttemptFails accordingly. Although we lose some consistency here, at 
least we avoid job failures due to preemption. And there is usually some gap 
between two consecutive failed task attempts, so very likely the driver will 
have received the executor-preempted message within that gap. Thoughts?
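
To make this concrete, here is a rough, self-contained sketch of the 
fire-and-forget flow I have in mind. This is not actual Spark internals; 
every name below (ExecutorPreempted, PreemptionAwareFailureTracker, 
onTaskFailed, ...) is made up for illustration:

{code:scala}
object PreemptionSketch {

  // Message the YARN allocator would send to the driver when a container
  // is preempted -- fire-and-forget, instead of a blocking RPC call.
  case class ExecutorPreempted(executorId: String)

  class PreemptionAwareFailureTracker(maxFailures: Int) {
    private val preempted = scala.collection.mutable.Set.empty[String]
    private val failures =
      scala.collection.mutable.Map.empty[Long, Int].withDefaultValue(0)

    // Driver-side handler for the preemption message.
    def onExecutorPreempted(msg: ExecutorPreempted): Unit =
      preempted += msg.executorId

    // Called when a task attempt fails with ExecutorLostFailure. If we
    // already know the executor was preempted, the attempt does not count
    // toward spark.task.maxFailures. Returns true when the job should abort.
    def onTaskFailed(taskId: Long, executorId: String): Boolean = {
      if (!preempted.contains(executorId)) {
        failures(taskId) += 1
      }
      failures(taskId) >= maxFailures
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new PreemptionAwareFailureTracker(maxFailures = 4)
    tracker.onExecutorPreempted(ExecutorPreempted("exec-1"))
    // A failure on a known-preempted executor is not counted.
    assert(!tracker.onTaskFailed(taskId = 7L, executorId = "exec-1"))
  }
}
{code}

The race I mentioned shows up in onTaskFailed: if the failure report beats 
the preemption message to the driver, that attempt still counts, which is 
exactly the consistency we would be giving up.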

> Tasks that fail due to YARN preemption can cause job failure
> ------------------------------------------------------------
>
>                 Key: SPARK-8167
>                 URL: https://issues.apache.org/jira/browse/SPARK-8167
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, YARN
>    Affects Versions: 1.3.1
>            Reporter: Patrick Woody
>            Assignee: Matt Cheah
>            Priority: Blocker
>
> Tasks that are running on preempted executors count as FAILED with an 
> ExecutorLostFailure. Unfortunately, this can quickly spiral out of control 
> during a large resource shift, when tasks get rescheduled onto executors 
> that are immediately preempted as well.
> The current workaround is to set spark.task.maxFailures very high, but that 
> delays the detection of genuine failures. We should ideally differentiate 
> these task statuses so that they don't count towards the failure limit.
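
For completeness, the workaround mentioned above is just a config bump 
wherever the application builds its SparkConf; a minimal sketch (the value 
32 is arbitrary, the default is 4):

{code:scala}
import org.apache.spark.SparkConf

// Workaround only: raise the per-task failure budget so a burst of
// preemptions is less likely to exhaust it. The trade-off is slower
// detection of genuine failures, as the description points out.
val conf = new SparkConf()
  .setAppName("preemption-workaround")  // app name is arbitrary
  .set("spark.task.maxFailures", "32")
{code}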


