[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2016-03-08 Thread Jakub Dubovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185005#comment-15185005
 ] 

Jakub Dubovsky commented on SPARK-8167:
---

I have a question about this fix. In the Spark UI's active stages view 
there is an entry saying this:

Tasks: Succeeded/Total
1480/2880 (1311 failed)

Does this number of failed tasks include those which "failed" because of 
preemption?
It's useful to know whether my job is actually failing (I should fix it) or 
its resources are merely being taken away (I should wait).

Thank you

> Tasks that fail due to YARN preemption can cause job failure
> 
>
> Key: SPARK-8167
> URL: https://issues.apache.org/jira/browse/SPARK-8167
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 1.3.1
>Reporter: Patrick Woody
>Assignee: Matt Cheah
>Priority: Blocker
> Fix For: 1.6.0
>
>
> Tasks that are running on preempted executors will count as FAILED with an 
> ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
> a large resource shift is occurring, and the tasks get scheduled to executors 
> that immediately get preempted as well.
> The current workaround is to increase spark.task.maxFailures very high, but 
> that can cause delays in true failures. We should ideally differentiate these 
> task statuses so that they don't count towards the failure limit.
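To make the proposed fix concrete, here is a minimal Python model (not Spark source; `TaskFailure`, `should_abort`, and `MAX_FAILURES` are hypothetical names) of what "not counting preemption toward the failure limit" means:

```python
# Illustrative model only: a task is aborted after MAX_FAILURES genuine
# failures, but failures caused by executor preemption are excluded.
from dataclasses import dataclass

MAX_FAILURES = 4  # plays the role of spark.task.maxFailures


@dataclass
class TaskFailure:
    task_id: int
    preempted: bool  # True if the executor was lost to YARN preemption


def should_abort(failures):
    # Only failures NOT caused by preemption count toward the limit.
    countable = sum(1 for f in failures if not f.preempted)
    return countable >= MAX_FAILURES


# A task preempted many times does not abort the job...
assert not should_abort([TaskFailure(1, preempted=True) for _ in range(10)])

# ...but genuine failures still trip the limit.
assert should_abort([TaskFailure(1, preempted=False) for _ in range(4)])
```

This is only a sketch of the counting rule; the actual change lives in Spark's scheduler, where the task end reason must first reach the driver.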



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660871#comment-14660871
 ] 

Apache Spark commented on SPARK-8167:
-

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8007




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-07-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647313#comment-14647313
 ] 

Jeff Zhang commented on SPARK-8167:
---

[~mcheah] What's the status of this ticket? I don't think a blocking RPC call 
is a good idea. We could instead send an executor-preempted message to the 
driver when the container is preempted, and let the driver decrement the 
task's failure count. Although we lose some consistency here, at least we 
avoid job failures due to preemption. There is also some gap between two 
consecutive failed task attempts, and very likely the driver will have 
received the executor-preempted message within that gap. Thoughts?
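A minimal Python model of this proposal may help (illustrative only — `DriverFailureTracker`, `on_task_failed`, and `on_executor_preempted` are hypothetical names, not Spark APIs): the driver counts failures eagerly, and retroactively un-counts those attributed to an executor it later learns was preempted.

```python
# Illustrative model of the asynchronous-message proposal.
from collections import defaultdict


class DriverFailureTracker:
    def __init__(self, max_failures=4):
        self.max_failures = max_failures
        self.fail_counts = defaultdict(int)         # task_id -> counted failures
        self.fails_by_executor = defaultdict(list)  # executor_id -> task_ids

    def on_task_failed(self, task_id, executor_id):
        # Count the failure immediately; return True if the task should abort.
        self.fail_counts[task_id] += 1
        self.fails_by_executor[executor_id].append(task_id)
        return self.fail_counts[task_id] >= self.max_failures

    def on_executor_preempted(self, executor_id):
        # The preemption message arrived: un-count failures from that executor.
        for task_id in self.fails_by_executor.pop(executor_id, []):
            self.fail_counts[task_id] -= 1


tracker = DriverFailureTracker(max_failures=4)
for _ in range(3):
    tracker.on_task_failed(7, "exec-1")
tracker.on_executor_preempted("exec-1")  # message arrives in the retry gap
assert tracker.fail_counts[7] == 0       # failures were retroactively excused
```

The inconsistency Jeff mentions shows up here too: if the task hits the limit before the preemption message arrives, the abort decision has already been made.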




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-26 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603913#comment-14603913
 ] 

Matt Cheah commented on SPARK-8167:
---

One thought is, whenever a task fails with an executor-lost failure, to have 
YARN-specific logic ask the YarnAllocator (ApplicationMaster) whether the 
executor that was just lost had been preempted. There might be some nasty race 
conditions here, though, and it would require invoking a blocking RPC call 
inside TaskSetManager.executorLost or something similar - which runs on the 
message loop of an RpcEndpoint. Invoking a blocking RPC call on the message 
loop is probably not desirable.




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-25 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601678#comment-14601678
 ] 

Matt Cheah commented on SPARK-8167:
---

[~joshrosen] any thoughts on this?




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-24 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600012#comment-14600012
 ] 

Matt Cheah commented on SPARK-8167:
---

What's curious here, as I'm trying to design this, is that it's not 
immediately obvious how to transfer the executor's exit code from the remote 
machine back to the driver. If the executor dies, the driver immediately sees 
the connection as dropped and removes the executor without ever learning what 
the exit code was; it is hard to know the exit code in YARN mode in 
particular.

Does anyone have any thoughts on how to get the exit code of the executor to 
the driver in yarn-client mode?




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-23 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597956#comment-14597956
 ] 

Matt Cheah commented on SPARK-8167:
---

I'm starting to work on this now, sorry for the delay.




[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-08 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577727#comment-14577727
 ] 

Matt Cheah commented on SPARK-8167:
---

To be clear, this is independent of SPARK-7451. SPARK-7451 helps for the case 
where executors die too many times from preemption, but it doesn't help if 
the exact same task gets preempted many times.
