[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185005#comment-15185005 ]

Jakub Dubovsky commented on SPARK-8167:
---------------------------------------

I have a question about this fix. In the Spark UI's active stages view there is an entry like:

    Tasks: Succeeded/Total 1480/2880 (1311 failed)

Does this count of failed tasks include the ones that "failed" because of preemption? It is useful to know whether my job is really failing (I should fix it) or its resources are merely being taken away (I should wait). Thank you.

> Tasks that fail due to YARN preemption can cause job failure
> ------------------------------------------------------------
>
>                 Key: SPARK-8167
>                 URL: https://issues.apache.org/jira/browse/SPARK-8167
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, YARN
>    Affects Versions: 1.3.1
>            Reporter: Patrick Woody
>            Assignee: Matt Cheah
>            Priority: Blocker
>             Fix For: 1.6.0
>
> Tasks that are running on preempted executors will count as FAILED with an
> ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if
> a large resource shift is occurring, and the tasks get scheduled to executors
> that immediately get preempted as well.
> The current workaround is to increase spark.task.maxFailures very high, but
> that can cause delays in true failures. We should ideally differentiate these
> task statuses so that they don't count towards the failure limit.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660871#comment-14660871 ]

Apache Spark commented on SPARK-8167:
-------------------------------------

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8007
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647313#comment-14647313 ]

Jeff Zhang commented on SPARK-8167:
-----------------------------------

[~mcheah] What's the status of this ticket? I don't think a blocking RPC call is a good idea. I think we could just send an executor-preempted message to the driver when the container is preempted, and let the driver decrease numTaskAttemptFails. Although we lose some consistency here, at least we avoid job failures due to preemption. There is also some gap between two consecutive failed attempts of the same task, and very likely the driver will have received the executor-preempted message within that gap. Thoughts?
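The scheme proposed above can be sketched as a small model: the driver tracks executors it has been told were preempted and does not count attempt losses on those executors toward the failure limit. All names here (`TaskFailureTracker`, `on_executor_preempted`, `on_task_attempt_lost`) are illustrative stand-ins, not Spark's actual internals:

```python
# Hypothetical model of the proposal above, not Spark's implementation:
# the driver remembers which executors were reported preempted and skips
# counting attempt losses on them toward spark.task.maxFailures.

class TaskFailureTracker:
    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures_per_task = {}       # task id -> counted failures
        self.preempted_executors = set()  # executor ids reported preempted

    def on_executor_preempted(self, executor_id):
        """The AM forwarded a YARN 'container preempted' notification."""
        self.preempted_executors.add(executor_id)

    def on_task_attempt_lost(self, task_id, executor_id):
        """Record a lost attempt; return True if the task set should abort."""
        if executor_id not in self.preempted_executors:
            self.failures_per_task[task_id] = (
                self.failures_per_task.get(task_id, 0) + 1
            )
        return self.failures_per_task.get(task_id, 0) >= self.max_failures
```

In this model a lost attempt on a preempted executor never moves a task closer to spark.task.maxFailures; the consistency caveat from the comment is visible too, since a loss that arrives before the preemption notification would still be counted.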
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603913#comment-14603913 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------

One thought is to add YARN-specific logic so that, whenever a task fails with an executor-lost failure, we ask the YarnAllocator (ApplicationMaster) whether the executor that was just lost had been preempted. There might be some nasty race conditions here, though, and it would require invoking a blocking RPC call inside TaskSetManager.executorLost or something similar, which runs on the message loop of an RpcEndpoint. Invoking a blocking RPC call in the message loop is probably not desirable.
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601678#comment-14601678 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------

[~joshrosen] any thoughts on this?
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600012#comment-14600012 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------

What's curious as I try to design this is that it is not immediately obvious how to transfer the executor's exit code from the remote machine back to the driver. When an executor dies, the driver immediately sees the connection as dropped and simply removes the executor without learning what the exit code was; in YARN mode in particular it is hard to find out the exit code. Does anyone have thoughts on how to get the executor's exit code to the driver in yarn-client mode?
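For context, YARN itself does report preemption in the container exit status that the ApplicationMaster receives when a container completes: Hadoop's ContainerExitStatus defines PREEMPTED as -102. A minimal sketch of classifying that status on the AM side (the loss-reason values are illustrative, not Spark's classes; only the -102 constant comes from Hadoop's API):

```python
# Sketch of AM-side classification of a completed container. PREEMPTED
# mirrors org.apache.hadoop.yarn.api.records.ContainerExitStatus.PREEMPTED
# (-102); the loss-reason tuples are illustrative, not Spark's types.

CONTAINER_EXIT_SUCCESS = 0
CONTAINER_EXIT_PREEMPTED = -102  # container preempted by the YARN scheduler

def loss_reason(exit_status):
    """Classify a completed container's exit status.

    The AM sees this status when YARN reports the completed container; the
    remaining problem (the one raised in the comment above) is forwarding it
    to the driver before the driver discards the executor on connection drop.
    """
    if exit_status == CONTAINER_EXIT_PREEMPTED:
        return ("preempted", None)
    return ("exited", exit_status)
```

Classifying on the AM side like this would avoid the driver-initiated blocking ask discussed elsewhere in this thread, at the cost of having to push the classification to the driver asynchronously.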
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597956#comment-14597956 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------

I'm starting to work on this now; sorry for the delay.
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577727#comment-14577727 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------

To be clear, this is independent of SPARK-7451. SPARK-7451 helps when executors die too many times from preemption, but it does not help if the exact same task gets preempted many times.