[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932088#comment-15932088 ]

Saisai Shao commented on SPARK-19941:

I think this scenario is quite similar to container preemption. In the preemption scenario, the AM is informed by the RM which containers will be preempted in the next 15 seconds (by default), and the AM can react based on that information. I made a similar PR to avoid scheduling tasks on executors that were about to be preempted, but it was ultimately rejected, mainly because letting soon-to-be-preempted executors sit idle for 15 seconds was considered too long a waste of resources. In your description the executors would be idle for 60 seconds before decommissioning, which is a real waste if most of the work could be finished on those executors within that minute. Also, I'm not sure why the job would hang as you mentioned; I would expect the failed tasks to simply be rerun. So IMHO it is better not to handle this scenario unless we actually run into problems with it. Sometimes the cost of rerunning tasks is smaller than the cost of wasting resources.

> Spark should not schedule tasks on executors on decommissioning YARN nodes
> --
>
> Key: SPARK-19941
> URL: https://issues.apache.org/jira/browse/SPARK-19941
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, YARN
> Affects Versions: 2.1.0
> Environment: Hadoop 2.8.0-rc1
> Reporter: Karthik Palaniappan
>
> Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in YARN: https://issues.apache.org/jira/browse/YARN-914
> Essentially you can mark nodes to be decommissioned, and let them a) finish work in progress and b) finish serving shuffle data. But no new work will be scheduled on the node.
> Spark should respect when NMs are set to decommissioned, and similarly decommission executors on those nodes by not scheduling any more tasks on them.
> It looks like in the future YARN may inform the app master when containers will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I don't think Spark should schedule based on a timeout. We should gracefully decommission the executor as fast as possible (which is the spirit of YARN-914). The app master can query the RM for NM statuses (if it doesn't already have them) and stop scheduling on executors on NMs that are decommissioning.
> Stretch feature: The timeout may be useful in determining whether running further tasks on the executor is even helpful. Spark may be able to tell that shuffle data will not be consumed by the time the node is decommissioned, so it is not worth computing. The executor can be killed immediately.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
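The behavior proposed in the description — the AM tracking which NMs are decommissioning and the scheduler declining to place tasks on executors there — can be sketched as a simple offer filter. This is an illustrative model only: `ExecutorOffer`, `schedulable_offers`, and the draining-host set are hypothetical names, not actual Spark or YARN APIs.

```python
# Illustrative sketch only: a scheduler-side filter that skips executors
# on hosts YARN has marked as decommissioning. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class ExecutorOffer:
    executor_id: str
    host: str
    free_cores: int

def schedulable_offers(offers, decommissioning_hosts):
    """Drop offers from executors on nodes that YARN is draining."""
    return [o for o in offers if o.host not in decommissioning_hosts]

offers = [
    ExecutorOffer("1", "node-a", 4),
    ExecutorOffer("2", "node-b", 4),
    ExecutorOffer("3", "node-b", 2),
]
draining = {"node-b"}  # e.g. learned by polling the RM for node reports
usable = schedulable_offers(offers, draining)
print([o.executor_id for o in usable])  # ['1']
```

With such a filter in place, executors on draining nodes receive no new tasks, go idle, and can then be released, while everything else keeps running — which matches the graceful-drain intent of YARN-914.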
[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931165#comment-15931165 ]

Sean Owen commented on SPARK-19941:

I'm not sure I agree with that. If the app wants N executors, as far as YARN is concerned it needs those containers, and YARN would wait for it to finish. Is that not the desired semantics here? Otherwise, how is this different from simple preemption, where YARN wants to force the container to stop?
[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930433#comment-15930433 ]

Karthik Palaniappan commented on SPARK-19941:

Yeah, I could have been clearer. The application *should* continue, but the driver should drain executors *on decommissioning nodes*, similarly to how YARN is draining the NMs. All other executors should continue running.
[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923750#comment-15923750 ]

Sean Owen commented on SPARK-19941:

In this scenario YARN is not trying to preempt applications, right? Then it should wait for the executor to finish. I don't see that this state means the app should stop. That is, the point of a decommissioning state is not just to tell apps they need to stop. It makes a bit more sense for dynamic allocation, because the app (sometimes) has permission to stop an executor and restart one. But there too, if Spark is using an executor, the decommissioning NM can wait for it to be done. It does make sense for the driver to somehow not schedule work on an executor that is going to be shut down by YARN, if it can get a heads-up. This sounds like something that would only work in Hadoop 2.8, so it would not be possible to use without some reflection, or until Hadoop 2.8 is required here.
[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923363#comment-15923363 ]

Karthik Palaniappan commented on SPARK-19941:

To repro: Set up Spark on YARN (Hadoop 2.8). Configure YARN with include and exclude node files (yarn.resourcemanager.nodes.include-path and yarn.resourcemanager.nodes.exclude-path). Start with all nodes included and no nodes excluded. Run a Spark job. I used:

```
spark-submit --class org.apache.spark.examples.SparkPi spark-examples.jar 10
```

While the job is running, add all nodes to the exclude file, and run `yarn rmadmin -refreshNodes -g 3600 -client`.

Expected: Spark does not schedule any more tasks on the executors, and they exit after being idle for 60s. The job hangs.

Actual: Spark continues to schedule tasks and the job completes successfully. The nodes are only decommissioned when the job finishes.

A less dramatic example is to decommission only a subset of the nodes and expect that tasks are not scheduled on executors on those hosts.
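The "stretch feature" in the issue description — killing an executor outright once its remaining work cannot pay off before the node goes away — boils down to a deadline check. A hypothetical sketch (the function name, inputs, and numbers are invented for illustration; real estimates would have to come from the scheduler's task and shuffle metrics):

```python
def should_kill_immediately(seconds_until_decommission: float,
                            est_task_seconds: float,
                            est_shuffle_consume_seconds: float) -> bool:
    """True if a task's output could not be produced and consumed before
    the node is decommissioned, i.e. computing it is wasted work."""
    return (est_task_seconds + est_shuffle_consume_seconds
            > seconds_until_decommission)

# A 40s task whose shuffle output needs another 30s to be served cannot
# fit in a 60s decommission window, so the executor can be killed now.
print(should_kill_immediately(60, 40, 30))  # True
print(should_kill_immediately(60, 20, 10))  # False
```

Under this heuristic, draining only helps when in-flight work can actually be consumed in time; otherwise releasing the container immediately returns the resources to YARN sooner.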