[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930433#comment-15930433 ]
Karthik Palaniappan commented on SPARK-19941: --------------------------------------------- Yeah, I could have been more clear. The application *should* continue, but the driver should drain executors *on decommissioning nodes* similar to how YARN is draining the NMs. All other executors should continue running. > Spark should not schedule tasks on executors on decommissioning YARN nodes > -------------------------------------------------------------------------- > > Key: SPARK-19941 > URL: https://issues.apache.org/jira/browse/SPARK-19941 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN > Affects Versions: 2.1.0 > Environment: Hadoop 2.8.0-rc1 > Reporter: Karthik Palaniappan > > Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in > YARN: https://issues.apache.org/jira/browse/YARN-914 > Essentially you can mark nodes to be decommissioned, and let them a) finish > work in progress and b) finish serving shuffle data. But no new work will be > scheduled on the node. > Spark should respect when NMs are set to decommissioned, and similarly > decommission executors on those nodes by not scheduling any more tasks on > them. > It looks like in the future YARN may inform the app master when containers > will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I > don't think Spark should schedule based on a timeout. We should gracefully > decommission the executor as fast as possible (which is the spirit of > YARN-914). The app master can query the RM for NM statuses (if it doesn't > already have them) and stop scheduling on executors on NMs that are > decommissioning. > Stretch feature: The timeout may be useful in determining whether running > further tasks on the executor is even helpful. Spark may be able to tell that > shuffle data will not be consumed by the time the node is decommissioned, so > it is not worth computing. The executor can be killed immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org