[ https://issues.apache.org/jira/browse/SPARK-32040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-32040:
---------------------------------
Description:

*Background:*

I have a cluster (2.4.5) running in standalone mode, orchestrated by Nomad jobs on EC2. We deploy a Scala web server as a long-running jar via {{spark-submit}} in client mode. Sometimes an application ends up with 0 cores because our in-house autoscaler scales down and kills workers without checking whether any of their cores are allocated to existing applications. These applications are then left with 0 cores even though there are healthy workers in the cluster.

However, only when I submit a new application or register a new worker in the cluster does the master finally reallocate cores to the application. This is problematic, because the long-running 0-core application is stuck until then.

Could this be related to the fact that {{schedule()}} is only triggered by new workers / new applications, as noted in the comment here?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724

If that is the case, should the master also call {{schedule()}} when it removes workers after {{timeOutWorkers()}}?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417

The downscaling produces the following in my logs, so I am fairly certain {{timeOutWorkers()}} is being called:
{code}
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-10.158.242.213-7077
{code}
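For illustration only, here is a minimal, self-contained Scala sketch of the behaviour being proposed. It is not Spark's actual Master code; every class and method name in it is an illustrative stand-in. It models what calling {{schedule()}} right after timing out dead workers would do: an application left at 0 cores is re-offered cores on the remaining healthy workers immediately, instead of waiting for the next worker or application registration to trigger scheduling.

{code}
// Toy model only -- not Spark internals. Worker, App, schedule() and timeOutDeadWorkers()
// are illustrative stand-ins for the real Master bookkeeping.
import scala.collection.mutable.ArrayBuffer

object SchedulingSketch {

  final case class Worker(id: String, totalCores: Int, var freeCores: Int, var alive: Boolean = true)
  final case class App(id: String, coresWanted: Int, var coresGranted: Int = 0)

  val workers = ArrayBuffer.empty[Worker]
  val apps    = ArrayBuffer.empty[App]

  /** Stand-in for the scheduling pass: offer free cores on alive workers to apps that still want cores. */
  def schedule(): Unit =
    for {
      app <- apps
      w   <- workers
      if w.alive && w.freeCores > 0
      need = app.coresWanted - app.coresGranted
      if need > 0
    } {
      val granted = math.min(w.freeCores, need)
      w.freeCores -= granted
      app.coresGranted += granted
      println(s"granted $granted core(s) on ${w.id} to ${app.id}")
    }

  /** Stand-in for the worker-timeout path: drop dead workers, then reschedule right away. */
  def timeOutDeadWorkers(): Unit = {
    val dead = workers.filter(!_.alive)
    if (dead.nonEmpty) {
      dead.foreach(w => println(s"removing ${w.id}; executors on it are lost"))
      // Simplification: assume the lost workers hosted all of each app's executors.
      apps.foreach(a => a.coresGranted = 0)
      workers --= dead
      // The change suggested in this ticket: reschedule here instead of waiting for the
      // next worker or application registration to trigger the scheduling pass.
      schedule()
    }
  }

  def main(args: Array[String]): Unit = {
    workers += Worker("worker-A", totalCores = 2, freeCores = 2)
    workers += Worker("worker-B", totalCores = 2, freeCores = 2)
    apps    += App("app-0", coresWanted = 2)

    schedule()                 // app-0 is granted 2 cores on worker-A
    workers.head.alive = false // autoscaler kills worker-A without checking its allocations
    timeOutDeadWorkers()       // app-0 is immediately re-granted 2 cores on worker-B
  }
}
{code}

Running the sketch, app-0 drops to 0 cores when worker-A is removed and is immediately re-granted 2 cores on worker-B; without the {{schedule()}} call inside the timeout path, it would sit at 0 cores until something else triggered scheduling, which matches the behaviour described above.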
> Idle cores not being allocated
> ------------------------------
>
>                 Key: SPARK-32040
>                 URL: https://issues.apache.org/jira/browse/SPARK-32040
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.4.5
>            Reporter: t oo
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org