t oo created SPARK-32040:
----------------------------

             Summary: Idle cores not being allocated
                 Key: SPARK-32040
                 URL: https://issues.apache.org/jira/browse/SPARK-32040
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.4.5
            Reporter: t oo

Background: I have a cluster (2.4.5) running in standalone mode, orchestrated by Nomad jobs on EC2. We deploy a Scala web server as a long-running jar via `spark-submit` in client mode.

Sometimes an application ends up with 0 cores because our in-house autoscaler scales down and kills workers without checking whether any of their cores are allocated to running applications. The affected applications stay at 0 cores even though there are healthy workers in the cluster. Only when I submit a new application or register a new worker does the master finally reallocate cores to the application. This is problematic because the long-running application is stuck with 0 cores in the meantime.

Could this be related to the fact that `schedule()` is only triggered by new workers / new applications, as commented here?
[https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724]

If that is the case, should `schedule()` also be called when workers are removed after `timeOutWorkers()`?
[https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417]

The downscaling produces the following in my logs, so I am fairly certain `timeOutWorkers()` is being called:

```
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-10.158.242.213-7077
```
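
To make the suspected behaviour concrete, here is a toy model in Python. It is only a sketch of the hypothesis in this report, not Spark's code: `ToyMaster`, `Worker`, `App`, `run`, and the `reschedule_on_worker_loss` flag are all hypothetical names, and the core-assignment loop is deliberately simplistic. It assumes scheduling runs only when a worker or an application registers, unless the flag is set.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    worker_id: str
    cores_free: int


@dataclass
class App:
    app_id: str
    cores_wanted: int
    # worker_id -> number of cores granted on that worker
    granted: dict = field(default_factory=dict)

    @property
    def cores(self) -> int:
        return sum(self.granted.values())


class ToyMaster:
    """Toy stand-in for a standalone master; NOT Spark's implementation."""

    def __init__(self, reschedule_on_worker_loss: bool = False):
        self.workers = {}
        self.apps = []
        # Hypothetical switch: also run schedule() after a worker is removed.
        self.reschedule_on_worker_loss = reschedule_on_worker_loss

    def schedule(self) -> None:
        # Hand free cores to every app that still wants more.
        for app in self.apps:
            for worker in self.workers.values():
                need = app.cores_wanted - app.cores
                if need <= 0:
                    break
                take = min(need, worker.cores_free)
                if take > 0:
                    worker.cores_free -= take
                    already = app.granted.get(worker.worker_id, 0)
                    app.granted[worker.worker_id] = already + take

    def register_worker(self, worker: Worker) -> None:
        self.workers[worker.worker_id] = worker
        self.schedule()  # assumed trigger: new worker

    def register_app(self, app: App) -> None:
        self.apps.append(app)
        self.schedule()  # assumed trigger: new application

    def remove_worker(self, worker_id: str) -> None:
        # The worker's cores disappear from every app that held them.
        self.workers.pop(worker_id)
        for app in self.apps:
            app.granted.pop(worker_id, None)
        if self.reschedule_on_worker_loss:
            self.schedule()


def run(reschedule_on_worker_loss: bool) -> int:
    """Return the app's core count right after its only worker is removed."""
    master = ToyMaster(reschedule_on_worker_loss)
    master.register_worker(Worker("w1", 4))
    master.register_worker(Worker("w2", 4))
    app = App("long-running", cores_wanted=4)
    master.register_app(app)    # all 4 cores land on w1
    master.remove_worker("w1")  # autoscaler kills w1; w2 is still healthy
    return app.cores
```

In this model `run(False)` leaves the application at 0 cores while `w2` sits idle with 4 free cores, which matches the stuck state described above; `run(True)` moves the application onto `w2` immediately. Whether the real master behaves this way is exactly the question being asked.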