[ https://issues.apache.org/jira/browse/SPARK-32040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-32040:
---------------------------------
    Description: 
*Background:*

I have a Spark 2.4.5 cluster running in standalone mode, orchestrated by Nomad 
jobs on EC2. We deploy a Scala web server as a long-running jar via 
{{spark-submit}} in client mode. Sometimes an application ends up with 0 cores 
because our in-house autoscaler scales down and kills workers without checking 
whether any of those workers' cores are allocated to existing applications. 
Such an application is then left with 0 cores even though there are healthy 
workers in the cluster. 

However, the master only reallocates cores to the application after I submit a 
new application or register a new worker in the cluster. This is problematic, 
because until then the long-running 0-core application is stuck. 

Could this be related to the fact that {{schedule()}} is only triggered when a 
new worker or a new application registers, as described in the comment here? 
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724

If that is the case, should the master be calling {{schedule()}} when it 
removes workers after {{timeOutWorkers()}} fires? 
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417
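
To make the gap concrete, here is a tiny self-contained toy model (plain Scala, 
not Spark code; {{ToyMaster}}, {{Worker}} and {{AppInfo}} are made-up names for 
illustration). It only mimics the behaviour I think I am seeing: cores are 
handed out solely when {{schedule()}} runs, so if worker removal never re-runs 
it, an application that lost its executors stays at 0 cores until some 
unrelated event triggers scheduling again:

{code}
// Toy model only -- NOT Spark code. It mimics the behaviour described above:
// cores are granted to apps only when schedule() runs, so if removing a dead
// worker never re-runs schedule(), an app that lost its cores stays starved.
object ToyMaster extends App {
  case class Worker(id: String, var freeCores: Int)
  case class AppInfo(id: String, coresWanted: Int, var coresGranted: Int = 0)

  val workers = scala.collection.mutable.ListBuffer(Worker("w1", 2), Worker("w2", 2))
  val apps    = scala.collection.mutable.ListBuffer(AppInfo("app-0", coresWanted = 2))

  // Hand free worker cores to any app that has fewer cores than it asked for.
  def schedule(): Unit =
    for (app <- apps; w <- workers if app.coresGranted < app.coresWanted && w.freeCores > 0) {
      val n = math.min(w.freeCores, app.coresWanted - app.coresGranted)
      w.freeCores -= n
      app.coresGranted += n
    }

  schedule()
  println(s"after initial schedule: $apps")   // app-0 holds 2 cores on w1

  // Autoscaler kills w1; the app loses the executors that ran there.
  val dead = workers.remove(0)
  apps.foreach(a => a.coresGranted = 0)
  println(s"after losing $dead: $apps")       // app-0 stuck at 0 cores (current behaviour)

  // Proposed: re-run schedule() as part of worker removal, so the app is
  // given cores on the workers that are still alive (w2 here).
  schedule()
  println(s"after re-scheduling: $apps")      // app-0 holds 2 cores again
}
{code}

In the toy, the extra {{schedule()}} call after the worker is removed is the 
kind of change I am asking about for the real worker-timeout path. 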

The downscaling produces the following in my logs, so I am fairly certain 
{{timeOutWorkers()}} is being called: 

{code}
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-10.158.242.213-7077
{code}



> Idle cores not being allocated
> ------------------------------
>
>                 Key: SPARK-32040
>                 URL: https://issues.apache.org/jira/browse/SPARK-32040
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.4.5
>            Reporter: t oo
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
