[ 
https://issues.apache.org/jira/browse/SPARK-30191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie updated SPARK-30191:
-----------------------------
    Description: 
I run Spark on YARN. When a machine has a hardware failure and every service 
on that node, including the NodeManager and the executors, is killed, the 
driver loses those executors. The ResourceManager cannot update the container 
information for that node until it tries to remove the node, which always 
takes 10 minutes or longer. In the meantime, the AM does not add new resource 
requests, so the driver stays short of executors.

So maybe the AM should take `numExecutorsExiting` into account in 
YarnAllocator's `updateResourceRequests` method to optimize this.
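
To make the suggestion concrete, here is a minimal sketch in Python. It is a 
standalone model, not Spark's actual code. `AllocatorState`, 
`missing_executors`, and `count_exiting` are hypothetical names introduced 
for illustration, and the `exiting` counter stands in for the proposed 
`numExecutorsExiting`; how Spark would maintain that counter is an 
assumption.

```python
from dataclasses import dataclass


@dataclass
class AllocatorState:
    """Hypothetical snapshot of the counters an allocator might track."""
    target: int       # executors the driver wants
    pending: int      # container requests not yet granted
    starting: int     # containers granted, executor still launching
    running: int      # executors the allocator still believes are alive
    exiting: int = 0  # executors the driver reports lost, not yet confirmed by the RM


def missing_executors(state: AllocatorState, count_exiting: bool) -> int:
    """Return how many executors the allocator thinks it still needs.

    With count_exiting=False, lost executors stay in `running` until the
    ResourceManager reports their containers as completed, so nothing is
    re-requested. With count_exiting=True, executors already known to be
    exiting are subtracted right away (the proposed behaviour, as assumed here).
    """
    believed = state.pending + state.starting + state.running
    if count_exiting:
        believed -= state.exiting
    return state.target - believed


# A node hosting 3 of 10 executors dies; the RM has not yet removed the node.
state = AllocatorState(target=10, pending=0, starting=0, running=10, exiting=3)
print(missing_executors(state, count_exiting=False))  # 0: no new requests
print(missing_executors(state, count_exiting=True))   # 3: replacements requested
```

In this model the replacements can be requested as soon as the driver 
notices the loss, instead of only after the ResourceManager expires the node.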


> AM should update pending resource request faster when driver lost executor
> --------------------------------------------------------------------------
>
>                 Key: SPARK-30191
>                 URL: https://issues.apache.org/jira/browse/SPARK-30191
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.4.4
>            Reporter: Max  Xie
>            Priority: Minor
>
> I run Spark on YARN. When a machine has a hardware failure and every service 
> on that node, including the NodeManager and the executors, is killed, the 
> driver loses those executors. The ResourceManager cannot update the container 
> information for that node until it tries to remove the node, which always 
> takes 10 minutes or longer. In the meantime, the AM does not add new resource 
> requests, so the driver stays short of executors.
> So maybe the AM should take `numExecutorsExiting` into account in 
> YarnAllocator's `updateResourceRequests` method to optimize this.


