[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427181#comment-16427181 ]

Attila Zsolt Piros commented on SPARK-16630:
--------------------------------------------

[~tgraves] what about stopping YARN blacklisting once a configured limit is reached, 
with the default of "spark.executor.instances" * 
"spark.yarn.blacklisting.default.executor.instances.size.weight" (a better name 
for the weight is welcomed)? The limit would count all blacklisted nodes, 
including nodes blacklisted at the stage and task level, and in the case of 
dynamic allocation the default would be Int.MaxValue, so there would be no limit at all.

This idea comes from the calculation of the default for maxNumExecutorFailures.
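
To make the proposal concrete, here is a minimal sketch of how such a default could be derived, mirroring the shape of the maxNumExecutorFailures calculation. The weight key is only the name suggested above (a better name is welcomed), it does not exist in Spark today, and its 1.0 default is a placeholder:

{code:scala}
import org.apache.spark.SparkConf

object BlacklistNodeLimit {
  // Sketch only: the weight key below is the name proposed in this comment;
  // it is not an existing Spark config and its default of 1.0 is a placeholder.
  def maxBlacklistedNodes(conf: SparkConf): Int = {
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      // With dynamic allocation the proposed default is Int.MaxValue,
      // i.e. no limit at all.
      Int.MaxValue
    } else {
      val executorInstances = conf.getInt("spark.executor.instances", 2)
      val weight = conf.getDouble(
        "spark.yarn.blacklisting.default.executor.instances.size.weight", 1.0)
      // The limit counts every blacklisted node, including nodes
      // blacklisted at the stage and task level.
      math.max(1, (executorInstances * weight).toInt)
    }
  }
}
{code}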

> Blacklist a node if executors won't launch on it.
> -------------------------------------------------
>
>                 Key: SPARK-16630
>                 URL: https://issues.apache.org/jira/browse/SPARK-16630
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.6.2
>            Reporter: Thomas Graves
>            Priority: Major
>
> On YARN, it's possible that a node is broken or misconfigured such that a 
> container won't launch on it. For instance, the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware issue or 
> Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.


