Thomas Graves created SPARK-21383:
-------------------------------------

             Summary: YARN: can allocate too many containers
                 Key: SPARK-21383
                 URL: https://issues.apache.org/jira/browse/SPARK-21383
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 2.0.0
            Reporter: Thomas Graves


The YarnAllocator doesn't properly track containers that are being launched but
are not yet running. If launching the containers on the NM takes time, they
don't count toward numExecutorsRunning, but they have already been removed from
the pending list. If allocateResources is called again in that window, it
concludes that executors are missing even though they aren't (they just haven't
been launched yet).

This was introduced by SPARK-12447.

Where it checks for missing executors:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L297

It only updates numExecutorsRunning after the NM has started the container:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L524

Thus, if the NM or the network is slow, the allocator can miscount and start
additional executors.
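
To make the miscount concrete, here is a rough toy model of the counters
involved. It is only a sketch and not the YarnAllocator code: the class
ToyAllocator, its fields, and the count_launching switch are hypothetical names
made up for illustration, and the target of 4 executors is arbitrary. It assumes
two allocation passes happen before any NM reports a started container.

```python
from dataclasses import dataclass


@dataclass
class ToyAllocator:
    """Toy model of the counters described above; not Spark's actual code."""
    target: int          # executors the application wants
    pending: int = 0     # requested from the RM, not yet allocated
    launching: int = 0   # allocated, launch on the NM still in flight
    running: int = 0     # NM has confirmed the start

    def missing(self, count_launching: bool) -> int:
        """Executors the allocator believes it still has to request."""
        known = self.pending + self.running
        if count_launching:
            known += self.launching
        return max(self.target - known, 0)

    def request(self, n: int) -> None:
        self.pending += n

    def allocate_all(self) -> None:
        """RM grants every pending request; launches start but do not finish."""
        self.launching += self.pending
        self.pending = 0


def total_requested(count_launching: bool) -> int:
    alloc = ToyAllocator(target=4)
    requested = 0
    # Two allocation passes run before any NM reports a started container.
    for _ in range(2):
        n = alloc.missing(count_launching)
        alloc.request(n)
        requested += n
        alloc.allocate_all()
    return requested


print(total_requested(count_launching=False))  # 8: second pass re-requests all 4
print(total_requested(count_launching=True))   # 4: in-flight launches are counted
```

In this model, ignoring the in-flight launches doubles the number of containers
requested, which is the behavior described above. Counting them is one possible
way to close the gap, shown here only as an assumption about what a fix might
look like.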



