Github user mariahualiu commented on the issue:

    https://github.com/apache/spark/pull/17854
  
    Let me describe what I've seen when using 2500 executors.
    
    1. In the first few (2~3) allocation requests, the AM received all (in this case 2500) containers from YARN. 
    2. Within a few seconds, 2500 container-launch commands were sent out. 
    3. It took 3~4 minutes to start an executor on an NM (most of the time was spent on container localization: downloading the Spark jar, the application jar, etc. from the HDFS staging folder). 
    4. A large number of executors tried to retrieve Spark properties from the driver but failed to connect, and a massive wave of failed-executor removals followed. It seems to me that RemoveExecutor is handled by the same single thread that responds to RetrieveSparkProps and RegisterExecutor (see the sketch at the end of this comment). As a result, this thread became even busier, and even more executors could not connect or register.
    5. YarnAllocator requested more containers to make up for the failed ones. More executors tried to retrieve Spark props and register, but the thread was still overwhelmed by the previous round of executors and could not respond. 
    
    In some cases, we got 5000 executor failures and the application retried 
and eventually failed.
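
    To illustrate step 4: below is a minimal, self-contained sketch (not Spark's actual code; the object and method names are made up for illustration) of a single-threaded message loop that serializes RetrieveSparkProps, RegisterExecutor and RemoveExecutor. A burst of removals queued ahead of registrations delays every registration behind it.

    ```scala
    import java.util.concurrent.{Executors, LinkedBlockingQueue}

    // Illustrative message types, named after the RPCs mentioned above.
    sealed trait DriverMessage
    case class RetrieveSparkProps(executorId: String) extends DriverMessage
    case class RegisterExecutor(executorId: String) extends DriverMessage
    case class RemoveExecutor(executorId: String, reason: String) extends DriverMessage

    object SingleThreadedEndpointSketch {
      private val inbox = new LinkedBlockingQueue[DriverMessage]()
      // One dispatcher thread drains the inbox, mirroring the single-threaded
      // handling described in step 4.
      private val dispatcher = Executors.newSingleThreadExecutor()

      def send(msg: DriverMessage): Unit = inbox.put(msg)

      def start(): Unit = dispatcher.submit(new Runnable {
        override def run(): Unit = {
          while (true) {
            // take() throws InterruptedException once shutdownNow() is called,
            // which ends the loop.
            inbox.take() match {
              case RetrieveSparkProps(id)  => println(s"sent props to $id")
              case RegisterExecutor(id)    => println(s"registered $id")
              case RemoveExecutor(id, why) => println(s"removed $id: $why")
            }
          }
        }
      })

      def main(args: Array[String]): Unit = {
        start()
        // A flood of removals is queued ahead of a handful of registrations.
        (1 to 1000).foreach(i => send(RemoveExecutor(s"exec-$i", "failed to connect")))
        (1001 to 1010).foreach(i => send(RegisterExecutor(s"exec-$i")))
        Thread.sleep(2000)
        dispatcher.shutdownNow()
      }
    }
    ```

    In this run, all 1000 RemoveExecutor messages have to be drained by the one dispatcher thread before the 10 RegisterExecutor messages are even seen, which matches the behavior described in steps 4 and 5.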

