Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
@tgravescs I used the default spark.network.timeout (120s). When an
executor cannot connect to the driver, here is the executor log:
17/05/01 11:18:25 INFO [main] spark.SecurityManager
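For reference, the timeout in question can be raised above its 120s default. A minimal sketch, assuming a Spark 2.x application; the 600s value is purely illustrative, not a recommendation made in this thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Raise the RPC timeout above the 120s default so that executors
// that are slow to reach the driver are not failed immediately.
// "600s" is an illustrative value only.
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")

val spark = SparkSession.builder().config(conf).getOrCreate()
```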
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
Now I can comfortably use 2500 executors. But when I pushed the executor
count to 3000, I saw a lot of heartbeat timeout errors. That is something else
we can improve, probably in another JIRA.
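The heartbeat timeouts mentioned above are governed by separate settings. As a hedged sketch (values illustrative, not taken from this thread), one could lengthen the executor heartbeat interval so a heavily loaded driver is pinged less often:

```scala
import org.apache.spark.SparkConf

// Illustrative values only. spark.executor.heartbeatInterval defaults
// to 10s; spark.network.timeout must stay well above the interval.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "30s")
  .set("spark.network.timeout", "600s")
```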
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
I re-ran the same application with these configurations added: "--conf
spark.yarn.scheduler.heartbeat.interval-ms=15000 --conf
spark.yarn.launchContainer.count.simultaneously=50". Though ...
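For readers unfamiliar with the second setting, which this PR introduces: it bounds how many of the granted containers are launched per scheduling pass. A rough sketch of the batching idea, with hypothetical names and not the PR's actual code:

```scala
// Hypothetical sketch: per allocator heartbeat, launch at most
// `launchCap` of the containers YARN has granted; the remainder
// wait for the next pass.
val launchCap = 50 // mirrors spark.yarn.launchContainer.count.simultaneously

def launchSome[C](pending: List[C], launch: C => Unit): List[C] = {
  val (now, later) = pending.splitAt(launchCap)
  now.foreach(launch)
  later // carried over to the next heartbeat
}
```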
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
Let me describe what I've seen when using 2500 executors.
1. In the first few (2-3) requests, the AM received all (in this case 2500)
containers from YARN.
2. In a few seconds, 2500 ...
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
@squito yes, I capped the number of resources in updateResourceRequests so
that YarnAllocator asks for fewer resources in each iteration. When
allocation fails in one iteration ...
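To make the capping concrete, here is a simplified sketch of the idea, not the PR's actual diff of YarnAllocator.updateResourceRequests; the cap value and helper name are hypothetical:

```scala
// Simplified sketch of the capping idea: instead of asking YARN for
// every missing executor at once, request at most `requestCap` new
// containers per allocator iteration.
val requestCap = 500 // hypothetical cap, tuned per cluster

def containersToRequest(targetExecutors: Int, runningOrPending: Int): Int = {
  val missing = targetExecutors - runningOrPending
  math.min(math.max(missing, 0), requestCap)
}
```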
GitHub user mariahualiu opened a pull request:
https://github.com/apache/spark/pull/17854
[SPARK-20564][Deploy] Reduce massive executor failures when executor count
is large (>2000)
## What changes were proposed in this pull request?
In applications that use over 2000 executors, ...