[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-09 Thread mariahualiu
GitHub user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 @tgravescs I used the default spark.network.timeout (120s). When an executor cannot connect to the driver, here is the executor log: 17/05/01 11:18:25 INFO [main] spark.SecurityManager

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
GitHub user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 Now I can comfortably use 2500 executors. But when I pushed the executor count to 3000, I saw a lot of heartbeat timeout errors. That is something else we can improve, probably in another JIRA

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
GitHub user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 I re-ran the same application with these configurations added: "--conf spark.yarn.scheduler.heartbeat.interval-ms=15000 --conf spark.yarn.launchContainer.count.simultaneously=50". Though

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
GitHub user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 Let me describe what I've seen when using 2500 executors. 1. In the first few (2~3) requests, the AM received all (in this case 2500) containers from YARN. 2. In a few seconds, 2500

[GitHub] spark issue #17854: [SPARK-20564][Deploy] Reduce massive executor failures w...

2017-05-06 Thread mariahualiu
GitHub user mariahualiu commented on the issue: https://github.com/apache/spark/pull/17854 @squito yes, I capped the number of resources in updateResourceRequests so that YarnAllocator asks for fewer resources in each iteration. When allocation fails one iteration
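
The capping idea in the comment above can be sketched roughly as follows. This is only an illustration of "ask for fewer resources in each iteration", not the code from the pull request: the function name plan_requests and its parameters missing and cap are hypothetical, and the sketch assumes every request is granted in full, which the comment suggests is not always the case.

```python
def plan_requests(missing, cap):
    """Yield how many containers to ask for in each allocation iteration.

    Hypothetical helper, not part of Spark's YarnAllocator.
    missing: number of executors still needed (assumed fully granted each time).
    cap:     maximum number of containers to request per iteration.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    while missing > 0:
        batch = min(missing, cap)  # never ask for more than the cap
        yield batch
        missing -= batch           # assumption: the whole batch was granted


# Example: 2500 executors requested at most 50 at a time.
batches = list(plan_requests(2500, 50))
print(len(batches), batches[0], sum(batches))
```

With a cap, the containers arrive spread over many iterations instead of in the first two or three requests, which is the behavior the comment describes wanting.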

[GitHub] spark pull request #17854: [SPARK-20564][Deploy] Reduce massive executor fai...

2017-05-03 Thread mariahualiu
GitHub user mariahualiu opened a pull request: https://github.com/apache/spark/pull/17854 [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000) ## What changes were proposed in this pull request? In applications that use over 2