Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/17854
@mariahualiu do you plan to address any of the feedback here? If not, this
should probably be closed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/17854
The reason why `spark.yarn.containerLauncherMaxThreads` does not work here
is that it only controls how many threads simultaneously send a container
start command to YARN; that is usually a much
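The throttling effect of a launcher thread pool can be sketched with a plain fixed-size pool (a simplified illustration with made-up names, not Spark's actual container-launcher code): the pool size caps how many "start container" calls are in flight at once, but nothing else.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LauncherPoolSketch {
  // Stand-in for spark.yarn.containerLauncherMaxThreads.
  val maxThreads = 4
  val pool = ExecutionContext.fromExecutorService(
    Executors.newFixedThreadPool(maxThreads))

  val inFlight = new AtomicInteger(0)
  @volatile var peak = 0

  // Stand-in for the "start container" command sent to a NodeManager.
  def launchContainer(id: Int): Unit = {
    val now = inFlight.incrementAndGet()
    synchronized { peak = math.max(peak, now) } // record peak concurrency
    Thread.sleep(10)                            // pretend RPC latency
    inFlight.decrementAndGet()
  }

  def run(): Int = {
    implicit val ec: ExecutionContext = pool
    val futures = (1 to 20).map(i => Future(launchContainer(i)))
    futures.foreach(f => Await.result(f, 10.seconds))
    pool.shutdown()
    peak // bounded by maxThreads: the pool throttles only the launch calls
  }

  def main(args: Array[String]): Unit =
    println(s"peak concurrent launches = ${run()}")
}
```

The pool bounds concurrent launch commands, which is why shrinking it slows launching down, but it does not throttle how many containers YARN allocates in the first place.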
Github user foxish commented on the issue:
https://github.com/apache/spark/pull/17854
In Kubernetes/Spark, we see fairly similar behavior in the scenario
described. When simultaneous container launching is not throttled, it is
capable of DoS-ing the system. Our solution so far is
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
@tgravescs I used the default spark.network.timeout (120s). When an
executor cannot connect to the driver, here is the executor log:
17/05/01 11:18:25 INFO [main] spark.SecurityManager:
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17854
> It took 3~4 minutes to start an executor on an NM (most of the time was
spent on container localization: downloading the Spark jar, the application
jar, etc. from the HDFS staging folder).
I
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
also what is the exact error/stack trace you see when you say "failed to
connect"?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
what is your network timeout (spark.network.timeout) set to?
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
Now I can comfortably use 2500 executors. But when I pushed the executor
count to 3000, I saw a lot of heartbeat timeout errors. That is something else
we can improve, probably in another JIRA.
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
I re-ran the same application adding these configurations "--conf
spark.yarn.scheduler.heartbeat.interval-ms=15000 --conf
spark.yarn.launchContainer.count.simultaneously=50". Though it took 50
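For reference, the same settings could be applied programmatically (a sketch assuming spark-core on the classpath; `spark.yarn.launchContainer.count.simultaneously` is the configuration proposed by this PR, not an existing Spark setting):

```scala
import org.apache.spark.SparkConf

// Equivalent to the --conf flags above.
val conf = new SparkConf()
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "15000") // slower AM allocation heartbeats
  .set("spark.yarn.launchContainer.count.simultaneously", "50") // cap launches per iteration (this PR)
```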
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
Let me describe what I've seen when using 2500 executors.
1. In the first few (2~3) requests, the AM received all (in this case 2500)
containers from YARN.
2. In a few seconds, 2500
Github user mariahualiu commented on the issue:
https://github.com/apache/spark/pull/17854
@squito yes, I capped the number of resources in updateResourceRequests so
that YarnAllocator asks for fewer resources in each iteration. When
allocation fails in one iteration, the
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/17854
> Although looking at it maybe I'm missing how it's supposed to handle
network failure?
Spark has never really handled network failure. If the connection between
the driver and the executor
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
> If that's what you mean, there's no need for retrying. No RPC calls retry
anymore. See #16503 (comment) for an explanation.
I see, I guess with the way we have the RPC implemented it
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/17854
What do you mean by "not retrying"? Do you mean this line:
```
ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores,
extractLogUrls))
```
If that's what you
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
I took a quick look at the registerExecutor call in
CoarseGrainedExecutorBackend and it's not retrying at all. We should change
that to retry. We retry heartbeats and many other things, so it
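A retry along those lines could look something like this generic helper (a hedged sketch with made-up names, not actual Spark code; a real fix would wrap the `ref.ask[Boolean](RegisterExecutor(...))` call and would have to respect the RPC timeout and issue a fresh ask per attempt, since asks themselves no longer retry):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RegisterRetrySketch {
  // Retry a by-name RPC call up to `attempts` times with doubling backoff.
  @tailrec
  def retry[T](attempts: Int, backoffMs: Long)(rpc: => T): T =
    Try(rpc) match {
      case Success(v) => v
      case Failure(_) if attempts > 1 =>
        Thread.sleep(backoffMs)
        retry(attempts - 1, backoffMs * 2)(rpc)
      case Failure(e) => throw e // out of attempts: propagate the last error
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // Simulated registration that fails twice before succeeding.
    val ok = retry(attempts = 5, backoffMs = 1) {
      calls += 1
      if (calls < 3) sys.error("connection refused")
      true
    }
    println(s"registered=$ok after $calls attempts")
  }
}
```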
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
To slow down launching you could just set
spark.yarn.containerLauncherMaxThreads to be smaller. That isn't guaranteed,
but neither is this, really. Just an alternative, or something you can do
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17854
also cc @tgravescs @vanzin
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17854
It looks to me like this is actually making 2 behavior changes:
1) throttle the requests for new containers, as you describe in your
description
2) drop newly received containers if they
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17854
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76494/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17854
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17854
**[Test build #76494 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76494/testReport)**
for PR 17854 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17854
**[Test build #76494 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76494/testReport)**
for PR 17854 at commit
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17854
Jenkins, ok to test
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17854
Can one of the admins verify this patch?