Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
+1
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
Github user squito commented on the issue:
https://github.com/apache/spark/pull/18874
@srowen you have a good point about a case that becomes worse after this
change. Still I think this change is better on balance.
btw, there are even more odd cases with dynamic
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18874
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18874
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80645/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18874
**[Test build #80645 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80645/testReport)**
for PR 18874 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18874
**[Test build #80645 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80645/testReport)**
for PR 18874 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
The minimum count is still needed; it's needed between stages when the number of tasks goes below the minimum count. It's either going to keep the minimum number of executors or enough executors to
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
Seems not-unreasonable to me given the current problem statement. It does
solve the possible problem of 0 executors, and then some.
The possible impact to a normal app is like: run a
Github user squito commented on the issue:
https://github.com/apache/spark/pull/18874
This change makes sense to me.
I think Tom's last comment, about resetting that timeout every time one task is scheduled, explains how you get into this situation and why you don't actually
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
so I think the issue with the locality is that the scheduler resets the timer (the 3s wait) whenever it schedules any task at that particular locality level (in this case node local) on any node. So it can take
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/18874
I think the fix makes sense; the part that is not clear is why this is
happening, since the default locality timeout is 3s and the default executor
idle timeout is 60s, so they really shouldn't
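For reference, these are the two stock defaults being compared here, as they would appear in `conf/spark-defaults.conf`:

```
spark.locality.wait                           3s
spark.dynamicAllocation.executorIdleTimeout   60s
```

With these defaults the locality wait should expire twenty times over before an executor is ever considered idle, which is why the observed behavior was surprising.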
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
Also note that I would like to investigate making the locality logic in the scheduler better, as I don't think it should take 60+ seconds for it to fall back to using a node at the rack-local level.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
To answer a few of your last questions.
It doesn't hurt the common case; in the common case all your executors have tasks on them as long as there are tasks to run. Normally the scheduler can
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
I've updated the description in
https://issues.apache.org/jira/browse/SPARK-21656 to join all my comments here
together, hopefully that clarifies it.
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
@tgravescs that's actually progress. You're no longer saying that the goal
is to keep a few executors around just in case
(https://issues.apache.org/jira/browse/SPARK-21656) or that the problem is
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/18874
I think the current fix is a feasible and simple solution for the scenarios mentioned above. As far as I understand from the comments above, ideally this problem should not happen, but in a
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
I suggest you go understand the code.
I've already explained this multiple times. You get 0 executors because of delays when an executor doesn't have a task scheduled. Say you
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
How do you reach 0 executors when there is still a task to schedule? That, if anything, is the bug, but it isn't what's contemplated here, so I'm confused.
I disagree; the rest of your scenarios
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
I'm saying you have a stage running that has > 0 tasks to run. If dynamic allocation has already got all the executors it originally thought it needed, and they all idle timeout, then you have 0 executors
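Tom's scenario can be sketched with a toy model (hypothetical Python, not the Spark code; `surviving_executors` and the parameter names are invented for illustration): the idle-timeout sweep never consults the pending task count, so every executor can be reclaimed while work remains:

```python
IDLE_TIMEOUT = 60.0  # models spark.dynamicAllocation.executorIdleTimeout

def surviving_executors(idle_seconds, pending_tasks):
    """Model of the idle-timeout sweep: an executor survives only if it
    has been idle for less than the timeout. Note that pending_tasks is
    never consulted -- that is the gap under discussion."""
    return [idle for idle in idle_seconds if idle < IDLE_TIMEOUT]

# Four executors, all idle past 60s while 10 tasks still wait to run:
left = surviving_executors([75.0, 61.0, 90.0, 120.0], pending_tasks=10)
print(len(left))  # prints 0 -- no executors left, tasks still pending
```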
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
Why is 0 executors a 'deadlock'? If there is no work to do, 0 executors is fine. If there is work to do, of course, at least 1 executor should not time out. Is that what you're claiming happens?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
There is nothing in the code stopping you from idle-timing out all of your executors; thus executors are 0 and you deadlock. 0 executors = deadlock = definite bug. We definitely
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
Going to 0 executors is not a bug if you set the min to 0. A deadlock is a
bug. But, nothing in the JIRA or here suggests there's a deadlock -- what do
you mean?
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
Is going to 0 executors and allowing a deadlock a bug?
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
That is correct behavior, as defined by the idle timeout and the min number
of executors, which are already configured. I do not understand why going to
the small number that the config explicitly
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
The bug is that, with the idle timeout, the number of executors can go to a very small number, even zero, and we never look back to make sure that doesn't happen.
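A minimal sketch of the remedy being argued for (illustrative Python; `new_target`, `tasks_per_executor`, and the other names are invented, not Spark's): lower-bound the executor target by what the outstanding tasks actually need, as well as by the configured minimum:

```python
import math

def new_target(pending_tasks, running_tasks, tasks_per_executor,
               min_executors):
    """Executor target that 'looks back': never below the configured
    minimum, and never below what the outstanding tasks require."""
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, needed)

print(new_target(10, 0, 4, 0))  # prints 3 -- never 0 while tasks remain
print(new_target(0, 0, 4, 0))   # prints 0 -- fine when there is no work
```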
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18874
Doesn't this make the 'target' effectively the minimum?
As I say on the JIRA I still do not see a behavior that needs fixing here.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
@yoonlee95 please update with unit tests
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18874
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80360/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18874
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18874
**[Test build #80360 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80360/testReport)**
for PR 18874 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18874
**[Test build #80360 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80360/testReport)**
for PR 18874 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18874
ok to test
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18874
Can one of the admins verify this patch?
---