Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/17088
gentle ping @sitalkedia
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
Filed a JIRA SPARK-20091 to allow running multiple concurrent attempts of a
stage.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75184/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #75184 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75184/testReport)**
for PR 17088 at commit
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> The ResultTasks (let's call them reducers), say in Stage 1.0, are running.
One of them gets a fetchFailure. This restarts the ShuffleMapTasks for that
executor in Stage 0.1. If during the time
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #75184 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75184/testReport)**
for PR 17088 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
> (a) even the existing behavior will make you do unnecessary work for
transient failures and (b) this just slightly increases the amount of work that
has to be repeated for those transient
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74771/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74771 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74771/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74771 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74771/testReport)**
for PR 17088 at commit
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
jenkins retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74768/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74768 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74768 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)**
for PR 17088 at commit
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/17088
Ok that makes sense. I wanted to make sure that there wasn't some bug in
SlaveLost (which might lead to a simpler fix than this) but @squito's
description makes it clear that there are a
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
+1 on that. In our case, we are not seeing the SlaveLost message in most
cases, and even when we do, it is delayed and we receive the fetch failure before
that. So, as @squito pointed out we
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17088
@kayousterhout I don't think https://github.com/apache/spark/pull/14931 is
really a complete answer to this.
(a) we only get that in standalone mode, not with other cluster managers (YARN
does
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/17088
One meta question here: why aren't we getting a SlaveLost message in this
case? I'm asking since there's already code in #14931 to un-register shuffle
service files when we get a SlaveLost
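
To make the idea in this comment concrete, here is a minimal sketch of what un-registering shuffle output for a lost host could look like. It is only an illustration under stated assumptions: `MapOutputRegistry`, `register`, and `unregister_host` are hypothetical names and do not reflect Spark's actual internals or the code in #14931.

```python
class MapOutputRegistry:
    """Hypothetical registry mapping shuffle_id -> {map_id: host}."""

    def __init__(self):
        self.outputs = {}

    def register(self, shuffle_id, map_id, host):
        # Remember which host holds the output of a given map task.
        self.outputs.setdefault(shuffle_id, {})[map_id] = host

    def unregister_host(self, host):
        """Drop every output stored on `host`.

        Returns {shuffle_id: sorted list of map ids that must be recomputed}.
        """
        missing = {}
        for shuffle_id, by_map in self.outputs.items():
            lost = sorted(m for m, h in by_map.items() if h == host)
            for map_id in lost:
                del by_map[map_id]
            if lost:
                missing[shuffle_id] = lost
        return missing
```

In this sketch the same call could be triggered either by a SlaveLost-style notification or by a fetch failure, which is the distinction the surrounding comments are debating.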
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17088
One thing which I noticed while making sense of what was going on in the code
(even before) -- IIRC, spark standalone is a bit of a special case. I think it
used to be the case that to run multiple
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74710/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74710 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74710/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74710 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74710/testReport)**
for PR 17088 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74694/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74694 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74694/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74694 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74694/testReport)**
for PR 17088 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74691/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74691 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74691/testReport)**
for PR 17088 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #74691 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74691/testReport)**
for PR 17088 at commit
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> It's clear to me there is an important reason why users would want a
higher limit, so let's make it a config.
@squito - I already have a PR (very old) to do that - can you take a look
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17088
First, I think we should change the hard-coded limit of 4 stage retries.
It's clear to me there is an important reason why users would want a higher
limit, so let's make it a config. That is a very
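
As a rough sketch of the suggestion above, a retry limit read from configuration, with the hard-coded value of 4 kept as the default, might look like the following. The class name and the `stage.maxAttempts` key are assumptions made for illustration; they are not taken from the PR and are not real Spark settings.

```python
from collections import defaultdict

DEFAULT_MAX_STAGE_ATTEMPTS = 4  # the hard-coded limit mentioned in the thread


class StageRetryTracker:
    """Hypothetical helper counting failed attempts per stage."""

    def __init__(self, conf=None):
        conf = conf or {}
        # "stage.maxAttempts" is an illustrative key, not a real Spark setting.
        self.max_attempts = int(
            conf.get("stage.maxAttempts", DEFAULT_MAX_STAGE_ATTEMPTS)
        )
        self.failed_attempts = defaultdict(int)

    def record_fetch_failure(self, stage_id):
        """Record one failed attempt; return True if the stage may be retried."""
        self.failed_attempts[stage_id] += 1
        return self.failed_attempts[stage_id] < self.max_attempts

    def record_success(self, stage_id):
        # A successful attempt resets the failure count for that stage.
        self.failed_attempts.pop(stage_id, None)
```

With no configuration the tracker gives up on the fourth failed attempt; passing `{"stage.maxAttempts": 8}` raises that threshold without touching the scheduler logic.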
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> Rolling upgrades can take longer than 15 seconds to restart NMs. You can
have intermittent issues that last > 1 minute. If it took 1 hour to generate
that output I want it to retry really
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
Note: alternatively, we could change it to not fail on fetch failure. This
would seem better to me, since there is no reason to throw away all the work you
have done, but I'm sure that is a much
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
In this particular case, are your map tasks fast or slow? If they are really
fast, rerunning everything now makes sense; if each of those took 1 hour+ to
run, failing all when they don't need
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
@tgravescs - I agree this might cause additional work in situations where
the shuffle fetch failure is transient, like you mentioned above. But in those cases, IMO,
users should tune the shuffle retry
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
FYI, this is somewhat related to https://github.com/apache/spark/pull/17113
I mention it because I think both depend on how we handle failures and
retries. This and that together could cause
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/17088
+CC @tgravescs You might be interested in this given your comments on
the blacklisting PR.
---
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/17088
Can you please file a JIRA for the flaky jenkins failure?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73664/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73664 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73664/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73664 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73664/testReport)**
for PR 17088 at commit
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
Jenkins retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73640/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73640 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73640 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport)**
for PR 17088 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73638/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73636 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73636/testReport)**
for PR 17088 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73638 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73638/testReport)**
for PR 17088 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73636/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17088
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73638 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73638/testReport)**
for PR 17088 at commit
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> Why is this a no-op when the shuffle service isn't enabled? It looks
like you mark the slave as lost in all cases?
@kayousterhout - You are right. It's kind of confusing that we are
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17088
**[Test build #73636 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73636/testReport)**
for PR 17088 at commit
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/17088
Why is this a no-op when the shuffle service isn't enabled? It looks like
you mark the slave as lost in all cases?
---