[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17088 gentle ping @sitalkedia --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 Filed a JIRA SPARK-20091 to allow running multiple concurrent attempts of a stage.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75184/ Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #75184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75184/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> The ResultTasks (let's call them reducers), say in Stage 1.0, are running. One of them gets a fetchFailure. This restarts the ShuffleMapTasks for that executor in Stage 0.1. If during the time

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #75184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75184/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-22 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17088 > (a) even the existing behavior will make you do unnecessary work for transient failures and (b) this just slightly increases the amount of work that has to be repeated for those transient

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74771/ Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74771 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74771/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74771/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 Jenkins, retest this please.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74768/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74768/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread kayousterhout
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17088 Ok that makes sense. I wanted to make sure that there wasn't some bug in SlaveLost (which might lead to a simpler fix than this) but @squito's description makes it clear that there are a

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 +1 on that. In our case, we do not see the SlaveLost message in most cases, and even when we do, it is delayed and we receive the fetch failure before it. So, as @squito pointed out we

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17088 @kayousterhout I don't think https://github.com/apache/spark/pull/14931 is really a complete answer to this. (a) we only get that from standalone mode, no other cluster managers (yarn does

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread kayousterhout
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17088 One meta question here: why aren't we getting a SlaveLost message in this case? I'm asking since there's already code in #14931 to un-register shuffle service files when we get a SlaveLost

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-17 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17088 One thing which I noticed while making sense of what was going on in the code (even before) -- IIRC, Spark standalone is a bit of a special case. I think it used to be the case that to run multiple

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74710/ Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74710/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74710/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74694/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74694/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74694/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74691/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74691/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74691/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-15 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> It's clear to me there is an important reason why users would want a higher limit, so let's make it a config. @squito - I already have a PR (very old) to do that - can you take a look

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-15 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/17088 First, I think we should change the hard-coded limit of 4 stage retries. It's clear to me there is an important reason why users would want a higher limit, so let's make it a config. That is a very

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-07 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> Rolling upgrades can take longer than 15 seconds to restart NMs. You can have intermittent issues that last > 1 minute. If it took 1 hour to generate that output I want it to retry really

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-06 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17088 Note that, alternatively, we could change it to not fail on fetch failure. This would seem better to me since there is no reason to throw away all the work you have done, but I'm sure that is a much

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-06 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17088 In this particular case, are your map tasks fast or slow? If they are really fast, rerunning everything now makes sense; if each of those took 1 hour+ to run, failing all when they don't need

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-02 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 @tgravescs - I agree this might cause additional work in situations where the shuffle fetch failure is transient, like you mentioned above. But in those cases, IMO, users should tune the shuffle retry

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-02 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17088 FYI, this is somewhat related to https://github.com/apache/spark/pull/17113 . I mention it because I think both depend on how we handle failures and retries. This and that together could cause

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17088 +CC @tgravescs You might be interested in this given your comments on the blacklisting PR.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-03-01 Thread kayousterhout
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17088 Can you please file a JIRA for the flaky Jenkins failure?

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73664/ Test PASSed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73664/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73664/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 Jenkins retest this please.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73640/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73638/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73636/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73638/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73636/ Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test FAILed.

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73638/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread sitalkedia
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> Why is this a no-op when the shuffle service isn't enabled? It looks like you mark the slave as lost in all cases? @kayousterhout - You are right. It's kind of confusing that we are

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73636/testReport)** for PR 17088 at commit

[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...

2017-02-28 Thread kayousterhout
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17088 Why is this a no-op when the shuffle service isn't enabled? It looks like you mark the slave as lost in all cases?