[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> Also, does the issue here only arise when the shuffle service is enabled? That is correct. For case, when shuffle service is not enabled, this change should be a no-op. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17088 Can you update the JIRA and PR description to say "un-register the output locations" (or similar) instead of "remove the files"? The current description is misleading since nothing is actually getting removed from the machine -- instead you're just updating the scheduler's state. Also, does the issue here only arise when the shuffle service is enabled? It seems like when the shuffle service is not being used, DAGScheduler.handleExecutorLost should be called for each lost executors (i.e., for all of the executors on a host, when a whole host is lost), which will correctly un-register all of the output on that executor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/17088 Even if I completely agreed that removing all of the shuffle files on a host was the correct design choice, I'd still be hesitant to merge this right now. That is simply because we have recently merged other changes related to fetch failure handling, and I'd really like to see some more time pass with just those changes in the code before we introduce more fetch failure changes. I don't want to get in the situation of merging this PR then getting reports of fetch failure bugs in master a week later, and not knowing whether to place the blame on this PR or the other recent fetch failure changes. That needn't preclude more discussion of this PR or possibly merging it after we have a little more experience and confidence with the code already in master. /cc @kayousterhout @squito --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> This is quite drastic for a fetch failure : spark already has mechanisms in place to detect executor/host failure - which take care of these failure modes. Unfortunately, mechanisms already in place are not sufficient. Let's imagine a situation where the shuffle service become unresponsive or OOMs, in that case, we will not see any host failure, still the driver will receive fetch failure. Current model assumes all shuffle output for an executor is lost, however, since the shuffle service serves all the shuffle files on that host, we should mark all the shuffle files on that host as unavailable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/17088 I agree with @mridulm, file fetch failure does not imply the executor down or all the executor of the host down. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17088 >> fetch failure does not imply lost executor - it could be a transient issue. Similarly, executor loss does not imply host loss. You are right, it could be transient, but we do have retries on the shuffle client to detect transient failure. In case driver receives a fetch failure, we always assume that the output is lost. The current model assumes the output is lost for a particular executor, which makes sense only if the shuffle service is disabled and the executors are serving the shuffle files themselves. However, in case the external shuffle service is enabled, a fetch failure means all output on that host should be marked unavailable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17088 fetch failure does not imply lost executor - it could be a transient issue. Similarly, executor loss does not imply host loss. This is quite drastic for a fetch failure : spark already has mechanisms in place to detect executor/host failure - which take care of these failure modes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #73533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73533/testReport)** for PR 17088 at commit [`74ca88b`](https://github.com/apache/spark/commit/74ca88bc1d2b67cc12ea32a3cd344ec0259500a9). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org