[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-28 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/17088
  
>> Also, does the issue here only arise when the shuffle service is enabled?

That is correct. For case, when shuffle service is not enabled, this change 
should be a no-op.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-28 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/17088
  
Can you update the JIRA and PR description to say "un-register the output 
locations" (or similar) instead of "remove the files"?  The current description 
is misleading since nothing is actually getting removed from the machine -- 
instead you're just updating the scheduler's state.

Also, does the issue here only arise when the shuffle service is enabled?  
It seems like when the shuffle service is not being used, 
DAGScheduler.handleExecutorLost should be called for each lost executors (i.e., 
for all of the executors on a host, when a whole host is lost), which will 
correctly un-register all of the output on that executor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/17088
  
Even if I completely agreed that removing all of the shuffle files on a 
host was the correct design choice, I'd still be hesitant to merge this right 
now. That is simply because we have recently merged other changes related to 
fetch failure handling, and I'd really like to see some more time pass with 
just those changes in the code before we introduce more fetch failure changes. 
I don't want to get in the situation of merging this PR then getting reports of 
fetch failure bugs in master a week later, and not knowing whether to place the 
blame on this PR or the other recent fetch failure changes.

That needn't preclude more discussion of this PR or possibly merging it 
after we have a little more experience and confidence with the code already in 
master.

/cc @kayousterhout  @squito 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/17088
  
>> This is quite drastic for a fetch failure : spark already has mechanisms 
in place to detect executor/host failure - which take care of these failure 
modes.

Unfortunately,  mechanisms already in place are not sufficient. Let's 
imagine a situation where the shuffle service become unresponsive or OOMs, in 
that case, we will not see any host failure, still the driver will receive 
fetch failure. Current model assumes all shuffle output for an executor is 
lost, however, since the shuffle service serves all the shuffle files on that 
host, we should mark all the shuffle files on that host as unavailable. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17088
  
I agree with @mridulm, file fetch failure does not imply the executor down 
or all the executor of the host down.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/17088
  
>> fetch failure does not imply lost executor - it could be a transient 
issue.
Similarly, executor loss does not imply host loss.

You are right, it could be transient, but we do have retries on the shuffle 
client to detect transient failure. In case driver receives a fetch failure, we 
always assume that the output is lost. The current model assumes the output is 
lost for a particular executor, which makes sense only if the shuffle service 
is disabled and the executors are serving the shuffle files themselves. 
However,  in case the external shuffle service is enabled, a fetch failure 
means all output on that host should be marked unavailable. 





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/17088
  
fetch failure does not imply lost executor - it could be a transient issue.
Similarly, executor loss does not imply host loss.

This is quite drastic for a fetch failure : spark already has mechanisms in 
place to detect executor/host failure - which take care of these failure modes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17088
  
**[Test build #73533 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73533/testReport)**
 for PR 17088 at commit 
[`74ca88b`](https://github.com/apache/spark/commit/74ca88bc1d2b67cc12ea32a3cd344ec0259500a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org