[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
Awesome, nice work!! Exciting to see this in!  Let me know when the other 
component, which blacklists across different stages, is ready for review.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
merged to master, thanks everyone





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66822/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66822 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66822/consoleFull)**
 for PR 15249 at commit 
[`4501e6c`](https://github.com/apache/spark/commit/4501e6c089f99f2cc62443cca668f77fea2745aa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66822 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66822/consoleFull)**
 for PR 15249 at commit 
[`4501e6c`](https://github.com/apache/spark/commit/4501e6c089f99f2cc62443cca668f77fea2745aa).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66820/
Test FAILed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66820 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66820/consoleFull)**
 for PR 15249 at commit 
[`445cc97`](https://github.com/apache/spark/commit/445cc9700e05c9197577bbefa34795400736d0b0).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66820/consoleFull)**
 for PR 15249 at commit 
[`445cc97`](https://github.com/apache/spark/commit/445cc9700e05c9197577bbefa34795400736d0b0).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-11 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout good idea about running performance tests, I hadn't run them 
on a recent rev.  I confirmed that the issue in 
https://github.com/apache/spark/pull/14871 was no longer present (just to be 
sure, I also ran a test where I re-introduced the issue, and did see a big drop 
in performance).

I agree with your assessments of the current situation, and appreciate you 
driving this forward.  I'll send an email to the dev list shortly -- please add 
anything I may have overlooked.

thanks!





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #3 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/consoleFull)**
 for PR 15249 at commit 
[`c805a0b`](https://github.com/apache/spark/commit/c805a0ba5b2d90062b04d043ffdaa2dda559e136).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #3 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/consoleFull)**
 for PR 15249 at commit 
[`c805a0b`](https://github.com/apache/spark/commit/c805a0ba5b2d90062b04d043ffdaa2dda559e136).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-07 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
As I mentioned before, this is definitely a huge step in the right 
direction!

Having said that, I want to ensure we don't aggressively blacklist executors 
and nodes - at scale, I have seen enough tasks fail which are completely 
recoverable on retry. I don't have access to those jobs or infra anymore, so 
unfortunately I can't do a validation run.

If the consensus is that we can risk the change and get broader testing to 
iron out the kinks, I am fine with that.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-07 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
@tgravescs @mridulm To avoid being stuck in analysis paralysis for this 
feature, I'd propose the following:

(1) We merge this PR.  I think we're mostly in agreement that the behavior 
here is, for the most part, a big step in the right direction.  There are some 
cases Mridul brought up where there is a concern about being too eager in 
permanently blacklisting hosts / executors, but it seems like we don't know if 
(/ don't think!) the scenarios where this is problematic are very common.

(2) Someone (Imran?) email the dev list describing the new functionality 
(and change from the old behavior) and asking for feedback from folks who are 
running large clusters where failures are more common.  I think having folks 
try this out will be more helpful than us guessing when it will and won't be 
useful.  We can use the feedback to refine the behavior and decide which of the 
many proposed improvements are most important.

We've been discussing pros and cons of this approach for quite a while, and 
I don't think we're going to get to a point where we think the approach is 
*perfect*.  I think we should expect to iterate on this more in the future as 
folks use it.

Thoughts?





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-07 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15249
  
The question to me comes down to how many temporary resource issues you 
expect, and how often. At some point, if it's just from that much skew, you 
should probably fix your configs, and it would be by luck whether you succeed 
or not (based on other tasks finishing in the executor) before you hit the max 
task failures.  If it's a transient network issue, a retry could work; if it's 
a long-lived network issue, I expect a lot of things to fail.

Unfortunately I don't have enough data on how often this happens or what 
exactly happens on Spark to know what would definitely help.  I know on MR that 
allowing multiple attempts on the same node sometimes works for at least some 
of these temporary conditions.  I would think the same would apply to Spark, 
although as mentioned the difference might be in the container re-launch time 
vs. just sending a task retry. Perhaps we should just change the defaults to 
allow more than one task attempt per executor and accordingly increase 
maxTaskAttemptsPerNode.  Then let's run with it and see what happens, so we can 
get some more data and enhance from there. If we want to be paranoid, leave the 
existing blacklisting functionality in as a fallback in case the new one 
doesn't work.

Also, if we are worried about resource limits, i.e. we blacklist too many 
executors/nodes for too long, then perhaps we should add in a fail-safe like a 
max % blacklisted. Once you hit that percentage, you don't blacklist anymore.
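
A rough sketch of what that suggested configuration might look like. The two 
spark.blacklist.task.* keys are the ones this PR introduces and the values are only 
illustrative; the "max percent blacklisted" fail-safe is just an idea from this 
comment, so the commented-out key is hypothetical and does not exist in Spark:

```scala
import org.apache.spark.SparkConf

// spark-shell-style sketch of the defaults suggested above: allow a second
// attempt on the same executor, and raise the per-node limit to match.
val conf = new SparkConf()
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "3")
// hypothetical fail-safe, not an actual Spark config:
// .set("spark.blacklist.maxBlacklistedResourcePercent", "50")
```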





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@mridulm we had considered that approach earlier on as well -- I don't 
think it works, because you can also have resources which are not totally 
broken, but are flaky for a long period of time.  The simplest example is one 
bad disk out of many; some tasks may succeed though a bunch will fail.  I've 
seen users hit this.  But it could be even more nuanced, e.g. a bad sector, a 
flaky network connection, etc.

In those cases, it's intentional that in this implementation, one success 
does *not* un-blacklist anything.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  

Thinking more, and based on what @squito mentioned, I was considering the 
following:

Since we are primarily dealing with executors or nodes which are 'bad', as 
opposed to recoverable failures due to resource contention, prevention of the 
degenerate corner cases which the existing blacklist is for, etc.:

Can we assume a successful task execution on a node implies a healthy node?
What about at the executor level?

The proposal is to keep the PR as is for the most part, but (roughly sketched 
below):
- Clear nodeToExecsWithFailures when a task on a node succeeds. Same for 
nodeToBlacklistedTaskIndexes.
- Not sure if we want to reset execToFailures for an executor (not clearing 
would imply we are handling the resource starvation case implicitly, imo).
- If possible, allow speculative tasks to be scheduled on blacklisted 
nodes/executors if countTowardsTaskFailures can be overridden to false in those 
cases (if not, ignore this, since it will count towards the number of failures 
per app).

The rationale behind this is that successful tasks indicate past failures 
were not indicative of bad nodes/executors, but rather transient failures. And 
speculative tasks also sort of work as probe tasks to determine whether the 
node/executor has recovered and is healthy.
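
A minimal sketch of the first two bullets above, with simplified types. The map 
names mirror those mentioned in the proposal, but the class itself is hypothetical 
and is not the PR's actual taskset-blacklist implementation:

```scala
import scala.collection.mutable

class ClearOnSuccessSketch {
  private val execToFailures = mutable.Map[String, Int]().withDefaultValue(0)
  private val nodeToExecsWithFailures = mutable.Map[String, mutable.Set[String]]()
  private val nodeToBlacklistedTaskIndexes = mutable.Map[String, mutable.Set[Int]]()

  def onTaskFailure(host: String, exec: String, taskIndex: Int): Unit = {
    execToFailures(exec) += 1
    nodeToExecsWithFailures.getOrElseUpdate(host, mutable.Set.empty[String]) += exec
    nodeToBlacklistedTaskIndexes.getOrElseUpdate(host, mutable.Set.empty[Int]) += taskIndex
  }

  // A success on the node wipes the node-level failure state, treating earlier
  // failures there as transient. Per the second bullet, executor-level counts
  // are deliberately left untouched.
  def onTaskSuccess(host: String): Unit = {
    nodeToExecsWithFailures.remove(host)
    nodeToBlacklistedTaskIndexes.remove(host)
  }
}
```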

I hope I am not missing anything - any thoughts @squito, @kayousterhout, 
@tgravescs ?





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
If I understood the change correctly, a node can get blacklisted for a
taskset if enough (even different) tasks fail on the executors on it, which
can potentially cause all nodes to be blacklisted.

Or do you think this is a contrived scenario that can't occur in practice? I
don't have sufficient context for the motivating use cases/scenarios for this
feature.

On Oct 6, 2016 3:54 PM, "Kay Ousterhout"  wrote:

> @mridulm  re: job failures, can you elaborate
> on the job failure scenario you're concerned about?
>
> Jobs can only fail when some tasks are unschedulable, which can happen if
> a task is permanently blacklisted on all available nodes. This can only
> happen when the number of nodes is smaller than the maximum number of
> failures for a particular tax attempt, and also seems like it's very
> similar to existing behavior: currently, if a task is blacklisted (even
> though the blacklist is temporary) on all nodes, the job will be failed (
> https://github.com/apache/spark/blob/master/core/src/
> main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L595).
>






[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
@mridulm re: job failures, can you elaborate on the job failure scenario 
you're concerned about?

Jobs can only fail when some tasks are unschedulable, which can happen if a 
task is permanently blacklisted on all available nodes.  This can only happen 
when the number of nodes is smaller than the maximum number of failures for a 
particular task attempt, and it also seems very similar to existing 
behavior: currently, if a task is blacklisted (even though the blacklist is 
temporary) on all nodes, the job will be failed 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L595).
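
A simplified illustration of that condition, with made-up names and types (the 
real check lives in TaskSetManager, linked above): the job can only get stuck when 
some pending task is blacklisted on every node that currently has executors.

```scala
object CompletelyBlacklistedSketch {
  def isJobStuck(
      pendingTasks: Seq[Int],
      liveHosts: Set[String],
      blacklistedTasksPerHost: Map[String, Set[Int]]): Boolean = {
    // True only if some pending task cannot run on any live host.
    liveHosts.nonEmpty && pendingTasks.exists { taskIndex =>
      liveHosts.forall { host =>
        blacklistedTasksPerHost.getOrElse(host, Set.empty[Int]).contains(taskIndex)
      }
    }
  }
}
```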





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66462/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66462/consoleFull)**
 for PR 15249 at commit 
[`34eff27`](https://github.com/apache/spark/commit/34eff27bf25d80d4b6d8a31e7cbbadd2794d2e9c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
@squito I am hoping we _can_ remove the old code/functionality actually (it 
is clunky and very specific to the single-executor resource contention/shutdown 
use case - unfortunately common enough to warrant its introduction), and 
subsume it with a better design/impl - perhaps as part of your work (in this 
and other PRs).

@kayousterhout I believe my concern with (2) is that the blacklist is 
(currently) permanent for a task/taskset on an executor/node. For jobs running 
on a larger number of executors, this will perhaps not be too much of an issue 
(other than a degradation in performance); but as the executor/node count 
decreases, we increase the probability of job failures even if the transient 
failures are recoverable.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
to be clear, when I proposed leaving the old feature in place, my intent 
was *not* to make them interact nicely at all.  you wouldn't even be able to 
use the two features together.  The idea was just to not break old use-cases, 
if we decided it was really important to still support.

Definitely not ideal, but I think it would be OK just b/c the old feature 
wasn't documented at all.  We'd obviously need to follow up with the right fix 
for that use-case so that they were compatible.  But we wouldn't need to rush 
the "right" fix.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
@tgravescs no decision here yet.

@mridulm the main question for (2), though, is are the consequences a 
deal-breaker?  It doesn't seem disastrous if a task needs to run on a non-local 
machine instead of getting re-tried on a machine where it already failed but 
might succeed later on.  Also, it seems likely that the task has a higher 
probability of completing sooner if it runs on another machine compared to 
re-running (after a delay) on a machine where it already failed.  What are the 
situations you're most concerned about with the new approach?

If we leave the existing mechanism in, one concern (besides the additional 
complexity) is the interaction between the new host-level blacklisting and the 
old executor-level blacklisting.  There could be a scenario where the 
executor-level timeout keeps tasks from getting re-tried on the same executor 
for some period of time, so they run on other executors on the same host, which 
causes the host to be permanently blacklisted, so the fact that the executor 
blacklist would eventually re-allow the task is irrelevant.  I think we'd need 
to change the old executor blacklist timeout to be a host blacklist timeout for 
this to work.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15249
  
For preemption Spark is not counting those as task failures anymore. 

So I'm not sure if we decided on what to do. Are we leaving the old 
functionality as is or adding a new config for time between attempt retries or 
other?  





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
@tgravescs re (1): It was typically observed when YARN is killing the 
executor, usually when it ran over the memory limits (not sure if it was also 
happening during preemption).






[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout 

Agree with (1) - a permanent blacklist will effectively work the same way for 
executor shutdown.

Re (2) - A task failure is not necessarily due only to resource restriction 
or (1): it could also be a byzantine failure, interaction (not necessarily 
contention) with other tasks running on the executor/node, issues up/down the 
stack (particularly MT-safety), external library issues, etc.

If it is recoverable, then a timeout + retry will alleviate it without 
needing computation on a different executor/node.
If it is not recoverable (within a reasonable time), then the current logic of 
permanent blacklisting works.

Unfortunately, determining which case we are in is the problem. As @tgravescs 
mentioned, resource contention can be a long-lived issue as well at times.

Ideally, if the blacklist timeout is < the scheduler delay, then retry can 
help - if not, it depends on job characteristics (how many partitions, etc).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66462/consoleFull)**
 for PR 15249 at commit 
[`34eff27`](https://github.com/apache/spark/commit/34eff27bf25d80d4b6d8a31e7cbbadd2794d2e9c).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@tgravescs yeah there is an interaction w/ locality, but I think it can 
wait for a follow-up.  This was in the design doc in the follow-up section, 
though I didn't file a JIRA for it (a rough sketch follows the quote below).

> Delay-Scheduling takes blacklisting into account.  If all executors at 
the preferred level are blacklisted, don’t wait and immediately move on to 
the next level.
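
A hedged sketch of that follow-up idea, with hypothetical names rather than the 
scheduler's actual code: when every executor at the preferred locality level is 
blacklisted, skip the level instead of waiting out the locality delay.

```scala
object LocalitySkipSketch {
  sealed trait Level
  case object ProcessLocal extends Level
  case object NodeLocal extends Level
  case object RackLocal extends Level
  case object AnyLevel extends Level

  // Return the most-preferred level that still has a non-blacklisted executor.
  def firstUsableLevel(
      levelsInPreferenceOrder: Seq[Level],
      executorsAtLevel: Level => Set[String],
      blacklistedExecutors: Set[String]): Option[Level] = {
    levelsInPreferenceOrder.find { level =>
      (executorsAtLevel(level) -- blacklistedExecutors).nonEmpty
    }
  }
}
```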





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
@squito re: the config message, thanks for the long explanation.  That 
makes sense and I can't think of a better error message.  The current one is 
very clear in telling the user how to fix the issue, so even if the user 
doesn't totally understand the issue, it seems OK.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
I agree with Kay's summary above, just one addition.  For (2) Temporary 
Resource Contention & the approach in this PR -- perhaps it's obvious, but 
another consequence of this approach is that you lose resources for computing 
tasks, even if task locality was never a consideration.  One of your executors 
is temporarily in trouble, so it fails a bunch of tasks, and then gets 
blacklisted from the entire taskset.  10 seconds later, it's back to an OK 
state, but even if your taskset takes hours, you'd never take advantage of that 
executor again.

I think the situation w/ (1) & this PR is fine.

Also, I realized this was discussed somewhat in the design doc under the [Flaky 
Apps 
Section](https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit#heading=h.3yb336nr3vy1)
 (not that it adds much more than what we have discussed here).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15249
  
For (1): We definitely run multiple executors on a node on YARN, but I would 
certainly hope this isn't a huge issue.   Perhaps @mridulm can clarify when he 
was seeing this, but I would assume it is only if YARN kills the executor; 
otherwise, if Spark is shutting it down, there shouldn't be any tasks on it.






[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
I thought about this a little more and had some offline discussion with 
Imran.  @mridulm, I re-read all of your comments and it sounds like there are 
two issues that are addressed by the old blacklisting mechanism.  I've 
attempted to summarize the state of things here:

### (1) Executor Shutdown

**Problem**: Sometimes executors are in the process of being killed, and 
tasks get scheduled on them during this period.  It's bad if we don't do 
anything about this, because (e.g., due to locality) a task could repeatedly 
get re-scheduled on that executor, eventually causing the task to exhaust its 
max number of failures and the job to be aborted.

**Approach in current Spark**: A per-task blacklist avoids re-scheduling 
tasks on the same executor for a configurable period of time (by default, 0).

**Approach in this PR**: After a configurable number of failures (default 
1), the executor will be permanently blacklisted.  For the executor that's 
shutting down, any tasks run on it will be permanently blacklisted from that 
executor, and the executor may eventually be permanently blacklisted for the 
task set.  This seems like a non-issue since the executor is shutting down 
anyway, so at the executor level, this new approach works at least as well as 
the old approach.  

On a HOST level, the new approach is different: the failures on that 
executor will count towards the max number of failures on the host where the 
executor is/was running.  This could be problematic if there are other 
executors on the same host.  For example, if the max failures per host is 1, or 
multiple executors on one host have this shutdown issue, the entire host will 
be permanently blacklisted for the task set.  Does this seem like an issue to 
folks? I'm not very familiar with YARN's allocation policies, but it seems like 
if it's common to have many executors per host, a user would probably set max 
failures per host to be > max failures per executor.  In this case, the new 
host-level behavior is only problematic if (1) multiple executors on the host 
have this being-shutdown-issue AND (2) YARN allocates more executors on the 
host after that.

### (2) Temporary Resource Contention

**Problem**: Sometimes machines have temporary resource contention; e.g., 
disk or memory is temporarily full with data from another job.  If we don't do 
anything about this, tasks will repeatedly get re-scheduled on the bad executor 
(e.g., due to locality), eventually causing the task to exhaust its max number 
of failures and the job to be aborted.

**Approach in current Spark**: A per-task blacklist avoids re-scheduling 
tasks on the same executor for a configurable period of time (by default, 0).  
This allows tasks to eventually get a chance to use the executor again (but as 
@tgravescs pointed out, this timeout may be hard to configure, and needs to be 
balanced with the locality wait time, since if the timeout is > locality wait 
timeout, the task will probably get scheduled on a different machine anyway).

**Approach in this PR**: After a configurable number of attempts, tasks 
will be permanently blacklisted from the temporarily contended executor (or 
host) and will be run on a different machine, even though the task may succeed on 
the host later.  The biggest consequence of this seems to be that the task may 
be forced to run on a non-local machine (and it may need to wait for the 
locality wait timer to expire before being scheduled).

@mridulm are these issues summarized correctly above?  If so, can you 
elaborate on why the approach in this PR isn't sufficient?  I agree with Imran 
that, if the approach in this PR doesn't seem sufficient for these two cases, 
we should just leave in the old mechanism.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
Sorry something weird seems to have happened yesterday where github 
published half of my review! Anyway the rest is above.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15249
  
Actually my comment about locality wait makes me wonder if that should be 
taking blacklisting into account as well here, something I hadn't looked at 
closely before. There is no reason to wait for locality if it's blacklisted for 
longer than the wait time.  I guess that could be an optimization later.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/15249
  
Sorry I haven't followed this PR since it was split off the main one.  My 
response might be a bit split, as it's addressing various earlier comments; if 
something doesn't make sense, let me know.

> b) A few seconds to 10's of seconds is usually enough if the problem is 
due to memory or disk pressures.

Seems this would vary a lot. I've seen resource pressures last for hours 
rather than seconds.  Another application has all disks pegged on the node 
while it's doing a huge shuffle.  I'm not really sure on the memory side; I 
assume it was a skewed task and in this case other tasks finished and you just 
happened to have enough memory to finish now?  If this is the case, why not 
just run it on another executor or node anyway? Seems like the odds would be 
about the same. I guess if you had the locality wait high enough it might not 
try another executor first, or if you had a small enough number of executors it 
could be an issue.

Really the temporary resource thing kind of falls into what I was talking 
about in the design with allowing more than one task attempt failure per 
executor (which is why I wanted it configurable).  On MapReduce we have seen 
this, but on MR you generally have a few seconds anyway because it has to 
relaunch an entire JVM. So one option that seems like it's the same as the 
prior blacklisting would be to have 
spark.blacklist.task.maxTaskAttemptsPerExecutor > 1 and add an additional 
timeout between attempts, which would be basically the same as 
spark.scheduler.executorTaskBlacklistTime.  Thoughts?

I think blacklisting the executor is OK, especially in cases where you have a 
bad disk, because YARN should handle some of these cases for you, and if you 
create another executor on that node, it could give you a different list of 
disks leaving out the bad disk.  It could also have been a transient issue or 
the resource one @mridulm mentioned.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  

@squito I agree, you are right - even though the functionality is blacklisting 
in both cases, the problems each is trying to tackle are quite different:
transient issues (resource, shutdown, etc.) versus more permanent failure 
modes (bad memory, disk error, disk full, etc.).

You are correct that the existing config is trying to handle the former, and 
is suboptimal for the latter (actually, it is not designed to handle it). So 
the value a user will need to set, as you elaborated, if they are to use this 
config, is going to be fairly meaningless (and affects its ability to handle 
the former).

I think it is logical to split the implementation for the two use cases - with 
only the config namespace being shared (since it is all blacklisting from the 
user's point of view!).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout When an executor or node is shutting down, it is actually a 
driver-level issue (not just a taskset-level one), since all tasks would fail 
on executors when they are shutting down.
But if the issue is a transient resource issue, then other tasks in the 
taskset can succeed (skew in data, for example).

The primary motivation for adding the blacklist initially was the former - 
executor shutdown; but it got used to tackle the latter as well, when tasks 
fail due to resource issues caused by skew (and keep getting scheduled on the 
same executor due to the locality info).






[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66426/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66426 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66426/consoleFull)**
 for PR 15249 at commit 
[`354f36b`](https://github.com/apache/spark/commit/354f36bd36c7615883c08542eea333704e421164).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66426 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66426/consoleFull)**
 for PR 15249 at commit 
[`354f36b`](https://github.com/apache/spark/commit/354f36bd36c7615883c08542eea333704e421164).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout on the topic of that error msg in the config validation 
(sorry, github is being weird about letting me respond directly to your comment):

> I see -- I didn't realize executor-level blacklisting still works. I know 
we had a long prior discussion about this, but I'm now wondering if we should 
actually allow this setting? Or do you think (similar to the discussion with 
Mridul below) that it never makes sense to have only executor-level 
blacklisting?
>
> Sorry for dragging this on b/c I didn't understand the config!!

Well, the blacklisting will "work" in other settings, meaning executors 
could get blacklisted and tasks won't run there.  But I think a better way to 
think about it is -- what failure modes is Spark robust to?  The major 
motivation I see is making it safe to one bad disk on one bad node (of course 
it should also generalize to making it safe to `n` bad nodes).  I don't see 
much motivation for using this feature without even getting that level of 
safety, which is why I think the validation makes sense.  But that's not to say 
you are getting *nothing* from the feature without that level of safety.

To make this a little more concrete, consider what we need to tell users to 
get that level of safety with the current configuration.  They need to set:
* "spark.scheduler.executorTaskBlacklistTime" > the runtime of an entire 
taskset, so once there is a failure, the task never gets rescheduled on the 
same executor (in practice I tell users to just set it super high, e.g. 1 day, 
just to choose something).
* "spark.task.maxFailures" > the number of executors that will ever be on one 
node for the lifetime of your app.

The second condition is the really crummy part.  That means users have to 
update that setting as they reconfigure the executors for their job (users 
commonly play with executor sizes to see how it affects performance, e.g. one 
big executor with all the resources on a node, or more, smaller ones), and with 
dynamic allocation and executors coming and going on a node this becomes 
basically impossible.

Anyway the point is, if "spark.blacklist.task.maxTaskAttemptsPerNode" >= 
"spark.task.maxFailures", then you are back in this territory where it safety 
depends on how many executors you have per node and how they come and go.  In 
many cases it'll work just fine.  But we *know* there are perfectly natural 
deployments which won't be safe with that configuration, so I think it makes 
sense to fail-fast.  As you pointed out a while back, the user is probably 
unlikely to realize they are in this situation unless we tell them loudly.
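To make the kind of check I mean concrete, here is a minimal sketch of that 
fail-fast validation; the method name and the default values are illustrative 
assumptions, not the actual code in this PR:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch only: fail fast at startup if a single bad node could
// exhaust spark.task.maxFailures before node-level blacklisting ever kicks in.
def validateBlacklistConf(conf: SparkConf): Unit = {
  // defaults here are assumptions for the sketch
  val maxTaskFailures = conf.getInt("spark.task.maxFailures", 4)
  val maxAttemptsPerNode = conf.getInt("spark.blacklist.task.maxTaskAttemptsPerNode", 2)
  if (maxAttemptsPerNode >= maxTaskFailures) {
    throw new IllegalArgumentException(
      s"spark.blacklist.task.maxTaskAttemptsPerNode ($maxAttemptsPerNode) must be " +
        s"less than spark.task.maxFailures ($maxTaskFailures); otherwise one bad " +
        "node can fail an entire task set before it is ever blacklisted.")
  }
}
```

The check would run once when the scheduler is created, so a misconfigured app 
dies loudly at startup instead of silently losing the safety guarantee.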

Drawing this line at one bad node is somewhat arbitrary.  Users might 
really want to be safe against 2 bad nodes, in which case they'd need to think 
through the conditions for themselves.  Or maybe they're OK with even less.  
But it also seems like a pretty reasonable expectation.

I could also put the escape-hatch back in ...





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@mridulm on YARN's bad disk detection -- yes, you are right, it is a very 
rudimentary check for bad disks.  It really can't catch everything (and we've 
seen that in practice).  I was just pointing out at least one case where you 
know some executors will be good and some won't.  You certainly still need 
node-level blacklisting.

On the bigger topic of what to do about the timeouts -- I'm now thinking that 
we should really treat the legacy 
"spark.scheduler.executorTaskBlacklistTime" as orthogonal to the new 
"spark.blacklist.*".  The new feature is about dealing w/ resources that are 
bad for a long period of time (e.g., hardware failure).  The old feature was 
about trying to cope w/ resource contention.  I may have been using (abusing) 
it to deal w/ bad hardware, but that is only b/c it was the only thing there was.

Trying to shoe-horn the resource contention handling back into this at the 
11th hour might be a mistake.  Perhaps it makes the most sense to just leave 
the old feature in, beside this one.  It'll still be undocumented (and I'll 
remove the logic that ties the configs together), so it can still wait for a 
cleaner fix, but existing use cases aren't broken.  Maybe that cleaner fix is 
short timeouts for taskset-level blacklisting, or maybe it's something else 
entirely.  When I put the old feature back, I can update names & add comments 
to make this distinction clear.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
Re: executor blacklisting, one more reason I've heard for this (I think 
from Imran) is that tasks can fail on an executor because of memory pressure -- 
in which case the task may succeed on other executors that have fewer RDDs 
in memory. I'm inclined to keep it since it seems useful, and if someone 
doesn't care about it, they can always set the max executor failures equal to 
the max host failures (roughly sketched below).
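For concreteness, a minimal sketch of that "don't distinguish executors from 
hosts" setup; the per-executor config name here is an assumption following the 
`spark.blacklist.*` naming pattern in this PR, with 
`spark.blacklist.task.maxTaskAttemptsPerNode` being the setting already 
discussed above:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: make executor-level blacklisting effectively equivalent to
// node-level blacklisting by giving both the same threshold, so an executor is
// only blacklisted in situations where its node would be anyway.  The
// *PerExecutor name is an assumed counterpart of the *PerNode config above.
val conf = new SparkConf()
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
```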

@mridulm re: (a), it sounds like for this case, task-set-level 
blacklisting, as Imran suggested, is actually preferable to task-level 
blacklisting (since the issues you mentioned all seem like issues that would 
affect all tasks).  Is that correct?  Trying to understand the issue before 
figuring out how to fix it.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  

@squito Re: (a)

The earlier behavior was used fairly extensively in various properties at 
Yahoo (web search, groups, bigml, image search, etc.). It was added for a 
specific failure mode which used to happen quite commonly for larger jobs.
It was not documented so that it could be replaced with a more principled and 
general approach which subsumes it. Once the current PR is merged, the ability 
to run those (and similar) jobs will be affected.


Re: (b) 
I looked at YARN's bad disk detection - and if I am not wrong, it is 
limited to checking for the ability to create task directories at launch time? 
If yes, that might not be sufficient for our needs - and it is a good reason to 
keep your current logic in place and not make it node-level only.








[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
(a) right, this is a behavior change ... it seemed fair since the earlier 
behavior was undocumented, and I don't see a strong reason to maintain the 
exact same behavior as before.  I think it's fair for us to change behavior 
here, though we should try to support general use cases (as I was discussing 
above).  The timeout is not enforced whatsoever in this PR (the only reason 
it's here at all is that it was easier for me to pull those bits from the full 
change in here as well).

(b) executor blacklisting is a somewhat odd middle-ground, you're right.  
One motivating case comes from YARN's bad disk detection -- it'll exclude the 
bad disk from future containers, but not existing ones, so you can have one 
node with some good containers and some bad ones.  Admittedly this solution 
still isn't great in that case, since the default confs will lead to the entire 
node getting pushed into the blacklist with just 2 bad executors.  I've also 
seen executors behaving badly while others on the same node are fine, without 
any clear reason, so it's meant to handle these poorly understood cases.  
Admittedly, for the main goals, things would work fine if we only had 
blacklisting at the node level.

(c) -- yup, those changes are already in the larger PR.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15249
  
@squito Thanks for clarifying, that makes some of the choices clearer!

A few points to understand better:

a) The timeout does not seem to be enforced in this PR, which means it is not 
compatible with the earlier blacklisting we supported (which was primarily to 
handle executor shutdown, transient resource issues, etc.).


b) So if I am not wrong, is the motivation for also blacklisting at an executor 
level (instead of only at the node level) to handle cases where we have an 
executor using a resource (disk/gpu/etc.) which is 'broken'/full, causing tasks 
to fail only on that executor - while other executors on the node are fine?


c) If the node or executor is not seeing a transient resource issue, but is 
in a more permanent failure state, should we think about blacklisting it at a 
driver level?


c.1) A follow-up on that would be to push this info to yarn/mesos/standalone - 
to blacklist future acquisition of executors on that node - until it is 
'resolved' (via a timeout? some other notification?).

I might have missed some of the context behind the change (particularly 
given it was spun off from an earlier PR). Thanks!





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66394/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66394 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66394/consoleFull)**
 for PR 15249 at commit 
[`9086106`](https://github.com/apache/spark/commit/9086106fa0dfdce8358f50ea81c0e6f14ee3a85a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66394 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66394/consoleFull)**
 for PR 15249 at commit 
[`9086106`](https://github.com/apache/spark/commit/9086106fa0dfdce8358f50ea81c0e6f14ee3a85a).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-05 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
I forgot to add that I had turned off blacklisting by default; I agree with 
your suggestion, Kay.  I pushed another commit which updates the docs as well.  
There are some other small style things and a couple of added comments etc. in 
there too.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66367/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66367 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66367/consoleFull)**
 for PR 15249 at commit 
[`a6c863f`](https://github.com/apache/spark/commit/a6c863f2462986b66a93f0beac3bb1f163afa50d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66358/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66358 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66358/consoleFull)**
 for PR 15249 at commit 
[`89d3c5e`](https://github.com/apache/spark/commit/89d3c5eb44939c38b0be14a6fc10c2139d0126ab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66367 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66367/consoleFull)**
 for PR 15249 at commit 
[`a6c863f`](https://github.com/apache/spark/commit/a6c863f2462986b66a93f0beac3bb1f163afa50d).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout @mridulm thanks for the feedback.  obviously still need to 
figure out the timeout thing but otherwise think I've addressed things.  will 
do another pass in the morning.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@mridulm on the questions about expiry from blacklists, you are not missing 
anything -- this explicitly does not do any timeouts at the taskset level (this 
is mentioned in the design doc).  The timeout code you see is mostly just 
incremental stuff as a step towards https://github.com/apache/spark/pull/14079, 
but doesn't actually add any value here.

The primary motivation for blacklisting that I've seen is actually quite 
different from the use case you are describing -- it's not to help deal w/ 
resource contention, but to deal w/ truly broken resources (a bad disk in all 
the cases I can think of).  In fact, in these cases, 1 hour is really short -- 
users probably really want something more like 6-12 hours.  But 1 hr really 
isn't so bad, it just means that the bad resources need to be "rediscovered" 
that often, with a scheduling hiccup while that happens.

Your use case is really different -- it's a form of backoff to deal w/ 
resource contention.  I have actually talked to a couple of different folks 
about doing something like this recently and think it would be great, though I 
see problems with this approach, since it allows other tasks to still be 
scheduled on those executors, and also the time isn't relative to the task 
runtime, etc.

Nonetheless, an issue here might be that the old option serves some purpose 
which is no longer supported.  Do we need to add it back in?  Just adding the 
logic for the timeouts again is pretty easy, though
(a) I need to figure out the right place to do it so that it doesn't impact 
scheduling performance

and more importantly

(b) I really worry about being able to configure things so that 
blacklisting can actually handle totally broken resources.  E.g., say that you 
set the timeout to 10s.  If your tasks take 1 minute each, then your one bad 
executor might cycle through the leftover tasks, fail them all, pass the 
timeout, and repeat that cycle a few times till you go over 
spark.task.maxFailures (a back-of-the-envelope sketch of this is below).  I 
don't see a good way to deal w/ that while still setting a sensible timeout for 
the entire application.
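To make that back-of-the-envelope concern concrete, here is a tiny toy script; 
every number is an assumption taken from the paragraph above (10s timeout, 
~1 minute tasks, one completely broken executor, and an assumed 
spark.task.maxFailures default of 4), not a measurement:

```scala
// Toy model, runnable with `scala timeout-toy.scala`; not Spark code.
val timeoutSec = 10          // hypothetical per-task blacklist timeout
val healthyTaskSec = 60      // tasks take ~1 minute on a healthy executor
val maxFailures = 4          // spark.task.maxFailures (assumed default)

// While healthy executors are busy with other ~60s tasks, the one bad executor
// grabs a leftover task, fails it almost instantly, waits out the blacklist
// timeout, and grabs it again -- roughly one failure per timeout interval.
val secondsToKillJob = timeoutSec * maxFailures

println(s"a leftover task hits maxFailures after ~${secondsToKillJob}s, " +
  s"before a healthy slot frees up at ~${healthyTaskSec}s")
// With these numbers the job dies after ~40s, so no single application-wide
// timeout is both long enough to survive this and short enough to act as
// the contention backoff described above.
```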

Two other workarounds:

(2) just enable the per-task timeout when the legacy configuration is used.  
Leave it undocumented.  We don't change behavior then, but the configuration is 
kind of a mess, and it'll be a headache to continue to maintain this.

(3) Add a timeout just to *taskset*-level blacklisting.  So it's a behavior 
change from the existing blacklisting, which has a timeout per *task*.  This 
removes the interaction w/ spark.task.maxFailures that we've always got to 
tiptoe around.  I also think it might satisfy your use case even better.  I 
still don't think it's a great solution to the problem, and we need something 
else for handling this sort of backoff better, so I don't feel great about it 
getting shoved into this feature.

I'm thinking (3) is the best but will give it a bit more thought.  Also 
pinging @kayousterhout @tgravescs @markhamstra for opinions, since this is a 
bigger design point to consider.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66358 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66358/consoleFull)**
 for PR 15249 at commit 
[`89d3c5e`](https://github.com/apache/spark/commit/89d3c5eb44939c38b0be14a6fc10c2139d0126ab).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
This mostly looks good -- I made a bunch of cosmetic comments.  Sorry for 
the delay -- I'll be quicker on the next review so we can get this in!





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66175/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66175 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66175/consoleFull)**
 for PR 15249 at commit 
[`5568973`](https://github.com/apache/spark/commit/5568973d12b4027ca15d3cc4b27118e00c1c829b).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66176/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66176 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66176/consoleFull)**
 for PR 15249 at commit 
[`9c9d816`](https://github.com/apache/spark/commit/9c9d8165cc0d220d511ea1855d3f53d97277dce1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
Thanks for the reviews @markhamstra & @kayousterhout, just pushed an update.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66176 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66176/consoleFull)**
 for PR 15249 at commit 
[`9c9d816`](https://github.com/apache/spark/commit/9c9d8165cc0d220d511ea1855d3f53d97277dce1).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #66175 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66175/consoleFull)**
 for PR 15249 at commit 
[`5568973`](https://github.com/apache/spark/commit/5568973d12b4027ca15d3cc4b27118e00c1c829b).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65986/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65986 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65986/consoleFull)**
 for PR 15249 at commit 
[`21e6789`](https://github.com/apache/spark/commit/21e678995c6e06aa7892b23edc9d04b6f7e731e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
Jenkins, retest this please





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65986 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65986/consoleFull)**
 for PR 15249 at commit 
[`21e6789`](https://github.com/apache/spark/commit/21e678995c6e06aa7892b23edc9d04b6f7e731e3).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65979/
Test FAILed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65979 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65979/consoleFull)**
 for PR 15249 at commit 
[`21e6789`](https://github.com/apache/spark/commit/21e678995c6e06aa7892b23edc9d04b6f7e731e3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65979 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65979/consoleFull)**
 for PR 15249 at commit 
[`21e6789`](https://github.com/apache/spark/commit/21e678995c6e06aa7892b23edc9d04b6f7e731e3).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15249
  
It's awesome to have this separated out.  I should have time to review this 
tomorrow, and then hopefully we can (finally) merge this in the next few days!





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65938/
Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15249
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65938 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65938/consoleFull)**
 for PR 15249 at commit 
[`882b385`](https://github.com/apache/spark/commit/882b385c966112c0345fce7fe92e3a0aa31ed22d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15249
  
I would say this is a very important PR. In our experience, sometimes we 
just need to skip some nodes because of bad disks, and the existing blacklist 
mechanism helps little there.





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15249
  
**[Test build #65938 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65938/consoleFull)**
 for PR 15249 at commit 
[`882b385`](https://github.com/apache/spark/commit/882b385c966112c0345fce7fe92e3a0aa31ed22d).





[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/15249
  
@kayousterhout after your suggestion to pull out a helper for blacklisting 
within a TaskSet, I thought it might make sense to actually pull out everything 
related to TaskSets, so that we can make progress in smaller increments, and 
hopefully it's easier to review.

I left in some changes which really only make sense in the larger context, 
but hopefully they are relatively clear.

It *should* be straightforward to merge this one first and then return to 
the original one, which will become a much smaller diff.  But if you think at 
this point it's easier to just do the whole thing in one shot, that's fine too -- 
if so I can just close this and we can focus on 
https://github.com/apache/spark/pull/14079. 

