[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-11-25 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-159744567
  
Today I hit this problem again. In a 14-node cluster, one node failed to
install a library, and I had no way to finish the job. The failed tasks are
scheduled onto the same host repeatedly, and the job fails after 4 tries.
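
The retry behaviour described here can be illustrated with a toy model. This is a minimal sketch in Python, not Spark's scheduler: `run_task`, its parameters and the host names are hypothetical, and the only numbers taken from the comment are the 14 nodes and the 4 tries.

```python
MAX_TASK_FAILURES = 4  # the "4 tries" mentioned in the comment


def run_task(preferred_host, hosts, bad_hosts, use_host_blacklist):
    """Return (succeeded, hosts_tried) for one task under a toy retry loop."""
    blacklisted = set()
    tried = []
    for _ in range(MAX_TASK_FAILURES):
        candidates = [h for h in hosts if h not in blacklisted]
        if not candidates:
            break
        # Locality preference: keep choosing the preferred host while allowed.
        host = preferred_host if preferred_host in candidates else candidates[0]
        tried.append(host)
        if host not in bad_hosts:
            return True, tried
        if use_host_blacklist:
            # Assumption: one failure is enough to avoid the host for this task.
            blacklisted.add(host)
    return False, tried


hosts = ["node-%d" % i for i in range(14)]
print(run_task("node-3", hosts, {"node-3"}, use_host_blacklist=False))
print(run_task("node-3", hosts, {"node-3"}, use_host_blacklist=True))
```

Without the blacklist, every attempt in this model lands on the broken node; with it, the second attempt moves to another host.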


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-04-07 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-90656662
  
Closing this now; will re-open if needed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-04-07 Thread davies
Github user davies closed the pull request at:

https://github.com/apache/spark/pull/3541





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-01-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-72128864
  
  [Test build #26340 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26340/consoleFull)
 for   PR 3541 at commit 
[`482804b`](https://github.com/apache/spark/commit/482804b9511567aaa02e20890361136b28812ac3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-01-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-72128871
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26340/
Test PASSed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2015-01-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-72119497
  
  [Test build #26340 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26340/consoleFull)
 for   PR 3541 at commit 
[`482804b`](https://github.com/apache/spark/commit/482804b9511567aaa02e20890361136b28812ac3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-12 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-66863621
  
To be frank, I am new to Spark.
If there are local file system access errors, should we identify them and
mark the host as failed? Maybe wrap them as a ReadFileException or
WriteFileException or something similar, though I am not sure whether that
works for distinguishing them from other exceptions in Spark...





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-10 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-66424149
  
On Thu, Dec 4, 2014 at 2:57 AM, Davies Liu  wrote:

> @davies  I am not sure I completely understood
> your comment.
> Sorry for that, maybe I did not explain it clearly.
>
> As detailed above, there are multiple reasons why a task can fail - and
> quite a lot of them are non-fatal from 'rescheduling the task on same host'
> point of view : in particular race in spark between reporting executor
> going down, shutdown hooks running and task schedules due to locality
> preference.
> So we need per-executor blacklist - note that this is just a temporary -
> to either allow the executor to recover (in case task failures are due to
> transient reasons), or allow task to get scheduled elsewhere in meantime
> (if schedule locality constraints can be satisfied).
>
> Agreed that the executor based blacklist worked for you, and I think the
> host based blacklist will also work for you (there is a little regression
> about locality).
>

It is not a small regression - if you have 4 - 8 executors on a host (as is
common here) : this change will blacklist all of them instead of
blacklisting a single executor.
This is a fairly severe regression : which is why I said I am -1 on modifying
existing behavior unless the new functionality allows the existing feature to
continue to work as currently expected.
The thing to understand is executor blacklist is not subsumed by host
blacklist other than in a very crude model.



>  A different set of criteria would apply when we want to do host level
> blacklist - when we have determined that the node is unusable, and so task
> fails on all executors in the node : due to NODE_LOCAL locality level, we
> would keep trying other executors on the same node in case executor
> blacklist kicks in; so in case the node is temporarily unusable, executor
> black list might not help.
>
> So we need host based blacklist.
>

Yes, the reasons why we need host blacklist are valid and separate from why
we need executor blacklist.
They might overlap in some degenerate cases (since obviously host level
issues do impact executors too) : executor blacklist is more fine grained -
while host level issues are coarser in comparison.
While executor blacklist might alleviate lack of host blacklist to some
extent (as exists currently), it is suboptimal to do so : so need for host
blacklist is justified.




>  The timeout based temporary executor blacklist we currently have is
> still a stop gap solution which solves immediate problems observed at that
> time : without which spark was becoming unusable in large enough
> multi-tenant clusters.
>
> Agreed.
>
> If we want to take it to a host level and do a principled solution - then we
> need a lot of other pieces to be put into place (since currently we only
> take task scheduling into account; which is insufficient).
> Top of my head - remove it from rdd replication, de-allocate executors
> already on the node, moving existing rdd blocks away from the executors on
> the node, blacklisting the node from further allocation requests (yarn,
> mesos), and so on. I am sure @kayousterhout
>  might have other thoughts on this.
>
> Agreed. Figuring out the failure domain is a hard thing in a distributed
> environment; I doubt that anyone can contribute a principled solution to
> retry the failed tasks in the best position in the near term (such as
> rescheduling them on the same executor, a different executor on the same
> host, a different host, or a different rack).
>
> I think the host based blacklist is the simplest solution and works well in
> most failure cases.
>
> Unfortunately, I do not have the bandwidth to engage on this; so I am
> hoping the right thing gets done. Whatever it is, I am -1 on removing
> executor level blacklist - that is something we heavily depend on to get
> our jobs to work. A better solution while not regressing on this
> functionality is most welcome !
>
> Really appreciate your comments here, to have a better solution. Could you
> describe detailed cases in which the host based blacklist will break your
> job? Maybe there are some cases in your situation that I did not figure
> out; please correct me.
>


The primary reason for executor blacklist, as @kayousterhout also referred
to, was initially quite simple :
Task gets submitted to same executor repeatedly due to locality constraint
- but keeps failing on the executor since the executor might be in
inconsistent state (like in m

[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-09 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-66396977
  
@suyanNone Yes.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-09 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-66396760
  
@davies IIUC, the current executor blacklist or host blacklist applies to a
single TaskSetManager. There is a chance that two or more TaskSetManagers
run on the same host or executor... right?





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65312874
  
@mridulm In the case of one executor restarted on failure, the executor id
will change; putting the previous executor-id in the blacklist does make
sense. It only helps when the task failed because of a particular state of
an executor (such as OOM or too many open files). In these cases, a host
based blacklist could also help.

One situation in which an executor-id based blacklist will be better than a
host based one is when you have only one host and several executors on it.
In this case, you will see some scheduling delay once a task fails. Also, an
executor-id based blacklist may improve data locality if the failed task is
rescheduled onto the same host, but the performance gain should be small,
because the percentage of failed tasks should be small.

It's more common for users to have multiple hosts than multiple executors on
a single host, and a host based blacklist can help in most failure cases, so
it should be turned on by default. It would be good to have a host based
blacklist together with the executor-id based one, but that would complicate
the code a lot. Are the benefits above worth it?





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65309912
  
Note: I am ignoring deterministic failure reasons here (which will fail on
any host and usually point to a bug in the user or Spark codebase).
Task failure could be due to a variety of transient reasons - which could
be directly related to the task in question, indirectly related to it, or even
completely unrelated to it.
For example:
- What other tasks are running on the executor and how they interact with
the failed task.
- What data is currently cached on the executor and the impact it has on 
resource utilization (rdd, broadcast, buffers, gc, etc).
- What the current state of the executor is (in process of shutdown, but 
not yet informed the driver about it).
... among others.

Also note that when you have resource constraints enforced (particularly in 
yarn - where memory limits are aggressively enforced) - one or more of the 
above can interact with that to cause further non deterministic failures : 
which is why we have more hacks like memory overheads to help alleviate (though 
not eliminate) them.

Since we have limits on the number of times an application can have executor
failures, the number of times a task can fail before the application is
failed, etc - we need executor level blacklist.
Note, this does not mean we do not need host level blacklist ! I can 
definitely see value in that if the issues above are host level - as pointed 
out, lack of hdd space, bad memory or cpu, thermal issues, etc.

Ideally, as I mentioned in the past, we need a better way to identify and
blacklist executors/hosts/racks.
What we currently have is a stop gap hack - and upgrading that from
executor level to host level does not solve problems (it actually causes
regressions in our workloads - since we are now missing a replica completely
for dfs data and the other replicas might not be in our allocated
hosts/executors).





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65303921
  
@mridulm I think maybe what @davies was asking (which I'm also wondering 
about) was: why would a task fail on executor A but not on executor B, if 
they're both on the same host?  Many of the reasons I can think of for 
something failing are host-specific (e.g., bad disk) and not executor-specific.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65301702
  
@davies when you can have multiple executors per host, or an executor
restarted on the host on failure, then this can manifest ... please refer to
the comments that @kayousterhout referenced above on why it was added.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65278813
  
@mridulm Thanks for the feedback. Could you provide some details about your
configuration (such as the number of executors on a host and the number of
hosts)? And in which cases is the host level blacklist worse than the
executor level one?





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-02 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65272551
  
Thx @kayousterhout for the ping !
We blacklist executors fairly aggressively - not hosts.
The assumption that a task that failed on one executor on a host will fail
on all executors on that host is incorrect - particularly in a multi-tenant
environment, where the number of executors per host can be fairly high :
which is exactly the reason why I added this hack to begin with.

Instead of modifying the existing behavior of executor level blacklist, it 
would be better to add to it - add a way to configure host level blacklisting 
in addition to existing functionality.

I am -1 on changing the existing behavior.
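
One way to picture "host level in addition to executor level" is a per-task blacklist with two independent timeouts. This is a minimal sketch in Python, not code from the patch: the `TaskBlacklist` class, its method names and both timeout parameters are hypothetical, and a host timeout of 0 is assumed to mean "keep today's executor-only behavior".

```python
class TaskBlacklist:
    """Tracks recent failures of one task per executor and, optionally, per host."""

    def __init__(self, executor_timeout_ms, host_timeout_ms=0):
        self.executor_timeout_ms = executor_timeout_ms
        self.host_timeout_ms = host_timeout_ms  # 0 disables host-level blacklisting
        self._executor_failures = {}  # executor id -> time of last failure (ms)
        self._host_failures = {}      # host name -> time of last failure (ms)

    def record_failure(self, executor_id, host, now_ms):
        self._executor_failures[executor_id] = now_ms
        self._host_failures[host] = now_ms

    def is_blacklisted(self, executor_id, host, now_ms):
        failed_at = self._executor_failures.get(executor_id)
        if failed_at is not None and now_ms - failed_at < self.executor_timeout_ms:
            return True
        failed_at = self._host_failures.get(host)
        if failed_at is not None and now_ms - failed_at < self.host_timeout_ms:
            return True
        return False


# Executor-only mode: a sibling executor on the same host stays usable.
executor_only = TaskBlacklist(executor_timeout_ms=1000)
executor_only.record_failure("exec-1", "host-a", now_ms=0)
print(executor_only.is_blacklisted("exec-2", "host-a", now_ms=500))
```

Keeping the two timeouts separate means enabling the host-level check cannot change what the executor-level check already does.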





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65163730
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24006/
Test PASSed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65163725
  
  [Test build #24006 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24006/consoleFull)
 for   PR 3541 at commit 
[`6daec2d`](https://github.com/apache/spark/commit/6daec2da02de9d60bc56391bf8d9388fbcf19a92).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65156189
  
  [Test build #538 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/538/consoleFull)
 for   PR 3541 at commit 
[`5ff8227`](https://github.com/apache/spark/commit/5ff82273a92dc83d59f28a6d19fc5f9f50098015).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  trait ConnectionFactory extends Serializable `
  * `class MatrixFactorizationModel(`
  * `class CompressedSerializer(FramedSerializer):`






[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65152580
  
  [Test build #24006 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24006/consoleFull)
 for   PR 3541 at commit 
[`6daec2d`](https://github.com/apache/spark/commit/6daec2da02de9d60bc56391bf8d9388fbcf19a92).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65148808
  
  [Test build #24003 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24003/consoleFull)
 for   PR 3541 at commit 
[`37d8fe0`](https://github.com/apache/spark/commit/37d8fe0b6920824c774a48a0583be43c98a6703c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65148816
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24003/
Test FAILed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65147008
  
  [Test build #24000 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24000/consoleFull)
 for   PR 3541 at commit 
[`592dd4f`](https://github.com/apache/spark/commit/592dd4f7f8e4aef8e6124fdbc4097b94ef1690c7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65147017
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24000/
Test FAILed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65145903
  
  [Test build #538 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/538/consoleFull)
 for   PR 3541 at commit 
[`5ff8227`](https://github.com/apache/spark/commit/5ff82273a92dc83d59f28a6d19fc5f9f50098015).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65144979
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24004/
Test FAILed.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65141381
  
I made a JIRA for enabling this for the next version of Spark: 
https://issues.apache.org/jira/browse/SPARK-4681





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65141318
  
@kayousterhout for naming I think the best thing is to advertise a new name 
in the docs, and if someone has set the old name we just accept that also.
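
A minimal sketch of that fallback, written in Python for brevity rather than Spark's Scala. Both key names and the `get_blacklist_timeout_ms` helper are hypothetical placeholders, not confirmed Spark configuration keys or APIs.

```python
# Hypothetical key names, used only for illustration.
NEW_KEY = "scheduler.hostBlacklistTimeout"
OLD_KEY = "scheduler.executorBlacklistTimeout"


def get_blacklist_timeout_ms(conf, default_ms=0):
    """Read the timeout, preferring the new key and accepting the old one."""
    if NEW_KEY in conf:
        return int(conf[NEW_KEY])
    if OLD_KEY in conf:
        # The old name is still honoured so existing deployments keep working.
        return int(conf[OLD_KEY])
    return default_ms


print(get_blacklist_timeout_ms({OLD_KEY: "5000"}))
```

When both names are set, this sketch lets the new one win, which is one possible choice rather than something the thread settles.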





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65141229
  
@pwendell @aarondav what about the naming though -- what do we usually do 
for consistency in this kind of scenario? The old config name is executor 
blacklist timeout, which is technically not correct after this patch (it's host 
blacklist timeout).





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65140988
  
@aarondav Good catch! I just realized that this blacklist is not enabled 
by default.

Should we increase the timeout to 10 seconds or 1 minute?





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65140880
  
Hey, this is a good idea. Another thing: we should turn this feature 
on by default in the master branch and for 1.3+, IMO. I don't see any reason why 
a user wouldn't want this behavior.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65140169
  
[Test build #24003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24003/consoleFull) for PR 3541 at commit [`37d8fe0`](https://github.com/apache/spark/commit/37d8fe0b6920824c774a48a0583be43c98a6703c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65140196
  
Yeah, @mridulm made this "secret" based on an assumption that it was a 
stopgap solution: 

"Regarding timeout variable name : I was considering a very specific 
variable name to allow for a better/future approach to handling this issue - 
and at that time allow us to retire this variable without potential variable 
name conflicts (spark.scheduler.blacklistTimeout implies a more general 
black-list handling, which this is not unfortunately); IMO, this is a stop gap 
solution until we add support for a better black list approach which handles 
both executors and blocks.

But until we have that, this will at least unblock us - thankfully, this is 
not something which a lot of users are hitting (but is fairly common in our 
case unfortunately).

Given this, should we expose this in documentation ?"

Seems like we should document this now if more folks are running into it.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65139319
  
Ah, so this appears to be a secret feature anyway, as you have to set this 
undocumented timeout property to a nonzero value to enable it.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3541#discussion_r21121570
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -274,12 +274,12 @@ private[spark] class TaskSetManager(
    * Is this re-execution of a failed task on an executor it already failed in before
    * EXECUTOR_TASK_BLACKLIST_TIMEOUT has elapsed ?
    */
-  private def executorIsBlacklisted(execId: String, taskId: Int): Boolean = {
-    if (failedExecutors.contains(taskId)) {
-      val failed = failedExecutors.get(taskId).get
+  private def executorIsBlacklisted(host: String, taskId: Int): Boolean = {
+    if (failedHosts.contains(taskId)) {
+      val hosts = failedHosts.get(taskId).get
 
-      return failed.contains(execId) &&
-        clock.getTime() - failed.get(execId).get < EXECUTOR_TASK_BLACKLIST_TIMEOUT
+      return hosts.contains(host) &&
+        clock.getTime() - hosts.get(host).get < EXECUTOR_TASK_BLACKLIST_TIMEOUT
--- End diff --

I think we cannot change the name of the config variable. How about just 
calling it `TASK_BLACKLIST_TIMEOUT`?
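
For readers following the diff, the per-host check can be sketched in isolation roughly as below. This is a simplified stand-in, not the actual `TaskSetManager` code: the `HostBlacklist` class, its method names, and the explicit `nowMs` parameter (used here in place of the `clock` object) are assumptions made so the example is self-contained.

```scala
import scala.collection.mutable

// Simplified stand-in for the check in the diff; not Spark's TaskSetManager.
class HostBlacklist(timeoutMs: Long) {
  // task index -> (host -> time of the most recent failure on that host, in ms)
  private val failedHosts = mutable.HashMap.empty[Int, mutable.HashMap[String, Long]]

  /** Record that the given task failed on the given host at time nowMs. */
  def recordFailure(taskIndex: Int, host: String, nowMs: Long): Unit = {
    val perHost = failedHosts.getOrElseUpdate(taskIndex, mutable.HashMap.empty[String, Long])
    perHost(host) = nowMs
  }

  /** True while the task's last failure on this host is less than timeoutMs old. */
  def isBlacklisted(taskIndex: Int, host: String, nowMs: Long): Boolean =
    failedHosts.get(taskIndex)
      .flatMap(_.get(host))
      .exists(failedAt => nowMs - failedAt < timeoutMs)
}
```

Two properties of this shape match points raised in the thread: an entry stops applying once the timeout has elapsed, so the host becomes eligible again, and a timeout of 0 makes the comparison always false, which is consistent with the feature being off unless the timeout is set to a nonzero value.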





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65138728
  
@aarondav The blacklist entries have a timeout, so eventually the tasks can 
be launched on all hosts again.





[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-01 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-65138690
  
@aarondav no -- the idea with this was that eventually the blacklist 
timeout will expire.  When @mridulm originally added this feature, the 
motivation described was: "The reason I had to initially look into the problem 
was due to repeated task failures on an executor which was going to die anyway 
(was in process of cleanup I believe when tasks getting assigned to it or some 
such) : which killed our tasksets and so jobs." For that case, deadlock isn't 
so much an issue -- although you're right that it could happen in other cases.

