[GitHub] spark issue #19039: [SPARK-21829][CORE] Enable config to permanently blackli...

2017-08-29 Thread LucaCanali
Github user LucaCanali commented on the issue:

https://github.com/apache/spark/pull/19039
  
@jiangxb1987 Indeed, a good suggestion by @jerryshao - I have replied on 
SPARK-21829 


[GitHub] spark issue #19039: [SPARK-21829][CORE] Enable config to permanently blackli...

2017-08-28 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/19039
  
@LucaCanali Does the alternative approach suggested by @jerryshao sound 
good for your case?


[GitHub] spark issue #19039: [SPARK-21829][CORE] Enable config to permanently blackli...

2017-08-24 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19039
  
The changes you made in `BlacklistTracker` seem to break the design purpose 
of the blacklist. The blacklist in Spark, as well as in MR/Tez, assumes that 
bad nodes/executors will return to normal within several hours, so a 
blacklist entry always has a timeout.

In your case the problem is not bad nodes/executors; it is that you don't 
want to start executors on certain nodes (for example, slow nodes). This is a 
cluster manager problem rather than a Spark problem. To summarize: you want 
your Spark application to run only on specific nodes.

To solve this, on YARN you could use node labels, and Spark on YARN already 
supports node labels; you can look up YARN node labels for the details. A 
minimal example is sketched below.
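
As an illustration (not part of this PR), a minimal sketch assuming a YARN 
node label named "fast" already exists on the cluster; the label name is just 
a placeholder:

```scala
import org.apache.spark.SparkConf

// Hypothetical example: keep the AM and executors on YARN nodes carrying the
// (assumed) node label "fast"; label names depend on the cluster setup.
val conf = new SparkConf()
  .setAppName("node-label-example")
  .set("spark.yarn.am.nodeLabelExpression", "fast")
  .set("spark.yarn.executor.nodeLabelExpression", "fast")
```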

For standalone mode, simply do not start workers on the nodes you want to 
avoid.

For Mesos I'm not sure, but I guess it also has similar mechanisms.




[GitHub] spark issue #19039: [SPARK-21829][CORE] Enable config to permanently blackli...

2017-08-24 Thread LucaCanali
Github user LucaCanali commented on the issue:

https://github.com/apache/spark/pull/19039
  
Thanks @jiangxb1987 for the review. I have tried to address the comments in 
a new commit, in particular adding the configuration to internal/config and 
introducing a private function to handle processing of the node list in 
`spark.blacklist.alwaysBlacklistedNodes`. As for setting `_nodeBlacklist`, I 
think it makes sense to use 
`_nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)` to keep it 
consistent with the rest of the code in `BlacklistTracker`. Also, 
`nodeIdToBlacklistExpiryTime` needs to be initialized with the blacklisted 
nodes.
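
A rough sketch of the idea, for discussion only (names follow this thread; 
the actual code in the commit may differ):

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.mutable

// Sketch only: seed the blacklist from the proposed
// spark.blacklist.alwaysBlacklistedNodes value so the listed nodes never expire.
// Field names mirror the discussion above, not necessarily the actual commit.
object AlwaysBlacklistedSketch {
  private val nodeIdToBlacklistExpiryTime = mutable.HashMap.empty[String, Long]
  private val _nodeBlacklist = new AtomicReference[Set[String]](Set.empty)

  def seed(confValue: String): Unit = {
    val nodes = confValue.split(",").map(_.trim).filter(_.nonEmpty)
    // Long.MaxValue as the expiry time makes these entries effectively permanent.
    nodes.foreach(node => nodeIdToBlacklistExpiryTime.put(node, Long.MaxValue))
    // Keep the published snapshot consistent, as done elsewhere in BlacklistTracker.
    _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
  }
}
```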

As for the usefulness of the feature, I understand your comment and I have 
added some notes in SPARK-21829. My need for this feature comes from a 
production issue which, I realize, is not very common, but could happen again 
in my environment and possibly in others'.
What we have is a shared YARN cluster with one workload that runs slowly on a 
couple of nodes; the nodes are fine for other types of jobs, so we want to 
keep them in the cluster. The actual problem comes from reading from an 
external file system, and apparently it affects only this specific workload 
(which is only one of many workloads running on that cluster). The workaround 
I have used so far is simply killing the executors on the two "slow nodes", 
which lets the job finish faster by avoiding the painfully slow long tail of 
execution on the affected nodes. The proposed patch/feature is an attempt to 
address this case in a more structured way than manually going on the nodes 
and killing executors.
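
For illustration, this is roughly how the proposed setting would be used (the 
config name comes from this PR; the node hostnames are placeholders):

```scala
import org.apache.spark.SparkConf

// Hypothetical usage of the proposed setting; hostnames are placeholders.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.alwaysBlacklistedNodes",
    "slownode1.example.com,slownode2.example.com")
```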

