[GitHub] spark issue #19039: [SPARK-21829][CORE] Enable config to permanently blackli...
Github user LucaCanali commented on the issue: https://github.com/apache/spark/pull/19039

@jiangxb1987 Indeed, a good suggestion by @jerryshao - I have replied on SPARK-21829

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19039

@LucaCanali Does the alternative approach suggested by @jerryshao sound good for your case?
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19039

The changes you made in `BlacklistTracker` seem to break the design purpose of the blacklist. The blacklist in Spark, as well as in MR/Tez, assumes that bad nodes/executors will return to normal within several hours, so a blacklist entry always has a timeout. In your case the problem is not bad nodes/executors; it is that you don't want to start executors on certain nodes (like slow nodes). This is more of a cluster-manager problem than a Spark problem.

To summarize your problem: you want your Spark application to run only on specific nodes. For YARN, you could use node labels, and Spark on YARN already supports node labels; you can search for node labels to learn the details. For standalone, simply do not start workers on the nodes you want to avoid. For Mesos I'm not sure, but I guess it has similar mechanisms.
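To illustrate the node-label alternative mentioned above, a minimal sketch of a submission that pins both the AM and the executors to labeled nodes might look like the following. The label name `fast`, the class `org.example.MyApp`, and the jar name are illustrative assumptions; the two `spark.yarn.*.nodeLabelExpression` keys are the Spark-on-YARN node-label settings, and the label itself must first be defined on the cluster by the YARN admin.

```shell
# Sketch: request containers only on nodes the YARN admin has labeled "fast".
# "fast", org.example.MyApp, and myapp.jar are placeholders for illustration.
spark-submit \
  --master yarn \
  --conf spark.yarn.am.nodeLabelExpression=fast \
  --conf spark.yarn.executor.nodeLabelExpression=fast \
  --class org.example.MyApp \
  myapp.jar
```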
Github user LucaCanali commented on the issue: https://github.com/apache/spark/pull/19039

Thanks @jiangxb1987 for the review. I have tried to address the comments in a new commit, in particular adding the configuration to internal/config and building a private function to handle processing of the node list in `spark.blacklist.alwaysBlacklistedNodes`. As for setting `_nodeBlacklist`, I think it makes sense to use `_nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)` to keep it consistent with the rest of the code in `BlacklistTracker`. Also, `nodeIdToBlacklistExpiryTime` needs to be initialized with the blacklisted nodes.

As for the usefulness of the feature, I understand your comment and have added some notes in SPARK-21829. The need for this feature comes from a production issue, which I realize is not very common, but which I expect can recur in my environment and perhaps in others'. We have a shared YARN cluster and one workload that runs slowly on a couple of nodes; those nodes are fine for other types of jobs, so we want to keep them in the cluster. The actual problem comes from reading from an external file system, and apparently only for this specific workload (one of many that run on the cluster). The workaround I have used so far is simply to kill the executors on the two "slow nodes", which let the job finish faster by avoiding the painfully slow long tail of execution on the affected nodes. The proposed patch is an attempt to address this case in a more structured way than logging on to the nodes and killing executors by hand.
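The comment above describes parsing the node list from `spark.blacklist.alwaysBlacklistedNodes` and seeding `nodeIdToBlacklistExpiryTime` with the result. A minimal standalone sketch of that idea is below; the object and method names are illustrative, not the actual PR implementation, and `Long.MaxValue` stands in for a "never expires" entry:

```scala
// Hypothetical sketch of seeding a BlacklistTracker-style expiry map
// from a comma-separated config value (names are illustrative only).
object PermanentBlacklistSketch {
  // Split the config value on commas, trim whitespace, drop empty entries.
  def parseBlacklistedNodes(conf: String): Set[String] =
    conf.split(",").map(_.trim).filter(_.nonEmpty).toSet

  // Seed the expiry map; Long.MaxValue means the entry never times out.
  def seedExpiryTimes(nodes: Set[String]): Map[String, Long] =
    nodes.map(n => n -> Long.MaxValue).toMap
}
```

With this shape, the permanently blacklisted nodes flow through the same expiry map the timeout-based entries use, which is consistent with keeping `_nodeBlacklist` in sync via `keySet.toSet` as described above.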