[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17619 Can one of the admins verify this patch?
Github user squito commented on the issue: https://github.com/apache/spark/pull/17619 For anyone watching this: @IgorBerman submitted an updated version of this at https://github.com/apache/spark/pull/20640, which I plan to merge unless there are any objections.
Github user IgorBerman commented on the issue: https://github.com/apache/spark/pull/17619 +1 here. We are running Spark core jobs, but with a long-running driver on Mesos. Sometimes executors fail, which is normal (one of the reasons is a temporary port conflict). Over time, fewer and fewer executors are valid for the driver, which creates a situation where the Mesos cluster has free resources but no one uses them.
Github user hantuzun commented on the issue: https://github.com/apache/spark/pull/17619 Even though we only run normal Spark jobs, this PR is going to fix a case for us as well.
Github user squito commented on the issue: https://github.com/apache/spark/pull/17619 @andreimaximov that is still sort of the case for all cluster managers. You shouldn't get starvation; you should see the app actively fail (SPARK-15865 was the main change, though there was some small follow-on work after that). What else can you do if it seems there is something wrong with every node in your cluster? But if you're really seeing your app just *hang* in Mesos in that situation -- yeah, it seems like something needs to be fixed in the Spark-Mesos interaction. Unfortunately, I won't have a clear picture of what needs to change without spending more time understanding what is there now ...
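A minimal sketch of the "actively fail instead of starve" behavior described in the comment above. This is a toy model under stated assumptions, not Spark's scheduler code: `pick_node`, `AllNodesBlacklisted`, and the node names are hypothetical and exist only for illustration.

```python
class AllNodesBlacklisted(RuntimeError):
    """Raised when a task has no node left to run on (hypothetical name)."""


def pick_node(task_id, nodes, blacklisted):
    """Return the first node that is not blacklisted, or fail loudly."""
    candidates = [n for n in nodes if n not in blacklisted]
    if not candidates:
        # Raising here surfaces the problem to the user instead of
        # leaving the task pending forever (the "hang" case).
        raise AllNodesBlacklisted(
            f"task {task_id} cannot run: all {len(nodes)} nodes are blacklisted"
        )
    return candidates[0]
```

With `nodes = ["n1", "n2"]` and `blacklisted = {"n1"}`, the call returns `"n2"`; once `"n2"` is blacklisted as well, the call raises rather than waiting indefinitely.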
Github user andreimaximov commented on the issue: https://github.com/apache/spark/pull/17619 Not sure if this is still the case, but as of 4 months ago starvation could happen if enough failures occurred on each node that the entire cluster ended up blacklisted. Unlikely, but possible for a long-running app on a sufficiently small cluster.
Github user squito commented on the issue: https://github.com/apache/spark/pull/17619 OK, I think I understand. This sounds like the equivalent of some of the existing blacklisting behavior which currently only exists on YARN -- when a request is made to YARN, the Spark context tells YARN which nodes it has blacklisted: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 However, it still seems like there is a missing piece -- you have to tell Mesos which nodes you don't want executors on, right? I also don't understand why you'd get *starvation* in your app with this -- shouldn't Mesos be requesting executors on other nodes? Anyway, I agree that something seems wrong with the Mesos scheduling when there is a bad node, but I'm not certain this is the right fix, and I just don't know enough about the communication between Mesos and Spark to say exactly what should be done instead, sorry. @mgummelt can you comment? It might actually be better to have this discussion on JIRA, since we're talking about general design, not specifics of this change.
Github user timout commented on the issue: https://github.com/apache/spark/pull/17619 That does exactly what it is supposed to do. And you are absolutely right, it is related to executors. I am sorry if that was not clear from my previous explanations. Let us say we have a Spark Streaming app, a very long-running app: the driver, started by Marathon using a Docker image, schedules (in the Mesos sense) executors using Docker images (net=HOST), so every executor is started from a Docker image on some Mesos agent. Now suppose some recoverable error happens, for instance: ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Remote RPC client disassociated... (I do not know about others, but it is relatively frequent in my environment.) As a result the executor will be dead, and after 2 failures the Mesos agent node will be included in the MesosCoarseGrainedSchedulerBackend blacklist and the driver will never schedule (in the Mesos sense) an executor on it again. So the app will starve... and, notice, it will not die. That is exactly what happened with my streaming apps before that patch. That patch may already be incompatible with master, but I can fix it if needed.
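The starvation scenario described above can be sketched as a toy simulation. This is not the MesosCoarseGrainedSchedulerBackend code; the class, the function names, and the agent ids are hypothetical, and the only detail taken from the comment is the threshold of 2 failures per agent with no later reset.

```python
from collections import defaultdict

# Threshold taken from the comment above; the constant name is hypothetical.
MAX_AGENT_FAILURES = 2


class ToyBackend:
    """Toy driver that permanently rejects offers from agents that failed too often."""

    def __init__(self, max_failures=MAX_AGENT_FAILURES):
        self.max_failures = max_failures
        self.failures = defaultdict(int)  # agent id -> executor failures seen

    def executor_lost(self, agent_id):
        # Every lost executor counts against its agent; nothing ever resets the count.
        self.failures[agent_id] += 1

    def accepts_offer(self, agent_id):
        # An agent at or above the threshold is never used again.
        return self.failures[agent_id] < self.max_failures

    def usable_agents(self, agent_ids):
        return [a for a in agent_ids if self.accepts_offer(a)]


def simulate(agents, lost_executors):
    """Replay a sequence of executor losses and return the agents still usable."""
    backend = ToyBackend()
    for agent_id in lost_executors:
        backend.executor_lost(agent_id)
    return backend.usable_agents(agents)
```

In this model, `simulate(["a1", "a2"], ["a1", "a1", "a2", "a2"])` leaves no usable agent: the cluster still has free resources, but the driver declines every offer, so the app keeps running without making progress.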
Github user squito commented on the issue: https://github.com/apache/spark/pull/17619 Sorry, I am only just looking at this now -- I am not so sure this is doing what you think. I think the notion of "task" in MesosCoarseGrainedSchedulerBackend might be something different; it's really an "executor" in Spark's terminology. Perhaps that code should have some additional comments explaining that. Tasks are still handled in Spark's TaskScheduler / TaskSetManager etc. @mgummelt can you confirm my understanding? @timout please close this PR (unless I'm wrong about this code ...)
Github user andreimaximov commented on the issue: https://github.com/apache/spark/pull/17619 Is there an update on this PR? Doesn't seem possible to run Spark reliably on Mesos with this.