[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17619
  
Can one of the admins verify this patch?



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-02-20 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17619
  
For anyone watching this: @IgorBerman submitted an updated version of this
change at https://github.com/apache/spark/pull/20640, which I plan to merge
unless there are any objections.



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-02-19 Thread IgorBerman
Github user IgorBerman commented on the issue:

https://github.com/apache/spark/pull/17619
  
+1 here. We are running Spark core jobs, but with a long-running driver, on
Mesos. Sometimes executors fail, which is normal (one of the reasons is a
temporary port conflict). Over time, fewer and fewer executors remain valid
for the driver, which creates a situation where the Mesos cluster has free
resources but no one uses them.



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-02-15 Thread hantuzun
Github user hantuzun commented on the issue:

https://github.com/apache/spark/pull/17619
  
Even though we only run normal Spark jobs, this PR is going to fix a case
for us as well.



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17619
  
Can one of the admins verify this patch?



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-27 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17619
  
@andreimaximov that is still sort of the case for all cluster managers.
You shouldn't get starvation; you should see the app actively fail
(SPARK-15865 was the main change, though there was some small follow-on
work after that). What else can you do if it seems there is something
wrong with every node in your cluster?
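For context, the fail-fast behavior referenced above boils down to: instead
of waiting forever, abort a task set once every live node is blacklisted for
the tasks that still need to run. Below is a minimal, self-contained sketch
of that idea; the names (ClusterView, unschedulable, etc.) are illustrative
and are not Spark's actual API.

```scala
// Illustrative sketch only: fail fast instead of starving when every live
// node is blacklisted for the tasks that still need to run.
object BlacklistAbortSketch {

  final case class Task(id: Int)

  /** Hypothetical driver-side view of the cluster. */
  final case class ClusterView(
      liveNodes: Set[String],
      isBlacklistedFor: (String, Task) => Boolean)

  /** Tasks that cannot be scheduled on any live node. */
  def unschedulable(pending: Seq[Task], cluster: ClusterView): Seq[Task] =
    pending.filter(t => cluster.liveNodes.forall(n => cluster.isBlacklistedFor(n, t)))

  def main(args: Array[String]): Unit = {
    val cluster = ClusterView(
      liveNodes = Set("agent-1", "agent-2"),
      isBlacklistedFor = (_, _) => true) // every node blacklisted for every task
    val stuck = unschedulable(Seq(Task(1), Task(2)), cluster)
    if (stuck.nonEmpty) {
      // Rather than hang, the app should fail here with a clear error.
      println(s"Would abort: tasks ${stuck.map(_.id).mkString(", ")} cannot run anywhere")
    }
  }
}
```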

But if you're really seeing your app just *hang* on Mesos in that
situation -- yeah, it seems like something needs to be fixed in the
Spark-Mesos interaction. Unfortunately, I won't have a clear picture of
what needs to change without spending more time understanding what is
there now...



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-27 Thread andreimaximov
Github user andreimaximov commented on the issue:

https://github.com/apache/spark/pull/17619
  
Not sure if this is still the case, but as of 4 months ago starvation could
happen if enough failures occurred on each node that the entire cluster
ended up blacklisted. Unlikely, but possible for a long-running app on a
sufficiently small cluster.



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-27 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17619
  
OK, I think I understand. This sounds like the equivalent of some of the
existing blacklisting behavior that currently only exists on YARN -- when a
request is made to YARN, the Spark context tells YARN which nodes it has
blacklisted:


https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
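The linked code amounts to this pattern: when the driver asks the cluster
manager for more executors, it also ships its current node blacklist so the
manager can avoid allocating on those hosts. A rough sketch of the idea,
using made-up names rather than the real YarnSchedulerBackend messages:

```scala
// Rough sketch of the YARN-style approach: the executor request carries the
// driver's current node blacklist so the resource manager can steer new
// allocations away from those hosts. All names here are illustrative.
final case class RequestExecutorsSketch(
    requestedTotal: Int,
    nodeBlacklist: Set[String])

class SchedulerBackendSketch(currentBlacklist: () => Set[String]) {
  /** Build the request the driver would send to the cluster manager. */
  def prepareRequest(requestedTotal: Int): RequestExecutorsSketch =
    RequestExecutorsSketch(requestedTotal, currentBlacklist())
}

object SchedulerBackendSketchDemo {
  def main(args: Array[String]): Unit = {
    val backend = new SchedulerBackendSketch(() => Set("bad-node-1", "bad-node-2"))
    // The blacklist travels with the request, instead of being applied only
    // after executors have already been placed on a bad node.
    println(backend.prepareRequest(requestedTotal = 10))
  }
}
```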

However, it still seems like there is a missing piece -- you have to tell
Mesos which nodes you don't want executors on, right?

I also don't understand why you'd get *starvation* in your app with this --
shouldn't Mesos be requesting executors on other nodes?

Anyway, I agree that something seems wrong with the Mesos scheduling when
there is a bad node, but I'm not certain this is the right fix, and I just
don't know enough about the communication between Mesos and Spark to say
exactly what should be done instead, sorry.
@mgummelt can you comment?

It might actually be better to have this discussion on JIRA, since we're
talking about general design, not the specifics of this change.



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-22 Thread timout
Github user timout commented on the issue:

https://github.com/apache/spark/pull/17619
  
That does exactly what it is supposed to do. And you are absolutely right,
it is related to executors.
I am sorry if that was not clear from my previous explanations.
Let us say we have a Spark Streaming app, i.e. a very long-running app:
the driver, started by Marathon from a Docker image, schedules (in the
Mesos sense) executors that also run from Docker images (net=HOST), so
every executor is started from a Docker image on some Mesos agent.
Now suppose some recoverable error happens, for instance:
ExecutorLostFailure (executor 40 exited caused by one of the running tasks)
Reason: Remote RPC client disassociated... (I do not know about other
environments, but it is relatively frequent in mine.)
As a result the executor is dead, and after 2 such failures the Mesos agent
node is added to the MesosCoarseGrainedSchedulerBackend blacklist and the
driver will never again schedule (in the Mesos sense) an executor on it.
So the app will starve... and note that it will not die.
That is exactly what happened with my streaming apps before this patch.

This patch may already be incompatible with master, but I can fix it if
needed.
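To make the behavior described above concrete, here is a small sketch of the
general pattern: a per-agent executor-failure counter with no expiry, so once
an agent reaches the threshold (2 failures in the description above) every
future offer from it is declined, and a long-running driver gradually loses
the whole cluster. The names are simplified illustrations, not the actual
MesosCoarseGrainedSchedulerBackend code.

```scala
import scala.collection.mutable

// Per-agent failure counter with no expiry: once an agent hits the
// threshold, every future offer from it is declined, so a long-running
// driver slowly runs out of usable agents. (Simplified illustration only.)
class AgentBlacklistSketch(maxFailuresPerAgent: Int = 2) {
  private val failuresByAgent = mutable.Map.empty[String, Int].withDefaultValue(0)

  /** Called when an executor on `agentId` exits abnormally. */
  def recordExecutorFailure(agentId: String): Unit =
    failuresByAgent(agentId) += 1

  /** With no expiry, even transient failures permanently remove the agent. */
  def shouldDeclineOffer(agentId: String): Boolean =
    failuresByAgent(agentId) >= maxFailuresPerAgent
}

object AgentBlacklistSketch {
  def main(args: Array[String]): Unit = {
    val blacklist = new AgentBlacklistSketch()
    Seq("agent-1", "agent-1").foreach(blacklist.recordExecutorFailure) // two transient failures
    println(blacklist.shouldDeclineOffer("agent-1")) // true: declined forever
    println(blacklist.shouldDeclineOffer("agent-2")) // false
  }
}
```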





[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-21 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/17619
  
Sorry I am only just looking at this now --

I am not so sure this is doing what you think. I think the notion of a
"task" in MesosCoarseGrainedSchedulerBackend might be something different;
it is really an "executor" in Spark's terminology. Perhaps that code should
have some additional comments explaining that. Tasks themselves are still
handled in Spark's TaskScheduler / TaskSetManager, etc.

@mgummelt can you confirm my understanding?

@timout please close this PR (unless I'm wrong about this code ...)



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17619
  
Can one of the admins verify this patch?



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-12-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17619
  
Can one of the admins verify this patch?



[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-08-14 Thread andreimaximov
Github user andreimaximov commented on the issue:

https://github.com/apache/spark/pull/17619
  
Is there an update on this PR? It doesn't seem possible to run Spark
reliably on Mesos with this issue.




[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...

2017-04-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17619
  
Can one of the admins verify this patch?

