On Thu, Jul 9, 2015 at 12:32 PM, besil <sbernardine...@beintoo.com> wrote:

> Hi,
>
> We are experiencing scheduling errors due to Mesos slaves failing.
> It seems to be an open bug; more information can be found here:
>
> https://issues.apache.org/jira/browse/SPARK-3289
>
> According to this link
> <
> https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3ccaakwvaxprrnrcdlazcybnmk1_9elyheodaf8urf8ssrlbac...@mail.gmail.com%3E
> >
> from the mail archive, it seems that Spark doesn't reschedule LOST tasks on
> active executors, but keeps trying to reschedule them on the failed host.
>

Are you running in fine-grained mode? In coarse-grained mode it seems that
Spark will notice a slave that fails repeatedly and will stop accepting
offers on that slave:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L188
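To illustrate, here's a minimal sketch of that failure-tracking idea in plain Scala (not the actual Spark code; the object and method names are made up, and `MaxSlaveFailures` stands in for Spark's `MAX_SLAVE_FAILURES` constant):

```scala
import scala.collection.mutable

// Hedged sketch: count task failures per slave and stop accepting
// resource offers from a slave once it has failed too many times,
// which is roughly what CoarseMesosSchedulerBackend does.
object SlaveBlacklistSketch {
  val MaxSlaveFailures = 2 // assumed threshold, mirroring MAX_SLAVE_FAILURES

  private val failuresBySlave = mutable.Map.empty[String, Int]

  def recordFailure(slaveId: String): Unit =
    failuresBySlave(slaveId) = failuresBySlave.getOrElse(slaveId, 0) + 1

  // Accept an offer only if the slave is below the failure threshold.
  def shouldAcceptOffer(slaveId: String): Boolean =
    failuresBySlave.getOrElse(slaveId, 0) < MaxSlaveFailures

  def main(args: Array[String]): Unit = {
    recordFailure("slave-1")
    println(shouldAcceptOffer("slave-1")) // still accepted after one failure
    recordFailure("slave-1")
    println(shouldAcceptOffer("slave-1")) // refused after hitting the limit
    println(shouldAcceptOffer("slave-2")) // a healthy slave is still accepted
  }
}
```

In fine-grained mode there is no equivalent per-slave blacklist, which matches the behavior described in SPARK-3289.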


>
> We would like to dynamically resize our Mesos cluster (adding or removing
> machines - using an AWS autoscaling group), but this bug kills our running
> applications if a Mesos slave running a Spark executor is shut down.
>

I think what you need is dynamic allocation, which should be available soon
(PR: 4984 <https://github.com/apache/spark/pull/4984>).
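For reference, once that lands, enabling it should look roughly like this in spark-defaults.conf (a sketch; the property names are taken from the existing dynamic allocation docs for other cluster managers, and the Mesos support may differ):

```
# Hedged sketch of a spark-defaults.conf fragment. Dynamic allocation
# also requires the external shuffle service to run on each slave.
spark.dynamicAllocation.enabled   true
spark.shuffle.service.enabled     true
```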


> Is there any known workaround?
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Mesos-task-rescheduling-tp23740.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com