On Thu, Jul 9, 2015 at 12:32 PM, besil <sbernardine...@beintoo.com> wrote:
> Hi,
>
> We are experiencing scheduling errors due to Mesos slave failures.
> It seems to be an open bug; more information can be found here:
>
> https://issues.apache.org/jira/browse/SPARK-3289
>
> According to this link
> <https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3ccaakwvaxprrnrcdlazcybnmk1_9elyheodaf8urf8ssrlbac...@mail.gmail.com%3E>
> from the mail archive, it seems that Spark doesn't reschedule LOST tasks to
> active executors, but keeps trying to reschedule them on the failed host.

Are you running in fine-grained mode? In coarse-grained mode it seems that Spark will notice a slave that fails repeatedly and will not accept offers on that slave:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L188

> We would like to dynamically resize our Mesos cluster (adding or removing
> machines - using an AWS autoscaling group), but this bug kills our running
> applications if a Mesos slave running a Spark executor is shut down.

I think what you need is dynamic allocation, which should be available soon (PR: 4984 <https://github.com/apache/spark/pull/4984>).

> Is there any known workaround?
>
> Thank you
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Mesos-task-rescheduling-tp23740.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com
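P.S. For anyone curious what the coarse-grained backend does at the line linked above: it comes down to a per-slave failure counter, after which offers from that slave are declined. A minimal sketch of that pattern (this is an illustration of the idea, not the actual `CoarseMesosSchedulerBackend` code; the class and method names here are made up):

```scala
import scala.collection.mutable

// Sketch: count executor losses per slave, and stop accepting
// resource offers from a slave once it has failed too often.
class SlaveFailureTracker(maxFailures: Int = 2) {
  private val failuresBySlaveId =
    mutable.HashMap.empty[String, Int].withDefaultValue(0)

  // Record that an executor on this slave was lost.
  def recordFailure(slaveId: String): Unit =
    failuresBySlaveId(slaveId) += 1

  // Offers from a slave that has failed too often are declined.
  def acceptOffer(slaveId: String): Boolean =
    failuresBySlaveId(slaveId) < maxFailures
}
```

With something like this, a slave that loses a couple of executors is effectively blacklisted for the lifetime of the application, which is why cycling instances out of an autoscaling group pairs better with dynamic allocation than with long-lived static executors.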