[ https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-604.
-----------------------------
    Resolution: Cannot Reproduce

Stale at this point, without similar findings recently.

> reconnect if mesos slaves dies
> ------------------------------
>
>                 Key: SPARK-604
>                 URL: https://issues.apache.org/jira/browse/SPARK-604
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>
> when running on mesos, if a slave goes down, spark doesn't try to reassign
> the work to another machine. Even if the slave comes back up, the job is
> doomed.
> Currently when this happens, we just see this in the driver logs:
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost:
> 201210312057-1560611338-5050-24091-52
> Exception in thread "Thread-346" java.util.NoSuchElementException: key not
> found: value: "201210312057-1560611338-5050-24091-52"
>         at scala.collection.MapLike$class.default(MapLike.scala:224)
>         at scala.collection.mutable.HashMap.default(HashMap.scala:43)
>         at scala.collection.MapLike$class.apply(MapLike.scala:135)
>         at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
>         at spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
>         at spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned
> with code DRIVER_ABORTED

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
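For context on the stack trace above: the crash comes from Scala's `Map.apply`, which throws `NoSuchElementException` when the key is absent, so `slaveLost()` aborted the driver whenever Mesos reported a slave that the scheduler had no record of. A minimal sketch of that failure mode and the defensive `get` lookup that avoids it (the map name and slave IDs here are hypothetical, not the actual `ClusterScheduler` fields):

```scala
import scala.collection.mutable

object SlaveLostSketch {
  // Hypothetical map from Mesos slave ID to host, standing in for
  // the scheduler's internal bookkeeping.
  val slaveIdToHost = mutable.HashMap("slave-1" -> "host-a")

  def main(args: Array[String]): Unit = {
    // apply() throws NoSuchElementException for an unknown key --
    // the same exception seen in the driver log above.
    try {
      slaveIdToHost("slave-2")
    } catch {
      case e: NoSuchElementException => println(s"caught: ${e.getMessage}")
    }

    // A defensive get() lets a slaveLost handler ignore unknown slaves
    // instead of crashing.
    slaveIdToHost.get("slave-2") match {
      case Some(host) => println(s"removing executor on $host")
      case None       => println("unknown slave; ignoring")
    }
  }
}
```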