[ https://issues.apache.org/jira/browse/SPARK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-1848.
------------------------------
    Resolution: Cannot Reproduce

I think this is at least stale at this point.

> Executors are mysteriously dying when using Spark on Mesos
> ----------------------------------------------------------
>
>                 Key: SPARK-1848
>                 URL: https://issues.apache.org/jira/browse/SPARK-1848
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Spark Core
>    Affects Versions: 1.0.0
>         Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>                      java version "1.7.0_51"
>                      Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
>                      Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>                      Mesos 0.18.0
>                      Spark Master
>            Reporter: Bouke van der Bijl
>
> Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a
> We have 47 machines running Mesos that we're trying to run Spark jobs on, but the jobs fail at some point because tasks have to be rescheduled too often, which happens because Spark kills tasks whose executors have died. When I look at the stderr or stdout of the Mesos slaves, there seems to be no indication of an error, and sometimes I can see a "14/05/15 17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from <id>", which would indicate that the executor just keeps going and hasn't actually died. If I add a Thread.dumpStack() at the location where the task is killed, this is the trace it returns:
> at java.lang.Thread.dumpStack(Thread.java:1364)
> at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
> at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
> at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
> at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
> at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
> What could cause this? Is this a setup problem with our cluster or a bug in Spark?
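For context on the diagnostic described above: Thread.dumpStack() simply prints the calling thread's stack to stderr, so placing it at the point where a task is marked failed shows which code path (here statusUpdate -> removeExecutor -> executorLost -> handleFailedTask) triggered the kill. Below is a minimal, standalone Scala sketch of that technique; the object and method names are illustrative stand-ins for the Spark classes in the trace, not Spark's actual source.

    // DumpStackExample.scala -- standalone illustration, not Spark code.
    // Calling Thread.dumpStack() inside a method prints the current call chain
    // to stderr, which is how the trace quoted above was captured.
    object DumpStackExample {
      // Stand-in for TaskSetManager.handleFailedTask (hypothetical simplification)
      def handleFailedTask(): Unit = {
        Thread.dumpStack() // prints e.g. handleFailedTask <- executorLost <- main
      }

      // Stand-in for TaskSetManager.executorLost, which invokes handleFailedTask
      def executorLost(): Unit = handleFailedTask()

      def main(args: Array[String]): Unit = executorLost()
    }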
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org