Bouke van der Bijl created SPARK-1848:
-----------------------------------------
             Summary: Executors are mysteriously dying when using Spark on Mesos
                 Key: SPARK-1848
                 URL: https://issues.apache.org/jira/browse/SPARK-1848
             Project: Spark
          Issue Type: Bug
          Components: Mesos, Spark Core
    Affects Versions: 1.0.0
         Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Mesos 0.18.0
Spark Master
            Reporter: Bouke van der Bijl


Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a

We have 47 machines running Mesos that we're trying to run Spark jobs on, but the jobs fail at some point because tasks have to be rescheduled too often. The rescheduling is caused by Spark killing tasks when it decides their executors have died. When I look at the stderr or stdout of the Mesos slaves, there seems to be no indication of an error, and sometimes I can see a "14/05/15 17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from <id>", which would indicate that the executor just keeps going and hasn't actually died.
If I add a Thread.dumpStack() at the location where the job is killed, this is the trace it returns:

at java.lang.Thread.dumpStack(Thread.java:1364)
at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)

What could cause this? Is it a setup problem with our cluster or a bug in Spark?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
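For anyone reproducing this debugging step, here is a minimal standalone sketch of the technique used above: dropping a stack dump into a failure handler to see which caller marked the task as failed. All class and method names below are illustrative stand-ins, not Spark's actual code.

```java
// Illustrative sketch only: mimics adding Thread.dumpStack() inside a
// failure handler. handleFailedTask/statusUpdate are hypothetical stand-ins
// for the Spark scheduler methods named in the trace above.
public class DumpStackDemo {

    // Stand-in for TaskSetManager.handleFailedTask
    static void handleFailedTask() {
        // Thread.dumpStack() prints the current trace to stderr;
        // getStackTrace() returns the same frames programmatically,
        // which makes them easier to redirect into a log.
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            System.out.println("at " + frame);
        }
    }

    // Stand-in for the scheduler's statusUpdate entry point
    static void statusUpdate() {
        handleFailedTask();
    }

    public static void main(String[] args) {
        statusUpdate();
    }
}
```

Running it prints one "at ..." line per frame, showing the call path from main through statusUpdate into handleFailedTask, analogous to the trace pasted above.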