Bouke van der Bijl created SPARK-1848:
-----------------------------------------

             Summary: Executors are mysteriously dying when using Spark on Mesos
                 Key: SPARK-1848
                 URL: https://issues.apache.org/jira/browse/SPARK-1848
             Project: Spark
          Issue Type: Bug
          Components: Mesos, Spark Core
    Affects Versions: 1.0.0
         Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Mesos 0.18.0

Spark Master
            Reporter: Bouke van der Bijl


Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a

We have 47 machines running Mesos that we're trying to run Spark jobs on, but the jobs eventually fail because their tasks get rescheduled too often; Spark kills the tasks because it thinks the executors have died. When I look at the stderr or stdout of the Mesos slaves, there is no indication of any error, and sometimes I see a "14/05/15 17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from <id>" message, which suggests that the executor is actually still running and hasn't died at all. If I add a Thread.dumpStack() at the location where the task is killed, this is the trace it returns (a sketch of that instrumentation follows the trace):

        at java.lang.Thread.dumpStack(Thread.java:1364)
        at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
        at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
        at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
        at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
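
For reference, here is a minimal standalone Scala sketch of the instrumentation, not the actual Spark source: Thread.dumpStack() prints the current call path to stderr, which is how the trace above was captured after locally patching TaskSetManager.handleFailedTask. The object and method names below are illustrative only.

    // Standalone sketch: Thread.dumpStack() prints the current call path to
    // stderr. In the real debugging patch the call sits inside
    // TaskSetManager.handleFailedTask; the names here are just for illustration.
    object DumpStackExample {
      def handleFailedTask(taskId: Long): Unit = {
        Thread.dumpStack()                       // emit a trace like the one above
        println(s"task $taskId marked as failed")
      }

      def main(args: Array[String]): Unit = {
        handleFailedTask(42L)
      }
    }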

What could cause this? Is this a setup problem with our cluster, or a bug in Spark?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
