Aaron Davidson created SPARK-1769:
-------------------------------------

             Summary: Executor loss can cause race condition in Pool
                 Key: SPARK-1769
                 URL: https://issues.apache.org/jira/browse/SPARK-1769
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.0
            Reporter: Aaron Davidson


Loss of executors (in this case due to OOMs) exposes a race condition in 
Pool.scala, evident from this stack trace:

{code}
14/05/08 22:41:48 ERROR OneForOneStrategy:
java.lang.NullPointerException
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
        at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
        at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

Note that the line of code that throws this exception is here:
{code}
schedulableQueue.foreach(_.executorLost(executorId, host))
{code}

By the stack trace, it's not schedulableQueue that is null, but an element 
therein. As far as I could tell, we never add a null element to this queue. 
Rather, I could see (via log messages) that removeSchedulable() and 
executorLost() were called at about the same time, and I suspect that, since 
this ArrayBuffer is not synchronized in any way, we iterate through the list 
while it's in an inconsistent state.
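
For illustration, here is a minimal standalone sketch (not Spark code; Entry 
is a hypothetical stand-in for a Schedulable, and all names are mine) of how 
an unsynchronized ArrayBuffer can hand a null element to a concurrent foreach: 
remove() nulls out the vacated trailing slot before the size field is 
decremented, so an iterating thread can read that slot and then NPE when it 
calls a method on it. The repro is nondeterministic and assumes the Scala 2.10 
collections that Spark 1.0 builds against; newer Scala versions may fail in a 
different way.

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a Schedulable entry held in the Pool's queue.
class Entry {
  def executorLost(executorId: String, host: String): Unit = ()
}

object PoolRaceSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBuffer[Entry]()
    (1 to 100000).foreach(_ => queue += new Entry)

    // Analogous to removeSchedulable(): shrinks the buffer; remove() nulls the
    // vacated trailing slot before the buffer's size is decremented.
    val remover = new Thread(new Runnable {
      def run(): Unit = while (queue.nonEmpty) queue.remove(queue.length - 1)
    })

    // Analogous to executorLost(): iterates with no synchronization and can
    // observe a nulled slot, producing an NPE like the one in the trace above.
    val walker = new Thread(new Runnable {
      def run(): Unit =
        try queue.foreach(_.executorLost("exec-1", "host-1"))
        catch {
          // Expect a NullPointerException; other racy failures are possible
          // too, since nothing guards the buffer.
          case t: Throwable => println("Race observed: " + t)
        }
    })

    remover.start(); walker.start()
    remover.join(); walker.join()
  }
}
{code}

One possible direction for a fix (only a sketch, not a committed change) would 
be to guard schedulableQueue with a lock or replace it with a concurrent 
collection such as java.util.concurrent.ConcurrentLinkedQueue.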


