[ https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tatiana Borisova updated SPARK-3150:
------------------------------------

    Description: 
The issue occurs when Spark runs in standalone mode on a cluster.

When the master and the driver fail simultaneously on one node of the
cluster, the master tries to recover its state and restart the Spark driver.
While restarting the driver, the master crashes with a NullPointerException
(stack trace below). After crashing, it restarts, tries to recover its state,
and attempts to restart the Spark driver again, looping this way indefinitely.

Specifically, the master reads the persisted DriverInfo state back from
ZooKeeper, but after deserialization the DriverInfo.worker field is null.
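
For illustration, below is a minimal, self-contained Scala sketch of the
suspected failure mode. The class and field names mirror the stack trace
(DriverInfo.worker, Master.completeRecovery), but the bodies are simplified
assumptions, not the actual Spark sources: a @transient field of a
Java-serialized object comes back as null, not None, after deserialization,
so a later filter over it throws a NullPointerException.

import java.io._

// Simplified stand-in for org.apache.spark.deploy.master.DriverInfo.
// The worker field is @transient, so Java serialization skips it; on
// deserialization the JVM leaves it null, NOT the initializer value
// None, because field initializers do not re-run for serializable
// classes.
class DriverInfo(val id: String) extends Serializable {
  @transient var worker: Option[String] = None
}

object TransientNullDemo {
  // Roughly what persisting to and reading back from ZooKeeper does.
  def roundTrip(d: DriverInfo): DriverInfo = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(d)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[DriverInfo]
  }

  def main(args: Array[String]): Unit = {
    val recovered = roundTrip(new DriverInfo("driver-1"))
    println(recovered.worker)                // prints "null", not "None"
    // Analogue of the filter inside Master.completeRecovery: calling a
    // method on the null Option throws the reported NPE.
    Seq(recovered).filter(_.worker.isEmpty)  // java.lang.NullPointerException
  }
}

Under that assumption, a fix would re-initialize the transient fields after
deserialization (e.g., via a custom readObject), or treat a null worker as an
unassigned driver during recovery.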

Stack trace (from version 1.0.0; also reproducible on version 1.0.2):

[2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
        at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
        at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
        at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
        at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
        at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

How to reproduce: while running Spark standalone on a cluster, kill all Spark
processes on the node where the driver runs (i.e., kill the driver, master,
and worker simultaneously).

  was:
The issue occurs when Spark runs in standalone mode on a cluster.

When the master and the driver fail simultaneously on one node of the
cluster, the master tries to recover its state and restart the Spark driver.
While restarting the driver, the master crashes with a NullPointerException
(stack trace below). After crashing, it restarts, tries to recover its state,
and attempts to restart the Spark driver again, looping this way indefinitely.

Specifically, the master reads the persisted DriverInfo state back from
ZooKeeper, but after deserialization the DriverInfo.worker field is null.

Stack trace (from version 1.0.0; also reproducible on version 1.0.2):

[2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
        at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
        at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
        at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
        at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
        at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

How to reproduce: while running Spark standalone on a cluster, kill both the
master and driver processes on the same cluster node.


> NullPointerException in Spark recovery after simultaneous failure of master
> and driver
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-3150
>                 URL: https://issues.apache.org/jira/browse/SPARK-3150
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2
>         Environment:  Linux 3.2.0-23-generic x86_64
>            Reporter: Tatiana Borisova
>
> The issue occurs when Spark runs in standalone mode on a cluster.
> When the master and the driver fail simultaneously on one node of the
> cluster, the master tries to recover its state and restart the Spark driver.
> While restarting the driver, the master crashes with a NullPointerException
> (stack trace below). After crashing, it restarts, tries to recover its
> state, and attempts to restart the Spark driver again, looping this way
> indefinitely.
> Specifically, the master reads the persisted DriverInfo state back from
> ZooKeeper, but after deserialization the DriverInfo.worker field is null.
> Stack trace (from version 1.0.0; also reproducible on version 1.0.2):
> [2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy)
> java.lang.NullPointerException
>         at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
>         at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
>         at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>         at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>         at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
>         at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>         at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
>         at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> How to reproduce: while running Spark standalone on a cluster, kill all
> Spark processes on the node where the driver runs (i.e., kill the driver,
> master, and worker simultaneously).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
