Hi all,

I'm dealing with some strange error messages that I *think* come down to a memory issue, but I'm having a hard time pinning it down and could use some guidance from the experts.

I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; one has 16GB of memory, the other (the master) has 32GB. My application computes pairwise pixel affinities in images; the images I've tested so far range from 16x16 up to 1920x1200.
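
The relevant part of the job, stripped down, looks roughly like the sketch below (simplified for this email; how the pair RDD is built here is just illustrative, but the broadcast IMAGE, the mapped lambda, and the collect() match the traceback further down):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="PixelAffinities")

    # Flattened grayscale image, broadcast so every executor has a copy.
    image = np.random.rand(16 * 16)   # stand-in for the real image data
    IMAGE = sc.broadcast(image)

    n = image.size
    # All (i, j) pixel-index pairs -- note this grows as n^2, which is
    # enormous for a 1920x1200 image.
    indices = sc.parallelize(range(n))
    pairs = indices.cartesian(indices)

    # Affinity = absolute intensity difference for each pair, collected
    # back to the driver.
    affinities = pairs.map(
        lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
    ).collect()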

I did have to change a few memory and parallelism settings; otherwise I was getting explicit OutOfMemoryErrors. In spark-defaults.conf:

    spark.executor.memory    14g
    spark.default.parallelism    32
    spark.akka.frameSize        1000

In spark-env.sh:

    SPARK_DRIVER_MEMORY=10G
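
(For what it's worth, I believe the spark-defaults.conf entries above could equivalently be set programmatically on a SparkConf before the context is created; a sketch, in case the way the settings are applied matters:)

    from pyspark import SparkConf, SparkContext

    # Same values as in spark-defaults.conf, set in code instead;
    # my understanding is they must be set before the SparkContext starts.
    conf = (SparkConf()
            .set("spark.executor.memory", "14g")
            .set("spark.default.parallelism", "32")
            .set("spark.akka.frameSize", "1000"))
    sc = SparkContext(conf=conf, appName="PixelAffinities")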

With those settings, however, I get a bunch of "Lost TID" WARN statements (no task completes successfully) as well as lost executors; this repeats 4 times until I finally get the following error and the job crashes:

---

14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at /home/user/Programming/PySpark-Affinities/affinity.py:243
Traceback (most recent call last):
File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
    lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
    bytesInJava = self._jrdd.collect().iterator()
File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__ File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:13 failed 4 times, most recent failure: *TID 32 on host master.host.univ.edu failed for unknown reason*
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
user@master:~/Programming/PySpark-Affinities$

---

If I run the really small image instead (16x16), it *appears* to run to completion: it gives me the output I expect without any exceptions being thrown. However, the stderr log for that app lists its state as "KILLED", with the final message being "ERROR CoarseGrainedExecutorBackend: Driver Disassociated". If I run any larger image, I get the exception pasted above.

Furthermore, if I just do a spark-submit with master=local[*] (aside from still needing the memory settings above), it works for an image of *any* size; I've tested both machines independently, and each behaves this way when running as local[*]. Running on the cluster, by contrast, results in the crash at stage 0 described above for anything but the smallest images.
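
For concreteness, the two invocations look roughly like this (paths and the image argument simplified; the cluster URL assumes the default standalone port):

    # works for an image of any size:
    spark-submit --master local[*] affinity.py <image>

    # crashes at stage 0 for anything but the smallest images:
    spark-submit --master spark://master.host.univ.edu:7077 affinity.py <image>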

Any ideas what is going on?

Thank you very much in advance!

Regards,
Shannon
