Hello,

*TL;DR: a task crashes with an OOM, but the application gets stuck in an infinite loop, retrying the task over and over again instead of failing fast.*
I'm using Spark 1.4.0 standalone, with DataFrames, on Java 7. I have an application that does some aggregations. I played around with the shuffle settings, which led to the dreaded "Java heap space" error; see the stack trace at the end of this message. When this happens, I see tens of executors in the "EXITED" state, a couple in "LOADING", and one in "RUNNING". All of them retry the same task over and over again, and keep failing with the same "Java heap space" error. This goes on for hours! Why doesn't the whole application fail when the individual executors keep failing with the same error?
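For context, the kind of settings I've been playing with look roughly like this. This is only a minimal sketch: the values and the app name are illustrative, not my exact configuration. In particular, my understanding is that spark.task.maxFailures (default 4) should abort the job after four failures of the same task, which is why the endless retrying surprises me.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConfSketch {
    public static void main(String[] args) {
        // Illustrative values only -- not my exact configuration.
        SparkConf conf = new SparkConf()
                .setAppName("aggregation-job")               // hypothetical app name
                .set("spark.executor.memory", "4g")          // heap available to each executor
                .set("spark.shuffle.memoryFraction", "0.2")  // Spark 1.x: heap fraction for shuffle aggregation buffers
                .set("spark.sql.shuffle.partitions", "200")  // parallelism of DataFrame shuffles
                .set("spark.task.maxFailures", "4");         // failures of one task before the job should abort
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... DataFrame aggregations go here ...
        sc.stop();
    }
}
```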
Thanks,
Romi K.

---
end of the log in a failed task:

15/07/21 11:13:40 INFO executor.Executor: Finished task 117.0 in stage 218.1 (TID 305). 2000 bytes result sent to driver
15/07/21 11:13:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 306
15/07/21 11:13:41 INFO executor.Executor: Running task 0.0 in stage 219.1 (TID 306)
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Updating epoch to 420 and clearing cache
15/07/21 11:13:41 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8
15/07/21 11:13:41 INFO storage.MemoryStore: ensureFreeSpace(5463) called with curMem=285917, maxMem=1406164008
15/07/21 11:13:41 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 5.3 KB, free 1340.7 MB)
15/07/21 11:13:41 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8 took 22 ms
15/07/21 11:13:41 INFO storage.MemoryStore: ensureFreeSpace(10880) called with curMem=291380, maxMem=1406164008
15/07/21 11:13:41 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 10.6 KB, free 1340.7 MB)
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 136, fetching them
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@1.2.3.4:57490/user/MapOutputTracker#-99712578])
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Got the output locations
15/07/21 11:13:41 INFO storage.ShuffleBlockFetcherIterator: Getting 182 non-empty blocks out of 182 blocks
15/07/21 11:13:41 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 28 ms
15/07/21 11:14:34 ERROR executor.Executor: Exception in task 0.0 in stage 219.1 (TID 306)
java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:192)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
15/07/21 11:14:34 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:192)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
15/07/21 11:14:34 INFO storage.DiskBlockManager: Shutdown hook called
15/07/21 11:14:34 INFO util.Utils: Shutdown hook called
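One more observation: the trace shows the OOM inside Sort.doExecute, at the point where Iterator.toArray buffers the entire partition before sorting it. A simplified sketch of that pattern (illustrative only, not Spark's actual code) shows why a single oversized post-shuffle partition can blow the executor heap:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SortPartitionSketch {
    // Roughly the pattern in the stack trace: collect the whole partition
    // into a growable in-memory buffer (like Scala's ArrayBuffer), then sort.
    static <T extends Comparable<T>> List<T> sortPartition(Iterator<T> rows) {
        List<T> buffer = new ArrayList<T>();
        while (rows.hasNext()) {
            buffer.add(rows.next()); // heap use grows with partition size; OOMs if the partition is too large
        }
        Collections.sort(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        List<Integer> partition = Arrays.asList(3, 1, 2);
        System.out.println(sortPartition(partition.iterator())); // prints [1, 2, 3]
    }
}
```

If that reading is right, retrying the same task on the same data can never succeed, which makes the endless retry loop even stranger.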