Hello,

*TL;DR: a task crashes with an OOM, but the application gets stuck in an
infinite loop, retrying the task over and over instead of failing fast.*

Using Spark 1.4.0 in standalone mode, with DataFrames, on Java 7.
I have an application that does some aggregations. I played around with the
shuffle settings, which led to the dreaded "Java heap space" error. See the
stack trace at the end of this message.
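
In case the shape of the job matters, it is roughly like the sketch below.
This is a minimal reconstruction, not the real code: the paths, column names,
and setting values are placeholders, and the two shuffle settings are just the
kind of knobs I was experimenting with (both exist in 1.4):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.functions;

SparkConf conf = new SparkConf()
    .setAppName("aggregations")
    // shuffle-related settings I was playing with (values illustrative):
    .set("spark.sql.shuffle.partitions", "200")
    .set("spark.shuffle.memoryFraction", "0.2");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// placeholder input/output paths and column names
DataFrame df = sqlContext.read().parquet("/path/to/input");
DataFrame agg = df.groupBy("someKey").agg(functions.sum("someValue"));
agg.write().parquet("/path/to/output");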

When this happens, I see tens of executors in the "EXITED" state, a couple in
"LOADING", and one in "RUNNING". They all keep retrying the same task and
failing with the same "Java heap space" error. This goes on for hours!

Why doesn't the whole application fail when the individual executors keep
failing with the same error?
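
For what it's worth, I expected the stage to be aborted after
spark.task.maxFailures attempts per task (4 by default, if I understand
correctly), i.e. something like:

// My understanding (which may be wrong) is that this bounds the number of
// attempts per task before the whole job is aborted:
SparkConf conf = new SparkConf()
    .set("spark.task.maxFailures", "4"); // 4 is already the default

But the tasks seem to be retried indefinitely regardless.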

Thanks,
Romi K.

---

End of the log from a failed task:

15/07/21 11:13:40 INFO executor.Executor: Finished task 117.0 in stage 218.1 (TID 305). 2000 bytes result sent to driver
15/07/21 11:13:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 306
15/07/21 11:13:41 INFO executor.Executor: Running task 0.0 in stage 219.1 (TID 306)
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Updating epoch to 420 and clearing cache
15/07/21 11:13:41 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8
15/07/21 11:13:41 INFO storage.MemoryStore: ensureFreeSpace(5463) called with curMem=285917, maxMem=1406164008
15/07/21 11:13:41 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 5.3 KB, free 1340.7 MB)
15/07/21 11:13:41 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8 took 22 ms
15/07/21 11:13:41 INFO storage.MemoryStore: ensureFreeSpace(10880) called with curMem=291380, maxMem=1406164008
15/07/21 11:13:41 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 10.6 KB, free 1340.7 MB)
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 136, fetching them
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@1.2.3.4:57490/user/MapOutputTracker#-99712578])
15/07/21 11:13:41 INFO spark.MapOutputTrackerWorker: Got the output locations
15/07/21 11:13:41 INFO storage.ShuffleBlockFetcherIterator: Getting 182 non-empty blocks out of 182 blocks
15/07/21 11:13:41 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote fetches in 28 ms
15/07/21 11:14:34 ERROR executor.Executor: Exception in task 0.0 in stage 219.1 (TID 306)
java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:192)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
15/07/21 11:14:34 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:192)
    at org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5$$anonfun$apply$5.apply(basicOperators.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
15/07/21 11:14:34 INFO storage.DiskBlockManager: Shutdown hook called
15/07/21 11:14:34 INFO util.Utils: Shutdown hook called
