We have been trying to solve a memory issue with a Spark job that processes 150GB of data (on disk). It does a groupBy operation; some of the executors receive somewhere around 2-4M Scala case objects to work with. We are using the following Spark config:
executorInstances: 15
executorCores: 1 (we reduced this to 1 so that a single task gets all of the executorMemory; at least that's the assumption here)
executorMemory: 15000m
minPartitions: 2000
taskCpus: 1
executorMemoryOverhead: 1300
shuffleManager: tungsten-sort
storageFraction: 0.4

This is a snippet of what we see in the Spark UI for a job that fails. First, the failing *stage* (flattened from the stage table; the Input and Output columns were empty):

Stage Id: 5 (retry 15)
Pool Name: prod <http://hdn7:18080/history/application_1454975800192_0447/stages/pool?poolname=prod>
Description: map at SparkDataJobs.scala:210 <http://hdn7:18080/history/application_1454975800192_0447/stages/stage?id=5&attempt=15>
Submitted: 2016/02/09 21:30:06
Duration: 13 min
Tasks (Succeeded/Total): 130/389 (16 failed)
Shuffle Read: 1982.6 MB
Shuffle Write: 818.7 MB
Failure Reason: org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/fasd/appcache/application_1454975800192_0447/blockmgr-abb77b52-9761-457a-b67d-42a15b975d76/0c/shuffle_0_39_0.data, offset=11421300, length=2353}

And this is one of the *task* attempts from that stage that threw the OOM:

Index: 2
Task ID: 22361
Attempt: 0
Status: FAILED
Locality Level: PROCESS_LOCAL
Executor ID / Host: 38 / nd1.mycom.local
Launch Time: 2016/02/09 22:10:42
Duration: 5.2 min
GC Time: 1.6 min
Shuffle Read Size / Records: 7.4 MB / 375509
Error:
java.lang.OutOfMemoryError: Java heap space
    at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
    at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
    at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
    at
org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:3 [trace truncated]

None of the above suggests that it ran out of the 15GB of memory I initially allocated, so what am I missing here? What's eating my memory?

We tried executorJavaOpts to get a heap dump, but it doesn't seem to work:

-XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p' -XX:HeapDumpPath=/opt/cores/spark

I don't see any cores being generated, nor can I find a heap dump anywhere in the logs.

Also, how do I find the YARN container ID for a given Spark executor ID, so that I can investigate the YARN NodeManager and ResourceManager logs for that particular container?

PS - The job does not cache any intermediate RDDs, as each RDD is used only once by the subsequent step. We use Spark 1.5.2 on YARN in yarn-client mode.

Thanks
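PS - To make the data shape concrete (hypothetical names; plain Scala collections standing in for the RDD): the groupBy means every record sharing a key must be materialized in a single task's heap, which is where the 2-4M case objects per executor come from.

```scala
// Hypothetical sketch -- names invented; a plain-collections analogue
// of what rdd.groupBy does with the keys that land on one task.
case class Event(accountId: String, payload: String)

val events = Seq(
  Event("a1", "p1"), Event("a1", "p2"), Event("a2", "p3")
)

// All values sharing a key end up in one in-memory collection:
val grouped: Map[String, Seq[Event]] = events.groupBy(_.accountId)

// A hot key holding 2-4M case objects means that one Seq alone must
// fit on a single executor's heap.
println(grouped("a1").size)
```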
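PS - For completeness, a condensed sketch of how the settings above map onto standard Spark property names (assuming our internal keys like "executorInstances" correspond to the usual spark.* properties; this is not our exact code):

```scala
import org.apache.spark.SparkConf

// Sketch only (Spark 1.5.2 on YARN); assumes our wrapper config maps
// onto the standard Spark property names.
val conf = new SparkConf()
  .set("spark.executor.instances", "15")
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "15000m")
  .set("spark.yarn.executor.memoryOverhead", "1300")
  .set("spark.shuffle.manager", "tungsten-sort")
  .set("spark.memory.storageFraction", "0.4")
  // The JVM opts we tried for the heap dump; note that a JVM boolean
  // flag is enabled with "-XX:+<Flag>" and disabled with "-XX:-<Flag>".
  .set("spark.executor.extraJavaOptions",
    "-XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p' " +
    "-XX:HeapDumpPath=/opt/cores/spark")
```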