I’ve been studying Spark RDD persistence with spark-perf (https://github.com/databricks/spark-perf), especially when the dataset size starts to exceed available memory.
I’m running Spark 1.6.0 on YARN with CDH 5.7. I have 10 NodeManager nodes, each with 16 vcores and 32 GB of container memory, so I’m running 39 executors with 4 cores and 8 GB each (6 GB spark.executor.memory plus 2 GB spark.yarn.executor.memoryOverhead). I’m using the default values for spark.memory.fraction and spark.memory.storageFraction, which leaves 3.1 GB per executor for caching RDDs, or about 121 GB across the cluster.

I’m running a single Random Forest test with 500 features and up to 40 million examples, with 1 partition per core, or 156 partitions total. The code (at https://github.com/databricks/spark-perf/blob/master/mllib-tests/v1p5/src/main/scala/mllib/perf/MLAlgorithmTests.scala#L653) caches the input RDD immediately after creation.

At 30M examples the dataset fits in memory: all 156 partitions are cached, for a total of 113.4 GB, or 4 blocks of about 745 MB each per executor. So far so good.

At 40M examples, I expected about 3 partitions per executor to fit in memory, i.e. roughly 75% of the RDD cached. Instead, only 3 partitions across the entire cluster were cached (about 2%), for a total of 2.9 GB in memory. Three of the executors each held one block of 992 MB, with 2.1 GB free (enough for 2 more blocks). The other 36 held no blocks at all, with 3.1 GB free (enough for 3 blocks). Why this dramatic falloff?

Thinking this might improve if I changed the persistence level, I switched to MEMORY_AND_DISK. Unfortunately the executor memory was then exceeded (“Container killed by YARN for exceeding memory limits. 8.9 GB of 8 GB physical memory used”) and the run ground to a halt. Why does persisting to disk take more memory than caching to memory?

Is this behavior expected as dataset size exceeds available memory?

Thanks in advance,

Dave Jaffe
Big Data Performance
VMware
dja...@vmware.com
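
P.S. For reference, the executor sizing above corresponds to roughly the following spark-submit settings (a sketch for clarity; in my runs these values are set through spark-perf’s config file rather than passed on the command line, and the other options are left at their defaults):

    spark-submit \
      --master yarn \
      --num-executors 39 \
      --executor-cores 4 \
      --executor-memory 6g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      ...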
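
And the persistence change for the second experiment is essentially this (a sketch; the variable name is illustrative, and the actual call is at the MLAlgorithmTests.scala line linked above, which uses the default in-memory cache):

    import org.apache.spark.storage.StorageLevel

    // original spark-perf behavior: in-memory cache (MEMORY_ONLY)
    // rdd.cache()

    // my modification: spill partitions that don't fit in memory to disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK)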