Hi All,

I would like to collect() a large RDD so that I can run a learning
algorithm on it.  I've noticed that when I don't increase
SPARK_DRIVER_MEMORY, the driver can run out of memory.  It also looks
like the same fraction of memory is reserved for storage on the driver
as on the worker nodes, yet the web UI never shows any storage usage on
the driver.  Since that memory is reserved for storage, it may not be
available for collecting my RDD.
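For reference, here is a rough sketch of what I'm doing today; the
memory value, script name, and input path are just placeholders:

    # Launched with the driver heap bumped up front, e.g.:
    #   SPARK_DRIVER_MEMORY=8g bin/spark-submit my_job.py
    # (8g and my_job.py are placeholders)
    from pyspark import SparkContext

    sc = SparkContext(appName="collect-large-rdd")

    rdd = sc.textFile("hdfs:///path/to/input")  # placeholder input path
    local_data = rdd.collect()  # the whole dataset must fit on the driver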

Is there a way to configure the memory-management settings
(spark.storage.memoryFraction, spark.shuffle.memoryFraction) for the
driver separately from the workers?
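For what it's worth, this is roughly how I set those fractions now (the
values are placeholders); as far as I can tell they apply to the driver
and the executors alike:

    from pyspark import SparkConf, SparkContext

    # These fractions seem to apply to both the driver and the executors;
    # I haven't found a driver-specific variant of either setting.
    conf = (SparkConf()
            .set("spark.storage.memoryFraction", "0.2")   # placeholder value
            .set("spark.shuffle.memoryFraction", "0.2"))  # placeholder value
    sc = SparkContext(conf=conf)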

Is there any reason to leave space for shuffle or storage on the driver?
I never see either of them used in the web UI, although I may be
misreading the UI or my jobs may simply not trigger that use case.

For context, I am using PySpark (so much of my processing happens outside
of the memory allocated to the JVM) and running the Spark 1.1.0 release
binaries.

best,
-Brad
