Hi All,

I'm interested in collect()ing a large RDD so that I can run a learning algorithm on it. I've noticed that I can run out of memory when I don't increase SPARK_DRIVER_MEMORY. I've also noticed that the same fraction of memory appears to be reserved for storage on the driver as on the worker nodes, and that the web UI doesn't show any storage usage on the driver. Since that memory is reserved for storage, it seems possible that it is not available for collecting my RDD.
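For reference, this is how I'm currently increasing the driver's heap, either through the environment variable mentioned above or through the equivalent spark-submit flag (the script name and the 8g size are just placeholders for my setup):

```shell
# Option 1: environment variable, e.g. in conf/spark-env.sh
export SPARK_DRIVER_MEMORY=8g

# Option 2: equivalent spark-submit flag
spark-submit --driver-memory 8g my_script.py
```

Both seem to behave the same for me; the collect() still fails unless the value is raised well above the default.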
Is there a way to configure the memory management settings (spark.storage.memoryFraction, spark.shuffle.memoryFraction) for the driver separately from the workers? Is there any reason to leave space for shuffle or storage on the driver? I never seem to see either of these used in the web UI, although I may be misinterpreting the UI, or my jobs may not trigger that use case. For context, I am using PySpark (so much of my processing happens outside the memory allocated to the JVM) and I'm running the Spark 1.1.0 release binaries.

best,
-Brad
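P.S. For concreteness, these are the two properties I'd like to split between driver and executors; as far as I can tell they can only be set globally, like so (values and script name are just examples):

```shell
# Both fractions appear to apply to driver and executors alike
spark-submit \
  --driver-memory 8g \
  --conf spark.storage.memoryFraction=0.2 \
  --conf spark.shuffle.memoryFraction=0.3 \
  my_script.py
```

Ideally I could set spark.storage.memoryFraction close to 0 on the driver only, freeing that reserved space for the collected RDD, while leaving the executors' caching behavior unchanged.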