driver memory management

2014-09-28 Thread Brad Miller
Hi All,

I would like to collect() a large RDD so that I can run a learning
algorithm on it.  I've noticed that when I don't increase
SPARK_DRIVER_MEMORY I can run out of memory.  I've also noticed that the
same fraction of memory seems to be reserved for storage on the driver as
on the worker nodes, and that the web UI doesn't show any storage usage on
the driver.  Since that memory is reserved for storage, it seems possible
that it is not available for collecting my RDD.

Is there a way to configure the memory management (
spark.storage.memoryFraction, spark.shuffle.memoryFraction) for the driver
separately from the workers?

Is there any reason to leave space for shuffle or storage on the driver?
It seems like I never see either of these used on the web UI, although I
may not be interpreting the UI correctly or my jobs may not trigger the use
case.

For context, I am using PySpark (so much of my processing happens in
Python, outside the memory allocated to the JVM) and I am running the
Spark 1.1.0 release binaries.

best,
-Brad


Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage; it
doesn't limit anything else.  I.e., collect() can use all of the driver's
memory if it needs to.
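To illustrate the distinction with a toy sketch (this models the accounting, not Spark's internals): the storage fraction is an upper bound on what the block manager will cache, while everything else, including the objects materialized by collect(), draws from the whole JVM heap.

```python
# Toy model (not Spark internals): the storage fraction bounds only
# cached blocks; collect() results are ordinary heap objects.
def storage_cap(heap_bytes, memory_fraction=0.6, safety_fraction=0.9):
    """Upper bound on cached-block memory; nothing else is bounded by it."""
    return int(heap_bytes * memory_fraction * safety_fraction)

heap = 4 * 1024**3                       # hypothetical 4 GB driver heap
print("cache cap (bytes):", storage_cap(heap))
# The limit on collect() is the heap itself (SPARK_DRIVER_MEMORY /
# spark.driver.memory), not the storage fraction.
```

So on the driver the reserved storage fraction mostly sits unused, but it does not prevent a collect() from filling the heap.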
