Thanks! Sounds like my rough understanding was roughly right :)
Definitely understand cached RDDs can add to the memory requirements.
Luckily, like you mentioned, you can configure Spark to flush that to disk
and bound its total size in memory via spark.storage.memoryFraction, so I
have a
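As a back-of-envelope check on that bound: under Spark's legacy (pre-1.6) memory model, the memory reserved for cached RDD blocks is roughly the executor heap times spark.storage.memoryFraction (default 0.6) times spark.storage.safetyFraction (default 0.9). The setting names are real Spark configuration keys; the helper function below is just an illustrative sketch, not Spark's actual accounting code.

```python
def storage_memory_bound(executor_heap_bytes,
                         memory_fraction=0.6,   # spark.storage.memoryFraction default
                         safety_fraction=0.9):  # spark.storage.safetyFraction default
    """Rough upper bound on bytes used for cached RDD blocks."""
    return executor_heap_bytes * memory_fraction * safety_fraction

# A 4 GiB executor heap caps cached RDDs at roughly 2.16 GiB by default.
heap = 4 * 1024**3
print(storage_memory_bound(heap))
```

Blocks beyond this bound are evicted (or spilled to disk, if the RDD's storage level allows it), which is what makes the cache's memory footprint bounded rather than proportional to the data set.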
I'm trying to determine how to bound my memory use in a job working with
more data than can simultaneously fit in RAM. From reading the tuning
guide, my impression is that Spark's memory usage is roughly the following:
(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory
use
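The A + B + C breakdown above can be sketched as a simple budget. The two fractions mirror real legacy Spark settings (spark.storage.memoryFraction, default 0.6, and spark.shuffle.memoryFraction, default 0.2); "transient" here is just whatever heap remains for task objects, deserialization buffers, and so on. This is an illustrative model of the accounting, not Spark's implementation.

```python
def memory_budget(heap_bytes, storage_fraction=0.6, shuffle_fraction=0.2):
    """Split an executor heap into the three rough categories."""
    storage = heap_bytes * storage_fraction      # (A) cached RDD blocks
    shuffle = heap_bytes * shuffle_fraction      # (B) shuffle aggregation buffers
    transient = heap_bytes - storage - shuffle   # (C) everything else
    return {"storage": storage, "shuffle": shuffle, "transient": transient}

budget = memory_budget(8 * 1024**3)
for category, size in budget.items():
    print(category, size / 1024**3, "GiB")
```

With the defaults, only about 20% of the heap is left for (C), which is one reason transient per-task memory (large deserialized records, big groupByKey values) is often what actually causes OOMs rather than the cache.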
Keith, do you mean bound as in (a) strictly control to some quantifiable
limit, or (b) try to minimize the amount used by each task?
If (a), then that is outside the scope of Spark's memory management, which
you should think of as an application-level (that is, above-JVM) mechanism.
In this scope,
A dash of both. I want to know enough that I can reason about, rather
than strictly control, the amount of memory Spark will use. If I have a
big data set, I want to understand how I can design the job so that Spark's
memory consumption stays below my available resources. Or alternatively,
if it's