Repository: spark
Updated Branches:
  refs/heads/master 30f3cfda1 -> 33a0ec937


[SPARK-11710] Document new memory management model

Author: Andrew Or <and...@databricks.com>

Closes #9676 from andrewor14/memory-management-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33a0ec93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33a0ec93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33a0ec93

Branch: refs/heads/master
Commit: 33a0ec93771ef5c3b388165b07cfab9014918d3b
Parents: 30f3cfd
Author: Andrew Or <and...@databricks.com>
Authored: Mon Nov 16 17:00:18 2015 -0800
Committer: Andrew Or <and...@databricks.com>
Committed: Mon Nov 16 17:00:18 2015 -0800

----------------------------------------------------------------------
 docs/configuration.md | 13 ++++++-----
 docs/tuning.md        | 54 ++++++++++++++++++++++++++++++----------------
 2 files changed, 44 insertions(+), 23 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/33a0ec93/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index d961f43..c496146 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -722,17 +722,20 @@ Apart from these, the following properties are also available, and may be useful
     Fraction of the heap space used for execution and storage. The lower this is, the more
     frequently spills and cached data eviction occur. The purpose of this config is to set
     aside memory for internal metadata, user data structures, and imprecise size estimation
-    in the case of sparse, unusually large records.
+    in the case of sparse, unusually large records. Leaving this at the default value is
+    recommended. For more detail, see <a href="tuning.html#memory-management-overview">
+    this description</a>.
   </td>
 </tr>
 <tr>
   <td><code>spark.memory.storageFraction</code></td>
   <td>0.5</td>
   <td>
-    The size of the storage region within the space set aside by
-    <code>spark.memory.fraction</code>. This region is not statically reserved, but dynamically
-    allocated as cache requests come in. Cached data may be evicted only if total storage exceeds
-    this region.
+    Amount of storage memory immune to eviction, expressed as a fraction of the size of the
+    region set aside by <code>spark.memory.fraction</code>. The higher this is, the less
+    working memory may be available to execution and tasks may spill to disk more often.
+    Leaving this at the default value is recommended. For more detail, see
+    <a href="tuning.html#memory-management-overview">this description</a>.
   </td>
 </tr>
 <tr>
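
For readers who do want to experiment with these two settings, a minimal sketch follows (plain
`SparkConf` usage; the values shown simply restate the documented defaults, and leaving them
alone remains the recommendation):

```scala
import org.apache.spark.SparkConf

// Illustrative only: set the unified memory region and its eviction-immune
// storage sub-region explicitly. These are the documented default values.
val conf = new SparkConf()
  .setAppName("memory-config-example")
  .set("spark.memory.fraction", "0.75")        // fraction of heap for execution + storage
  .set("spark.memory.storageFraction", "0.5")  // fraction of that region immune to eviction
```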

http://git-wip-us.apache.org/repos/asf/spark/blob/33a0ec93/docs/tuning.md
----------------------------------------------------------------------
diff --git a/docs/tuning.md b/docs/tuning.md
index 879340a..a8fe7a4 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -88,9 +88,39 @@ than the "raw" data inside their fields. This is due to several reasons:
   but also pointers (typically 8 bytes each) to the next object in the list.
 * Collections of primitive types often store them as "boxed" objects such as `java.lang.Integer`.
 
-This section will discuss how to determine the memory usage of your objects, and how to improve
-it -- either by changing your data structures, or by storing data in a serialized format.
-We will then cover tuning Spark's cache size and the Java garbage collector.
+This section will start with an overview of memory management in Spark, then discuss specific
+strategies you can take to make more efficient use of memory in your application. In
+particular, we will describe how to determine the memory usage of your objects, and how to
+improve it -- either by changing your data structures, or by storing data in a serialized
+format. We will then cover tuning Spark's cache size and the Java garbage collector.
+
+## Memory Management Overview
+
+Memory usage in Spark largely falls under one of two categories: execution and storage.
+Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations,
+while storage memory refers to that used for caching and propagating internal data across the
+cluster. In Spark, execution and storage share a unified region (M). When no execution memory is
+used, storage can acquire all the available memory and vice versa. Execution may evict storage
+if necessary, but only until total storage memory usage falls under a certain threshold (R).
+In other words, `R` describes a subregion within `M` where cached blocks are never evicted.
+Storage may not evict execution due to complexities in implementation.
+
+This design ensures several desirable properties. First, applications that do not use caching
+can use the entire space for execution, obviating unnecessary disk spills. Second, applications
+that do use caching can reserve a minimum storage space (R) where their data blocks are immune
+to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a
+variety of workloads without requiring user expertise in how memory is divided internally.
+
+Although there are two relevant configurations, the typical user should not need to adjust them,
+as the default values are applicable to most workloads:
+
+* `spark.memory.fraction` expresses the size of `M` as a fraction of the total JVM heap space
+(default 0.75). The rest of the space (25%) is reserved for user data structures, internal
+metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually
+large records.
+* `spark.memory.storageFraction` expresses the size of `R` as a fraction of `M` (default 0.5).
+`R` is the storage space within `M` where cached blocks are immune to being evicted by execution.
+
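
To make the two fractions above concrete, here is a rough back-of-the-envelope sketch in plain
Scala (not a Spark API). The 4 GB heap is a hypothetical value, and the sketch ignores any
memory Spark reserves internally for its own bookkeeping:

```scala
// Rough sizing sketch using the default values of the two settings above.
object MemoryRegionsSketch {
  def main(args: Array[String]): Unit = {
    val heapGb          = 4.0   // assumed executor heap size (hypothetical)
    val memoryFraction  = 0.75  // spark.memory.fraction (default)
    val storageFraction = 0.5   // spark.memory.storageFraction (default)

    val m = heapGb * memoryFraction   // unified execution + storage region: 3.0 GB
    val r = m * storageFraction       // storage sub-region immune to eviction: 1.5 GB

    println(f"M = $m%.1f GB, R = $r%.1f GB")
  }
}
```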
 
 ## Determining Memory Consumption
 
@@ -151,18 +181,6 @@ time spent GC. This can be done by adding `-verbose:gc -XX:+PrintGCDetails -XX:+
 each time a garbage collection occurs. Note these logs will be on your cluster's worker nodes (in the `stdout` files in
 their work directories), *not* on your driver program.
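
One standard way to wire such GC logging flags in is `spark.executor.extraJavaOptions`; here is
a minimal sketch, where the flag list is illustrative and deliberately includes only the flags
visible above:

```scala
import org.apache.spark.SparkConf

// Sketch: forward GC logging flags to executor JVMs via standard Spark configuration.
val conf = new SparkConf()
  .setAppName("gc-logging-example")
  .set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails")
```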
 
-**Cache Size Tuning**
-
-One important configuration parameter for GC is the amount of memory that should be used for caching RDDs.
-By default, Spark uses 60% of the configured executor memory (`spark.executor.memory`) to
-cache RDDs. This means that 40% of memory is available for any objects created during task execution.
-
-In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
-memory, lowering this value will help reduce the memory consumption. To change this to, say, 50%, you can call
-`conf.set("spark.storage.memoryFraction", "0.5")` on your SparkConf. Combined with the use of serialized caching,
-using a smaller cache should be sufficient to mitigate most of the garbage collection problems.
-In case you are interested in further tuning the Java GC, continue reading below.
-
 **Advanced GC Tuning**
 
 To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
@@ -183,9 +201,9 @@ temporary objects created during task execution. Some steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
   before a task completes, it means that there isn't enough memory available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching.
-  This can be done using the `spark.storage.memoryFraction` property. It is better to cache fewer objects than to slow
-  down task execution!
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+  memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
+  objects than to slow down task execution!
 
 * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden

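As a postscript to the OldGen bullet in the last hunk, a hedged sketch of that adjustment,
where 0.3 is an arbitrary illustrative value rather than a recommendation:

```scala
import org.apache.spark.SparkConf

// Sketch: shrink the eviction-immune cache region to give execution more headroom
// when the OldGen is filling up. 0.3 is an arbitrary illustrative value.
val conf = new SparkConf().set("spark.memory.storageFraction", "0.3")
```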

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
