Hi all,

I would like to validate my understanding of memory regions in Spark. Any
comments on my description below would be appreciated!

Execution is split into stages at wide dependencies between RDDs and at
actions such as save. All transformations with narrow dependencies before
such a wide dependency (or action) are pipelined. When Spark reads from
HDFS, input data is loaded into memory according to the partitioning used
in HDFS. Since Spark has three memory regions (general, shuffle and
storage), and this does not yet involve an explicit cache or a shuffle,
I assume the data goes into the general region. During the pipelined
execution of the narrow transformations it stays there, with the same
partitioning, until we reach a wide dependency (or an action). Spark then
acquires memory from the shuffle region, spilling to disk when not enough
memory is available. The result is passed back in an iterator (located in
the general region) and the shuffle memory is freed.
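To make the pipelining point concrete, here is how I picture it in plain
Python (this is only an analogy, not Spark code; all names below are made
up): each narrow transformation processes one record at a time within a
partition, so no intermediate collection is ever materialized.

```python
# Conceptual sketch: narrow transformations are pipelined per partition.
# None of these names are Spark API; they just mirror the idea.

def read_partition():
    # stand-in for reading one HDFS block / partition
    yield from range(5)

def map_step(records):
    for r in records:
        yield r * 2          # like rdd.map(_ * 2)

def filter_step(records):
    for r in records:
        if r > 2:            # like rdd.filter(_ > 2)
            yield r

# The chain processes one record at a time end to end, the way a single
# Spark task pipelines narrow dependencies without materializing
# intermediate results.
pipelined = filter_step(map_step(read_partition()))
print(list(pipelined))  # [4, 6, 8]
```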

Only when an RDD is explicitly cached does it move from the general region
into the storage region. This guarantees availability for future use, but
also frees up space in the general region.
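For reference, the settings I am talking about are the static memory
manager's fractions (assuming I have the parameter names right):

```
# spark-defaults.conf -- static memory manager, default fractions
spark.executor.memory          4g
spark.shuffle.memoryFraction   0.2
spark.storage.memoryFraction   0.6
```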

Question 1: Is this correct?
Question 2: How big is the general region?
Example: Suppose I tell Spark it has 4 GB, but my system actually has
16 GB. I see that the shuffle and storage fractions are defined (defaults:
20% and 60%), but the general region does not seem to be bounded by them.
Will Spark use:
Shuffle: 0.2 * 4 = 0.8 GB
Storage: 0.6 * 4 = 2.4 GB
General: (1 - (0.2 + 0.6)) * 4 = 0.8 GB
Or
Shuffle: at most 0.2 * 4 = 0.8 GB // acquired from a counter, not an
actually partitioned memory region
Storage: at most 0.6 * 4 = 2.4 GB
General: 16 - actualUsage(shuffle + storage), or 4 - actualUsage(shuffle + storage)
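In code, the two interpretations I have in mind look like this (plain
Python, just to pin down the arithmetic; the variable names are mine, not
Spark's):

```python
# Assumed settings: 4 GB executor memory, the default fractions.
executor_mem = 4.0   # GB given to Spark; the machine has 16 GB in total
shuffle_frac = 0.2   # spark.shuffle.memoryFraction (default)
storage_frac = 0.6   # spark.storage.memoryFraction (default)

# Interpretation 1: a fixed carve-up of the 4 GB.
shuffle_fixed = shuffle_frac * executor_mem                         # 0.8 GB
storage_fixed = storage_frac * executor_mem                         # 2.4 GB
general_fixed = (1 - (shuffle_frac + storage_frac)) * executor_mem  # ~0.8 GB

# Interpretation 2: the fractions are only caps tracked by counters,
# and the general region takes whatever is left over at runtime.
shuffle_cap = shuffle_frac * executor_mem   # at most 0.8 GB
storage_cap = storage_frac * executor_mem   # at most 2.4 GB

def general_leftover(shuffle_used, storage_used, total=executor_mem):
    # total would be 16.0 instead of 4.0 if the general region could
    # grow beyond what I told Spark -- which is exactly my question
    return total - (shuffle_used + storage_used)
```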



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-tp8916.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
