Hi all, I would like to validate my understanding of memory regions in Spark. Any comments on my description below would be appreciated!
Execution is split into stages based on wide dependencies between RDDs and on actions such as save. All transformations with narrow dependencies before such a wide dependency (or action) are pipelined. When Spark reads from HDFS, input data is loaded into memory according to the HDFS partitioning. Since Spark has three memory regions (general, shuffle and storage), and at this point there is neither an explicit cache nor a shuffle, I assume the data lives in the general region. During the pipelined execution of the narrow-dependency transformations it stays there, keeping the same partitioning, until we reach a wide dependency (or an action). At that point Spark acquires memory from the shuffle region and spills to disk when not enough memory is available. The result is handed back through an iterator (located in the general space) and the shuffle memory is freed. Only when an RDD is explicitly cached does it move from the general region into the storage region; this guarantees availability for future use and also frees up space in the general region.

Question 1: Is this correct?

Question 2: How big is the general region? Example: I tell Spark it has 4 GB, but my system actually has 16 GB. I see that the shuffle and storage fractions are defined (defaults: 20% and 60%), but the general area does not seem to be bounded by them. Will Spark use:

    Shuffle: 0.2 * 4 = 0.8 GB
    Storage: 0.6 * 4 = 2.4 GB
    General: (1 - (0.2 + 0.6)) * 4 = 0.8 GB

or

    Shuffle: at most 0.2 * 4 = 0.8 GB   (since memory is acquired against a counter, not carved into physically separate regions)
    Storage: at most 0.6 * 4 = 2.4 GB
    General: 16 - actualUsage(shuffle + storage), or 4 - actualUsage(shuffle + storage)?
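For Question 1, this is the kind of job I am reasoning about. A minimal sketch (paths and names are made up): the flatMap/map steps should be pipelined within one stage, and reduceByKey should be the wide dependency that triggers the shuffle:

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryRegionsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("memory-regions-sketch"))

        val lines  = sc.textFile("hdfs:///data/input")   // loaded per HDFS partitioning -> general region?
        val pairs  = lines.flatMap(_.split(" "))         // narrow dependency, pipelined
                          .map(word => (word, 1))        // narrow dependency, pipelined
        val counts = pairs.reduceByKey(_ + _)            // wide dependency: stage boundary, shuffle memory
        counts.cache()                                   // explicit cache: storage region
        counts.saveAsTextFile("hdfs:///data/output")     // action

        sc.stop()
      }
    }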
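For Question 2, this is how I am setting the parameters before creating the context. The fraction keys are the ones whose defaults I quoted above; setting them explicitly here is just for illustration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")           // what I "tell Spark" (the 4 GB in my example)
      .set("spark.shuffle.memoryFraction", "0.2")   // default: 20%
      .set("spark.storage.memoryFraction", "0.6")   // default: 60%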