Hey, I was talking about something more like:

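    // Assumes a Spark shell, where `sc` is the SparkContext.
    val size = 1024 * 1024   // total number of Ints (4 MB at 4 bytes each)
    val numSlices = 8
    // One Array[Int] per slice, so each partition holds a single primitive array.
    val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) }
    val rdd = sc.parallelize(arr, numSlices).cache()
    // sum() returns a Double; this first action also materializes the cached blocks.
    val size2 = rdd.map(_.length).sum()
    assert(size2 == size)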

If I do this, I see 8 blocks put into the MemoryStore, each with a size of
512.1 KB, which adds up to almost exactly 4 MB, as expected for 1024 * 1024
Ints at 4 bytes each.
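
As a rough check of those numbers, here is a minimal sketch of the
arithmetic in plain Scala (no Spark needed). The object and value names are
made up for illustration. It only counts the raw Int payload; the extra
0.1 KB per block reported by the MemoryStore is presumably small per-object
overhead, which this sketch does not model.

```scala
object SliceSizeEstimate {
  val size = 1024 * 1024   // total number of Ints, as in the snippet above
  val numSlices = 8
  val bytesPerInt = 4      // a JVM Int is 32 bits

  // Raw payload of one slice: (size / numSlices) Ints at 4 bytes each.
  val payloadBytesPerSlice: Long = (size / numSlices).toLong * bytesPerInt
  val payloadKBPerSlice: Double = payloadBytesPerSlice / 1024.0
  val totalPayloadMB: Double = payloadBytesPerSlice * numSlices / (1024.0 * 1024.0)

  def main(args: Array[String]): Unit = {
    println(s"payload per slice: $payloadKBPerSlice KB")
    println(s"total payload:     $totalPayloadMB MB")
  }
}
```

The payload works out to 524288 bytes (512 KB) per slice and 4 MB in total,
which is in line with the block sizes reported above.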

Regarding your other questions:
Non-cached RDDs are not written back to disk; their results are simply not
stored anywhere. If the results are needed again, the RDD will be
recomputed. I'm not sure I understand your distinction between "JVM" and
"Spark" memory -- both arrays and cached RDDs are stored in the JVM heap.

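One way to observe the recomputation described above is to count how often a
map function runs across two actions. This is only a sketch under some
assumptions: it needs Spark 2.0 or later on the classpath (for
`longAccumulator`), it runs in local mode, and `RecomputeSketch` and
`countEvaluations` are hypothetical names. Accumulator updates made inside
transformations can be over-counted if tasks are retried, so treat the
counts as indicative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecomputeSketch {
  // Runs two actions over a mapped RDD and returns how many times
  // the map function was invoked in total.
  def countEvaluations(sc: SparkContext, useCache: Boolean): Long = {
    val evaluations = sc.longAccumulator("evaluations")
    val mapped = sc.parallelize(1 to 1000, 4).map { x =>
      evaluations.add(1)
      x * 2
    }
    val rdd = if (useCache) mapped.cache() else mapped
    rdd.count() // first action: always computes the map
    rdd.count() // second action: recomputes unless the blocks are cached
    evaluations.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("recompute-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"uncached: ${countEvaluations(sc, useCache = false)}")
      println(s"cached:   ${countEvaluations(sc, useCache = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

In local mode, with no task retries and enough memory to hold the cached
blocks, the uncached count should be twice the cached one, since the second
action has to run the map again.
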
Shuffle operations are unique in that they write their intermediate output
to local disk immediately, in order to avoid overly expensive recomputation.
This shuffle data is always written to disk, whether or not the input
RDD(s) are cached, and the final output of the shuffle (the groupByKey in
your example) will *not* be cached in memory unless explicitly requested.
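
To illustrate the "unless explicitly requested" part, here is a small sketch
that checks the storage level of a groupByKey result before and after
calling cache() on it. It assumes Spark is on the classpath and runs in
local mode; `ShuffleCacheSketch` and `storageLevels` are hypothetical names.
It only shows that caching the input does not mark the shuffle output for
caching; it does not inspect the shuffle files on disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ShuffleCacheSketch {
  // Returns the storage level of the shuffle output before and after
  // explicitly requesting that it be cached.
  def storageLevels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    // The input RDD is cached, playing the role of RDD_0.
    val pairs = sc.parallelize(1 to 1000, 4).map(x => (x % 10, x)).cache()
    // The shuffle output, playing the role of RDD_1.
    val grouped = pairs.groupByKey()
    val before = grouped.getStorageLevel // not inherited from the input
    grouped.cache()                      // explicit request
    val after = grouped.getStorageLevel
    grouped.count()                      // an action is what actually fills the cache
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = storageLevels(sc)
      println(s"before cache(): $before")
      println(s"after cache():  $after")
    } finally {
      sc.stop()
    }
  }
}
```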



On Mon, Apr 14, 2014 at 8:48 PM, wxhsdp <wxh...@gmail.com> wrote:

> Thanks for your help, Davidson!
> I modified
> val a:RDD[Int] = sc.parallelize(array).cache()
> to keep "val a" an RDD of Int, but got the same result.
>
> Another question:
> JVM and Spark memory are located in different parts of system memory. The
> Spark code is executed in JVM memory, and allocations like val e = new
> Array[Int](2*size) /*8MB*/ use JVM memory. If not cached, generated RDDs are
> written back to disk; if cached, RDDs are copied to Spark memory for further
> use. Is that right?
>
> val RDD_1 = RDD_0.groupByKey{...}
> Shuffles separate stages. Can anyone tell me the memory/disk usage of the
> shuffle input RDD and shuffle output RDD, depending on whether RDD_0 and
> RDD_1 are cached or not?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/storage-MemoryStore-estimated-size-7-times-larger-than-real-tp4251p4256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
