Ah, I think I can see where your issue is coming from. In spark-shell, the master is "local[*]", which uses all the cores available on your machine. This distinction only matters here because the default number of slices created by sc.parallelize() is based on the number of cores.
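To make that concrete, here is a minimal plain-Scala sketch of how parallelize chops a sequence into numSlices contiguous, nearly equal-sized ranges. The slice function below is my own approximation of the internal slicing logic, not Spark's actual API:

```scala
// Approximation of how sc.parallelize(seq, numSlices) splits its input:
// numSlices contiguous ranges whose sizes differ by at most one element.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i * seq.length) / numSlices
    val end = ((i + 1) * seq.length) / numSlices
    seq.slice(start, end)
  }

// With 1 slice you get one big block; with 6 you get rdd_0_0 ... rdd_0_5.
println(slice(1 to 28, 1).map(_.length))  // Vector(28)
println(slice(1 to 28, 6).map(_.length))  // Vector(4, 5, 5, 4, 5, 5)
```

So a larger core count just means the same data is cached as more, smaller blocks.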
So when you run from sbt, you are probably using a SparkContext with a "local" master, which sets the number of cores to 1. That means you are effectively doing sc.parallelize(array, 1), while in the Spark shell you are doing sc.parallelize(array, 6) or so. The only difference between the two is that the array is broken into more parts in the latter case, so you store blocks rdd_0_0, rdd_0_1, ..., rdd_0_5 rather than a single (large) block. In both cases, though, I suspect the total size is about the same, around 28 MB.

In my case, where I have an RDD[Array[Int]], I have 8 partitions (a number I chose arbitrarily), and each one is 512 KB, so the total size is actually 4 MB. You could run the same test with numSlices = 1, and you would just get a single 4 MB block.

The reason our two solutions produced different total memory values is Java primitive boxing [1]. In your case, the RDD[Int] is converted into an Array[Any] right before being stored in memory, which makes it effectively an Array[java.lang.Integer] [2]. In my case, the actual values inside the RDD are primitive arrays, so their elements are never boxed. Spark still converts my RDD[Array[Int]] into an Array[Any], but an Array[Int] is itself already a single object, so there is no extra memory cost there.

[1] http://docs.oracle.com/javase/tutorial/java/data/autoboxing.html
[2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L90

On Tue, Apr 15, 2014 at 3:58 AM, wxhsdp <wxh...@gmail.com> wrote:
> sorry, davidosn, i don't catch the point. what's the essential difference
> between our codes?
> /*my code*/
> val array = new Array[Int](size)
> val a = sc.parallelize(array).cache() /*4MB*/
>
> /*your code*/
> val numSlices = 8
> val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) }
> val rdd = sc.parallelize(arr, numSlices).cache()
>
> i'm in local mode, with only one partition, it's just an RDD of one
> partition with the type RDD[Int]
> your RDD has 8 partitions with the type RDD[Array[Int]], does that matter?
> my question is why the memory usage is 7x in sbt, but right in spark shell?
>
> as to the following question, i made a mistake, sorry
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/storage-MemoryStore-estimated-size-7-times-larger-than-real-tp4251p4269.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
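P.S. You can see the boxing difference outside of Spark entirely. A minimal plain-Scala sketch (no SparkContext needed; the byte counts in the comments are typical JVM figures, not measurements):

```scala
val size = 1000

// Copying an RDD[Int] partition into an Array[Any] boxes every element into
// a java.lang.Integer: roughly 16 bytes of object per element plus a 4-8 byte
// reference, versus 4 bytes for a raw Int. That per-element overhead is where
// a ~7x inflation of the estimated size can come from.
val ints: Array[Int] = new Array[Int](size)
val boxed: Array[Any] = ints.map(x => x: Any)
assert(boxed(0).isInstanceOf[java.lang.Integer])

// An Array[Int] is already a single heap object (an AnyRef), so storing it in
// an Array[Any] adds just one reference; its elements stay unboxed.
val wrapped: Array[Any] = Array[Any](ints)
assert(wrapped(0).asInstanceOf[Array[Int]] eq ints)
```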