Hi,

I believe SizeOf.jar may calculate the wrong size for you. Spark has a utility called SizeEstimator, located in org.apache.spark.util.SizeEstimator, and someone has extracted it into a standalone library: https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala You can try it out in the Scala REPL.

The size of new Array[Int](43) is 192 bytes:

  12 bytes object header + 4 bytes length field + 43 * 4 = 172 bytes of data, aligned up to 192 bytes

The size of (1, new Array[Int](43)) is 240 bytes:

  Tuple2 object: 12 bytes object header + 4 bytes field _1 + 4 bytes field _2, aligned up to 24 bytes
  1 (boxed): java.lang.Number is 12 bytes, aligned to 16 bytes; java.lang.Integer adds 4 bytes for the int value, aligned up to 24 bytes
    (Integer extends Number. I thought Scala's Tuple2 would be specialized for Int, making this 4 bytes, but it seems not.)
  Array: 192 bytes
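For what it's worth, here is roughly how to check this in the Scala REPL. A minimal sketch, assuming the java-sizeof jar linked above is on the classpath and that its API mirrors Spark's estimate method (it is extracted from org.apache.spark.util.SizeEstimator, so the results should match); the values in the comments are what the breakdown above predicts:

  import com.madhukaraphatak.sizeof.SizeEstimator

  val arr = new Array[Int](43)
  // expect 192: 12-byte header + 4-byte length + 43 * 4 bytes of data, aligned up
  SizeEstimator.estimate(arr)

  val pair = (1, arr)
  // expect 240: 24 (Tuple2 shell) + 24 (boxed Integer key) + 192 (array)
  SizeEstimator.estimate(pair)

  // expect 24 for the boxed key on its own
  SizeEstimator.estimate(java.lang.Integer.valueOf(1))

If you only have a Spark assembly handy, org.apache.spark.util.SizeEstimator.estimate should report the same numbers, since it is the same algorithm.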
So, 24 + 24 + 192 = 240 bytes. This is my calculation based on the Spark SizeEstimator. However, I am not sure what an Integer occupies on a 64-bit JVM with compressed oops on. It should be 12 + 4 = 16 bytes, and if that is the case, SizeEstimator gives the wrong result. @Sean, what do you think?

-- 
Ye Xianjin

On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

> Hi folks,
> 
> My Spark cluster has 8 machines, each of which has 377GB of physical memory, so the total memory available to Spark is more than 2400GB. In my program, I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43 elements. The memory cost of this raw dataset is therefore [(1 + 43) * 1000000000 * 4] / (1024 * 1024 * 1024) = 164GB.
> 
> Since I have to use this dataset repeatedly, I cache it in memory. Some key parameter settings are:
> spark.storage.fraction=0.6
> spark.driver.memory=30GB
> spark.executor.memory=310GB
> 
> But it fails on a simple countByKey() with the error "java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?
> 
> The code of my program is as follows:
> 
> def main(args: Array[String]): Unit = {
>   val sc = new SparkContext(new SparkConf())
> 
>   val rdd = sc.parallelize(0 until 1000000000, 25600)
>     .map(i => (i, new Array[Int](43))).cache()
>   println("The number of keys is " + rdd.countByKey())
> 
>   // some other operations following here ...
> }
> 
> To figure out the issue, I evaluated the memory cost of the key-value pairs using SizeOf.jar. The code is as follows:
> 
> val arr = new Array[Int](43)
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)))
> 
> val tuple = (1, arr.clone)
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)))
> 
> The output is:
> 192.0b
> 992.0b
> 
> *Hard to believe, but it is true!! This result means that, to store a key-value pair, Tuple2 needs more than 5 times the memory of the simplest method with arrays. Even at 5+ times the memory, the size is less than 1000GB, which is still much less than the total memory of my cluster, i.e., 2400+GB. I really do not understand why this happened.*
> 
> BTW, if the number of pairs is 1 million, it works well. And if the arr contains only 1 integer, storing a pair needs around 10 times the memory.
> 
> So I have some questions:
> 1. Why does Spark choose such a memory-hungry data structure, Tuple2, for key-value pairs? Is there any better data structure for storing (key, value) pairs with less memory cost?
> 2. Given a dataset of size M, how many times M of memory does Spark in general need to handle it?
> 
> Best,
> Landmark