Hi, 

I believe SizeOf.jar may calculate the wrong size for you.
Spark has a utility called SizeEstimator, located at
org.apache.spark.util.SizeEstimator, and someone has extracted it into a standalone library:
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala
You can try it out in the Scala REPL.
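For example (a rough sketch only; this assumes the extracted library is on the
REPL classpath and exposes an estimate() method like Spark's own SizeEstimator,
so adjust the import if the published API differs):

import com.madhukaraphatak.sizeof.SizeEstimator

val arr = new Array[Int](43)
println(SizeEstimator.estimate(arr))            // ~192 bytes expected
println(SizeEstimator.estimate((1, arr.clone))) // ~240 bytes expected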
The size of Array[Int](43) is 192 bytes: a 12-byte object header + a 4-byte
length field + 43 * 4 = 172 bytes of data, i.e. 188 bytes, rounded up to 192 bytes.
And the size of (1, Array[Int](43)) is 240 bytes:
  Tuple2 object: 12-byte object header + 4-byte field _1 + 4-byte field _2 =>
rounded up to 24 bytes
  1 => java.lang.Number: 12 bytes, rounded to 16 bytes -> java.lang.Integer:
16 bytes + a 4-byte int => rounded up to 24 bytes (Integer extends Number. I
thought Scala's Tuple2 would be specialized for Int and this would be just 4
bytes, but it seems not.)
  Array => 192 bytes

So, 24 + 24 + 192 = 240 bytes.
This is my calculation based on Spark's SizeEstimator.

However, I am not sure what an Integer occupies on a 64-bit JVM with compressed
oops on. It should be 12 + 4 = 16 bytes, and if so, the SizeEstimator gives the
wrong result. @Sean, what do you think?
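One way to check (a sketch only; it assumes the org.openjdk.jol:jol-core
artifact is on the classpath): the JOL (Java Object Layout) tool can print the
actual header size, field offsets, padding and instance size of
java.lang.Integer on your JVM, e.g. in the REPL:

import org.openjdk.jol.info.ClassLayout

// Field-by-field layout of the class, including alignment padding
println(ClassLayout.parseClass(classOf[java.lang.Integer]).toPrintable)
// Layout of a concrete boxed instance
println(ClassLayout.parseInstance(Int.box(1)).toPrintable)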
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

> Hi folks,
> 
> My Spark cluster has 8 machines, each of which has 377GB of physical memory,
> so the total memory available to Spark is more than 2400GB. In my program, I
> have to deal with 1 billion (key, value) pairs, where the key is an integer and
> the value is an integer array with 43 elements. Therefore, the memory cost of
> this raw dataset is [(1+43) * 1000000000 * 4] / (1024 * 1024 * 1024) = 164GB.
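> (A quick way to reproduce this back-of-the-envelope figure in the Scala REPL;
> it simply restates the arithmetic above and ignores JVM object overhead:)
> 
> val bytesPerPair = (1 + 43) * 4L               // one Int key + 43 Int values, 4 bytes each
> val totalBytes = bytesPerPair * 1000000000L    // 1 billion pairs
> println(totalBytes / math.pow(1024, 3))        // ~163.9 (GB)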
> 
> Since I have to use this dataset repeatedly, I have to cache it in memory.
> Some key parameter settings are: 
> spark.storage.fraction=0.6
> spark.driver.memory=30GB
> spark.executor.memory=310GB.
> 
> But it fails when running a simple countByKey(), with the error message
> "java.lang.OutOfMemoryError: Java heap space...". Does this mean a Spark
> cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?
> 
> The code of my program is as follows:
> 
> def main(args: Array[String]): Unit = {
>   val sc = new SparkContext(new SparkConf())
> 
>   val rdd = sc.parallelize(0 until 1000000000, 25600)
>     .map(i => (i, new Array[Int](43)))
>     .cache()
>   println("The number of keys is " + rdd.countByKey())
> 
>   // some other operations follow here ...
> }
> 
> 
> 
> 
> To figure out the issue, I evaluated the memory cost of the key-value pairs
> using SizeOf.jar. The code is as follows:
> 
> val arr = new Array[Int](43);
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));
> 
> val tuple = (1, arr.clone);
> println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));
> 
> The output is:
> 192.0b
> 992.0b
> 
> 
> *Hard to believe, but it is true!! This result means that, to store a key-value
> pair, Tuple2 needs more than 5 times the memory of the simplest representation,
> a plain array. But even at 5+ times the memory, the total size is less than
> 1000GB, which is still much less than the total memory of my cluster, i.e.,
> 2400+GB. I really do not understand why this happens.*
> 
> BTW, if the number of pairs is 1 million, it works well. And if arr contains
> only 1 integer, storing a pair as a Tuple2 needs around 10 times the memory.
> 
> So I have some questions:
> 1. Why does Spark choose such a memory-hungry data structure, Tuple2, for
> key-value pairs? Is there any better data structure for storing (key, value)
> pairs with less memory cost?
> 2. Given a dataset of size M, how many times M of memory does Spark generally
> need to handle it?
> 
> 
> Best,
> Landmark
> 
> 
> 
> 
> 
> 
> 

