Re: An interesting and serious problem I encountered

2015-02-13 Thread Sean Owen
A number of comments:

310GB is probably too large for an executor. You probably want many
smaller executors per machine. But this is not your problem.
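For illustration only, a hedged sketch of what several moderate executors per
377GB machine might look like instead of one 310GB heap; the sizes are
illustrative assumptions, not tuning advice, and spark.executor.instances
applies to YARN (on standalone you would run multiple workers per node):

import org.apache.spark.SparkConf

// Hedged sketch: a handful of ~40GB executors per machine rather than one
// 310GB executor. All numbers below are assumptions for illustration.
val conf = new SparkConf()
  .setAppName("smaller-executors-sketch")
  .set("spark.executor.memory", "40g")     // e.g. ~7 such executors per 377GB machine
  .set("spark.executor.cores", "5")
  .set("spark.executor.instances", "56")   // 8 machines * 7 executors (assumption, YARN)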

You didn't say where the OutOfMemoryError occurred. Executor or driver?

Tuple2 is a Scala type, and a general type. It is appropriate for
general pairs. You're asking about optimizing for a primitive array,
yes, but of course Spark handles other types.

I don't quite understand your test result. An array doesn't change
size because it's referred to in a Tuple2. You are still dealing with
a primitive array.

There is no general answer to your question. Usually you have to
consider the overhead of Java references, which can matter
significantly, but there is no constant multiplier, of course. It's up
to you whether it's worth implementing more efficient data structures. Here,
however, you're already using just about the most efficient representation
of an array of integers.
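As one example of that kind of trade-off (a sketch, not something settled in
this thread), caching in serialized form usually shrinks the per-record JVM
overhead at the cost of extra CPU; the Kryo setting is a common companion but
is an assumption here, not something from the original post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hedged sketch: serialized, memory-only caching of the same pairs.
val conf = new SparkConf()
  .setAppName("serialized-cache-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(0 until 1000000000, 25600)
  .map(i => (i, new Array[Int](43)))
  .persist(StorageLevel.MEMORY_ONLY_SER)   // serialized storage instead of cache()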

I think you have plenty of memory in general, so the question is what
was throwing the memory error? I'd also confirm that the configuration
your executors actually used is what you expect to rule out config
problems.
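One quick way to do that check (a sketch, assuming `sc` is the application's
SparkContext; the Spark UI's Environment tab shows the same information):

// Print the memory-related settings the application actually resolved,
// to catch values that silently did not apply.
for (key <- Seq("spark.executor.memory", "spark.driver.memory", "spark.storage.memoryFraction")) {
  println(key + " = " + sc.getConf.getOption(key).getOrElse("<not set, using default>"))
}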




Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
Hi, 

I believe SizeOf.jar may be calculating the wrong size for you.
Spark has a utility called SizeEstimator, located in
org.apache.spark.util.SizeEstimator, and someone has extracted it out at
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala
You can try that out in the Scala REPL.
The size of Array[Int](43) is 192 bytes:
  12 bytes object header + 4 bytes length field + 43 * 4 = 172 bytes of data,
  rounded up to 176 bytes.

The size of (1, Array[Int](43)) is 240 bytes:
  Tuple2 object: 12 bytes object header + 4 bytes field _1 + 4 bytes field _2,
  rounded up to 24 bytes
  1, boxed as java.lang.Integer: the java.lang.Number part is 12 bytes, rounded
  to 16 bytes, plus 4 bytes for the int, rounded up to 24 bytes (Integer extends
  Number; I thought Scala's Tuple2 would specialize on Int so this would be 4
  bytes, but it seems not)
  Array: 192 bytes

So, 24 + 24 + 192 = 240 bytes.
This is my calculation based on Spark's SizeEstimator.

However, I am not sure how much a java.lang.Integer occupies on a 64-bit JVM
with compressedOops on. It should be 12 + 4 = 16 bytes, and if so, the
SizeEstimator gives the wrong result. @Sean, what do you think?
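For anyone who wants to reproduce the numbers, a hedged sketch using the
extracted SizeEstimator linked above (assuming it exposes the same
estimate(AnyRef): Long entry point as Spark's internal one):

import com.madhukaraphatak.sizeof.SizeEstimator

// Estimate the two shapes discussed above, plus a rough extrapolation to
// 1 billion pairs (back-of-envelope only; ignores RDD bookkeeping).
val arr  = new Array[Int](43)
val pair = (1, arr.clone)

val arrBytes  = SizeEstimator.estimate(arr)    // expected around 192 bytes
val pairBytes = SizeEstimator.estimate(pair)   // expected around 240 bytes

println("array: " + arrBytes + " bytes, pair: " + pairBytes + " bytes")
println("about " + (pairBytes * 1000000000L / (1024L * 1024 * 1024)) + " GB for 1e9 pairs")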
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)






Re: An interesting and serious problem I encountered

2015-02-13 Thread Landmark
Thanks for Ye Xianjin's suggestions.

The SizeOf.jar may indeed have some problems. I did a simple test as
follows. The code is:

val n = 1; // also tried 5, 10, 100, 1000

// arr1: n (Int, Array[Int](43)) pairs
val arr1 = new Array[(Int, Array[Int])](n);
for (i <- 0 until arr1.length) {
  arr1(i) = (i, new Array[Int](43));
}
println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr1)));

// arr2: n bare Array[Int](43) values, for comparison
val arr2 = new Array[Array[Int]](n);
for (i <- 0 until arr2.length) {
  arr2(i) = new Array[Int](43);
}
println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr2)));


I changed the value of n, and the results are:

n        deepSizeOf(arr1)   deepSizeOf(arr2)
1        1016.0b            216.0b
5        1.9140625Kb        1000.0b
10       3.0625Kb           1.9296875Kb
100      23.8046875Kb       19.15625Kb
1000     231.2265625Kb      191.421875Kb


As suggested by Ye Xianjin, I tried SizeEstimator
(https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala)
instead. The results (in bytes) are:

n        estimate(arr1)   estimate(arr2)
1        264              216
5        1240             1000
10       2456             1976
100      24416            19616
1000     227216           182576

It seems that SizeEstimator computes the memory correctly.
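A rough back-of-envelope extrapolation from the n=1000 numbers above (a sketch
only; it ignores partition and RDD bookkeeping, and whatever countByKey itself
allocates):

// Per-pair cost from the n=1000 SizeEstimator run, scaled to the
// 1 billion pairs in the original program.
val bytesPerPair = 227216.0 / 1000                        // ~227 bytes per (Int, Array[Int](43)) pair
val totalGB      = bytesPerPair * 1e9 / (1024.0 * 1024 * 1024)
println(f"roughly $totalGB%.0f GB just to hold the cached pairs")   // roughly 212 GB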



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637p21652.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




An interesting and serious problem I encountered

2015-02-12 Thread Landmark
Hi folks,

My Spark cluster has 8 machines, each of which has 377GB of physical memory,
so the total memory available to Spark is more than 2400GB. In my program, I
have to deal with 1 billion (key, value) pairs, where the key is an integer
and the value is an integer array with 43 elements. Therefore, the memory cost
of this raw dataset is [(1 + 43) * 10^9 * 4 bytes] / (1024 * 1024 * 1024) ≈ 164GB.

Since I have to use this dataset repeatedly, I have to cache it in memory.
Some key parameter settings are: 
spark.storage.fraction=0.6
spark.driver.memory=30GB
spark.executor.memory=310GB.
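For reference, a hedged sketch of how these settings might be expressed
programmatically; note that in Spark 1.x the storage fraction property is
spelled spark.storage.memoryFraction, so a key named spark.storage.fraction
would be silently ignored and is worth double-checking:

import org.apache.spark.SparkConf

// Sketch of the configuration described above (values as quoted in this thread).
val conf = new SparkConf()
  .setAppName("billion-pairs")
  .set("spark.storage.memoryFraction", "0.6")   // note: not "spark.storage.fraction"
  .set("spark.executor.memory", "310g")
// spark.driver.memory normally has to be set before the driver JVM starts,
// e.g. via spark-submit --driver-memory 30g, rather than programmatically here.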

But it failed on running a simple countByKey(), and the error message is
"java.lang.OutOfMemoryError: Java heap space". Does this mean a Spark
cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?

The code of my program is as follows:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf());

  // 1 billion (key, value) pairs: key = i, value = a 43-element Int array
  val rdd = sc.parallelize(0 until 1000000000, 25600)
    .map(i => (i, new Array[Int](43)))
    .cache();
  println("The number of keys is " + rdd.countByKey());

  // some other operations follow here ...
}




To figure out the issue, I measured the memory cost of a key-value pair
using SizeOf.jar. The code is as follows:

val arr = new Array[Int](43);
println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));

val tuple = (1, arr.clone);
println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));

The output is:
192.0b
992.0b


*Hard to believe, but it is true!! This result means that, to store a key-value
pair, Tuple2 needs more than 5 times the memory of the plain array. But even at
5+ times the memory, the total is less than 1000GB, which is still much less
than the total memory of my cluster, i.e., 2400+GB. I really do not understand
why this happens.*

BTW, if the number of pairs is 1 million, it works well. And if arr contains
only 1 integer, storing a pair in a Tuple2 needs around 10 times the memory.

So I have some questions:
1. Why does Spark choose such a poor data structure, Tuple2, for key-value
pairs? Is there a better data structure for storing (key, value) pairs
with less memory cost?
2. Given a dataset of size M, how many times M does Spark in general need
in memory to handle it?


Best,
Landmark




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org