Re: An interesting and serious problem I encountered
A number of comments:

310GB is probably too large for one executor; you generally want several smaller executors per machine. But that is not your problem here.

You didn't say where the OutOfMemoryError occurred. Executor or driver?

Tuple2 is a Scala type, and a general type. It is appropriate for general pairs. You're asking about optimizing for a primitive array, yes, but of course Spark has to handle other types too.

I don't quite understand your test result. An array doesn't change size because it's referred to by a Tuple2; you are still dealing with a primitive array.

There is no general answer to your question. You usually have to account for the overhead of Java references, which does matter significantly, but there is no constant multiplier, of course. It's up to you, if it matters, to implement more efficient data structures. Here, however, you're using just about the most efficient representation of an array of integers.

I think you have plenty of memory in general, so the question is what was throwing the memory error. I'd also confirm that the configuration your executors actually used is what you expect, to rule out configuration problems.

On Fri, Feb 13, 2015 at 6:26 AM, Landmark <fangyixiang...@gmail.com> wrote:

  Hi folks,

  My Spark cluster has 8 machines, each of which has 377GB of physical memory, so the total memory available to Spark is more than 2400GB. In my program I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43 elements. The memory cost of the raw dataset is therefore (1 + 43) * 10^9 * 4 bytes / 1024^3 ≈ 164GB. Since I have to use this dataset repeatedly, I have to cache it in memory.

  Some key parameter settings are: spark.storage.fraction=0.6, spark.driver.memory=30GB, spark.executor.memory=310GB.

  But it fails on a simple countByKey(), and the error message is "java.lang.OutOfMemoryError: Java heap space". Does this mean a Spark cluster with 2400+GB of memory cannot keep 164GB of raw data in memory?

  The code of my program is as follows:

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf())
        val rdd = sc.parallelize(0 until 1000000000, 25600)
          .map(i => (i, new Array[Int](43)))
          .cache()
        println("The number of keys is " + rdd.countByKey())
        // some other operations follow here ...
      }

  To figure out the issue, I evaluated the memory cost of the key-value pairs using SizeOf.jar:

      val arr = new Array[Int](43)
      println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)))
      val tuple = (1, arr.clone)
      println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)))

  The output is:

      192.0b
      992.0b

  Hard to believe, but it is true! This result means that, to store one key-value pair, Tuple2 needs more than 5 times the memory of the plain array. But even at 5+ times the memory, the total is under 1000GB, which is still much less than the total memory of my cluster (2400+GB). I really do not understand why this happens.

  BTW, if the number of pairs is 1 million, it works well. And if arr contains only 1 integer, storing a pair takes around 10 times the memory of the array alone.

  So I have some questions:

  1. Why does Spark choose such a poor data structure, Tuple2, for key-value pairs? Is there any better data structure for storing (key, value) pairs with less memory cost?
  2. Given a dataset of size M, how many times M does Spark in general need in memory to handle it?

  Best,
  Landmark

  --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
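One way to "implement more efficient data structures", as suggested above, is to avoid allocating a Tuple2 plus a separate 43-element array per pair, and instead pack all keys and all values into two contiguous primitive arrays per partition. This is a hypothetical sketch, not code from the thread; the object and method names are invented for illustration:

```scala
// Hypothetical sketch: n pairs stored in two flat primitive arrays instead of
// n Tuple2 objects each pointing at its own small array. keys(i) is the key of
// pair i; its 43-element value occupies values(i * width) until
// values((i + 1) * width).
object FlatPairs {
  val width = 43

  def build(n: Int): (Array[Int], Array[Int]) = {
    val keys = new Array[Int](n)
    val values = new Array[Int](n * width) // all values, back to back
    var i = 0
    while (i < n) {
      keys(i) = i
      values(i * width) = i * 2 // demo: fill the first slot of each value
      i += 1
    }
    (keys, values)
  }

  // Copy pair i's value back out as an ordinary Array[Int].
  def valueOf(values: Array[Int], i: Int): Array[Int] =
    values.slice(i * width, (i + 1) * width)
}
```

The trade-off is two object headers for n pairs instead of roughly 3n objects (tuple, boxed key, array per pair), at the price of a less convenient API and manual index arithmetic.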
Re: An interesting and serious problem I encountered
Hi,

I believe SizeOf.jar may calculate the wrong size for you. Spark has a utility called SizeEstimator, located in org.apache.spark.util.SizeEstimator, and someone has extracted it into a standalone project:
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala

You can try that out in the Scala REPL.

The size of new Array[Int](43) is 192 bytes: 12-byte object header + 4-byte length field + 43 * 4 = 172 bytes of data (rounded to 176), for 192 bytes in total.

And the size of (1, new Array[Int](43)) is 240 bytes:

  - Tuple2 object: 12-byte header + 4-byte field _1 + 4-byte field _2, rounded up to 24 bytes
  - the boxed key 1: java.lang.Number, 12 bytes rounded to 16, plus the 4-byte int added by java.lang.Integer, rounded to 24 bytes (Integer extends Number; I thought Scala's Tuple2 would be specialized for Int, which would make this 4, but it seems not)
  - the array: 192 bytes

So 24 + 24 + 192 = 240 bytes. This is my calculation based on Spark's SizeEstimator. However, I am not sure what an Integer occupies on a 64-bit JVM with compressed oops on. It should be 12 + 4 = 16 bytes, and if so, SizeEstimator gives the wrong result here. @Sean, what do you think?

--
Ye Xianjin
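The sizing arithmetic above can be written out directly. This sketch assumes, as the email does, a 64-bit JVM with compressed oops: 12-byte object headers, 4-byte references, and every object rounded up to a multiple of 8 bytes:

```scala
// Sketch of the per-object sizing arithmetic from the email above,
// assuming a 64-bit JVM with compressed oops.
object SizeMath {
  // JVM objects are padded to an 8-byte boundary.
  def align8(n: Int): Int = (n + 7) / 8 * 8

  // Array[Int](43): 12-byte header + 4-byte length + 43 * 4 bytes of data.
  val arraySize = align8(12 + 4 + 43 * 4) // 192 bytes

  // Tuple2 shell: 12-byte header + two 4-byte reference fields (_1, _2).
  val tupleShell = align8(12 + 4 + 4) // 24 bytes

  // Boxed key: a plain java.lang.Integer would be align8(12 + 4) = 16 bytes;
  // the email's SizeEstimator-based breakdown charges it 24 by walking the
  // Number superclass, which is exactly the doubt raised at the end.
  val boxedIntPerEstimator = 24

  val total = tupleShell + boxedIntPerEstimator + arraySize // 240 bytes
}
```

With the 16-byte Integer instead, the tuple would come to 232 bytes rather than 240, so the disagreement only shifts the estimate by 8 bytes per pair either way.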
Re: An interesting and serious problem I encountered
Thanks for Ye Xianjin's suggestions. SizeOf.jar may indeed have some problems. I did a simple test as follows. The code is:

    val n = 1 // also tried 5, 10, 100, 1000

    val arr1 = new Array[(Int, Array[Int])](n)
    for (i <- 0 until arr1.length) {
      arr1(i) = (i, new Array[Int](43))
    }
    println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr1)))

    val arr2 = new Array[Array[Int]](n)
    for (i <- 0 until arr2.length) {
      arr2(i) = new Array[Int](43)
    }
    println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr2)))

I changed the value of n; the SizeOf.jar results are:

    n       arr1 (tuples)     arr2 (plain arrays)
    1       1016.0b           216.0b
    5       1.9140625Kb       1000.0b
    10      3.0625Kb          1.9296875Kb
    100     23.8046875Kb      19.15625Kb
    1000    231.2265625Kb     191.421875Kb

As suggested by Ye Xianjin, I also tried SizeEstimator
(https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala). The results, in bytes, are:

    n       arr1 (tuples)     arr2 (plain arrays)
    1       264               216
    5       1240              1000
    10      2456              1976
    100     24416             19616
    1000    227216            182576

It seems that SizeEstimator computes the memory correctly.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637p21652.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
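Taking the SizeEstimator figures reported in this test at face value, differencing between two sizes of n removes the fixed overhead of the outer array and leaves the marginal cost per pair; scaling that to the 10^9 pairs of the original post gives a rough total. This is back-of-envelope arithmetic over the thread's own numbers, not a measurement:

```scala
// Marginal per-pair cost implied by the SizeEstimator figures above.
// Differencing n = 1000 against n = 100 cancels the outer array's overhead.
object PerPair {
  val tupled100 = 24416; val tupled1000 = 227216 // arr1: (Int, Array[Int])
  val plain100  = 19616; val plain1000  = 182576 // arr2: Array[Int] only

  val perTupledPair = (tupled1000 - tupled100) / 900.0 // ~225 bytes per pair
  val perPlainValue = (plain1000 - plain100) / 900.0   // ~181 bytes per value

  // Scaled to the 10^9 pairs in the original post:
  val totalGB = perTupledPair * 1e9 / (1L << 30).toDouble // roughly 210 GB
}
```

Even by this estimate, a billion cached pairs come to a few hundred GB, well under the cluster's 2400+GB, which supports the suggestion in the first reply to pin down where (driver or executor) the OutOfMemoryError was actually thrown.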