confused about memory usage in spark
I have a PairRDD of type <String, String> which I persist to S3 using the following code:

    JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());
    aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class);

    class ConvertToWritableTypes implements PairFunction<Tuple2<String, String>, Text, Text> {
      public Tuple2<Text, Text> call(Tuple2<String, String> record) {
        return new Tuple2<>(new Text(record._1), new Text(record._2));
      }
    }

When I look at the S3-reported size for one of the parts (say part-0), it indicates the size is 156MB. I then bring up a spark-shell, load this part-0, and cache it:

    scala> val keyPair = sc.sequenceFile[String,String]("s3n://somebucket/part-0").cache()

After executing an action on the above RDD to force the cache, I look at the storage (using the Application UI) and it shows that I'm using 297MB for this RDD (when it was only 156MB in S3). I get that there could be some differences between the serialized storage format and what is then used in memory, but I'm curious as to whether I'm missing something and/or should be doing things differently.

Thanks.

Darin.
Re: confused about memory usage in spark
You can enable RDD compression (spark.rdd.compress). You can also use the MEMORY_ONLY_SER storage level to reduce the RDD size in memory:

    sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER)

Thanks
Best Regards

On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote:
> I have a PairRDD of type <String, String> which I persist to S3 [...]
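Putting both suggestions together, a minimal sketch (assuming a running spark-shell with its SparkContext `sc`, and reusing the s3n path from the original message) would look like:

```scala
import org.apache.spark.storage.StorageLevel

// spark.rdd.compress must be set before the SparkContext is created,
// e.g. in spark-defaults.conf or via SparkConf:
//   spark.rdd.compress=true
// It compresses serialized RDD partitions, so it only has an effect
// together with a serialized storage level such as MEMORY_ONLY_SER.

// Store the RDD as serialized bytes rather than deserialized Java
// objects; slower to access, but much more compact in memory.
val keyPair = sc.sequenceFile[String, String]("s3n://somebucket/part-0")
  .persist(StorageLevel.MEMORY_ONLY_SER)

keyPair.count()  // force materialization, then check the Storage tab
```

Note that cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY), so switching to an explicit persist() call is the only change needed.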
Re: confused about memory usage in spark
One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk in S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor-of-~2 difference.

On Wed, Oct 22, 2014 at 3:21 PM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote:
> I have a PairRDD of type <String, String> which I persist to S3 [...]
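The effect is easy to see without Spark at all. A small sketch using plain JVM string APIs (the object and method names here are illustrative, not from the thread):

```scala
object CharSize {
  // Size of the string when encoded as UTF-8, as it would be on disk
  // (1 byte per character for ASCII/English text).
  def utf8Bytes(s: String): Int = s.getBytes("UTF-8").length

  // Approximate size of the character data on the JVM heap: Java chars
  // are UTF-16 code units, 2 bytes each (object headers, the backing
  // array, and caching bookkeeping add even more on top of this).
  def heapCharBytes(s: String): Int = s.length * 2

  def main(args: Array[String]): Unit = {
    val s = "confused about memory usage in spark"
    println(s"UTF-8 bytes: ${utf8Bytes(s)}, in-memory char bytes: ${heapCharBytes(s)}")
  }
}
```

For ASCII text the in-memory char data is exactly twice the UTF-8 size, which lines up with 156MB on S3 growing to roughly 297MB cached.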