confused about memory usage in spark

2014-10-22 Thread Darin McBeath
I have a PairRDD of type <String, String>, which I persist to S3 (using the
following code).
JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());
aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class);

class ConvertToWritableTypes implements PairFunction<Tuple2<String, String>, Text, Text> {
  public Tuple2<Text, Text> call(Tuple2<String, String> record) {
    return new Tuple2<Text, Text>(new Text(record._1), new Text(record._2));
  }
}
When I look at the size S3 reports for one of the parts (say, part-0), it
indicates 156MB.

I then bring up a spark-shell and load this part-0 and cache it. 
scala> val keyPair =
sc.sequenceFile[String, String]("s3n://somebucket/part-0").cache()

After executing an action on the above RDD to force the cache, I look at the
storage (using the Application UI) and it shows that I'm using 297MB for this
RDD (when it was only 156MB in S3).  I understand that there could be some
differences between the serialized storage format and what is then held in
memory, but I'm curious whether I'm missing something and/or should be doing
things differently.
Thanks.
Darin.

Re: confused about memory usage in spark

2014-10-22 Thread Akhil Das
You can enable RDD compression (spark.rdd.compress), and you can also use
MEMORY_ONLY_SER
(sc.sequenceFile[String, String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER))
to reduce the RDD's size in memory.
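
For illustration, a minimal spark-shell sketch of both suggestions (the path
and the --conf flag placement are assumptions carried over from the original
post, not tested output):

  // Launch the shell with RDD compression enabled (this only affects
  // serialized blocks, e.g. MEMORY_ONLY_SER):
  //   spark-shell --conf spark.rdd.compress=true

  import org.apache.spark.storage.StorageLevel

  // Cache the sequence file in serialized form instead of as deserialized Java objects
  val keyPair = sc.sequenceFile[String, String]("s3n://somebucket/part-0")
    .persist(StorageLevel.MEMORY_ONLY_SER)
  keyPair.count()  // force materialization, then compare sizes in the Storage tab of the UI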

Thanks
Best Regards


Re: confused about memory usage in spark

2014-10-22 Thread Sean Owen
One thing to remember is that Strings are composed of chars in Java,
which take 2 bytes each. The encoding of the text on disk on S3 is
probably something like UTF-8, which takes much closer to 1 byte per
character for English text. This might explain the factor of ~2
difference.
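
A quick back-of-the-envelope sketch of that point (an illustration of String
sizing only, not a measurement of Spark's actual per-record overhead):

  // A Java/Scala String stores its characters as 16-bit chars on the heap,
  // while UTF-8 on disk needs roughly 1 byte per character for English text.
  val s = "some mostly-ASCII record value"
  val onDisk = s.getBytes("UTF-8").length  // ~1 byte per char
  val inHeap = s.length * 2                // 2 bytes per char, before object/array overhead
  println(s"UTF-8: $onDisk bytes, in-memory char data: ~$inHeap bytes")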

