Re: confused about memory usage in spark

2014-10-22 Thread Akhil Das
You can enable RDD compression (spark.rdd.compress), and you can also
persist with MEMORY_ONLY_SER (
sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER)
) to reduce the RDD's size in memory.
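
For reference, a minimal spark-shell sketch combining both suggestions
(the bucket path is the example one from this thread; spark.rdd.compress
must be set before the SparkContext exists, e.g. via --conf when
launching the shell):

// launch with: spark-shell --conf spark.rdd.compress=true
import org.apache.spark.storage.StorageLevel

// Persist as serialized bytes instead of deserialized Java objects.
val keyPair = sc.sequenceFile[String, String]("s3n://somebucket/part-0")
  .persist(StorageLevel.MEMORY_ONLY_SER)

// Force materialization so the Storage tab reports the cached size.
keyPair.count()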

Thanks
Best Regards

On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:

 I have a PairRDD of type <String, String> which I persist to S3 (using the
 following code).

 JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new
 ConvertToWritableTypes());
 aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
 SequenceFileOutputFormat.class);

 class ConvertToWritableTypes implements PairFunction<Tuple2<String,
 String>, Text, Text> {
   public Tuple2<Text, Text> call(Tuple2<String, String> record) {
     return new Tuple2<Text, Text>(new Text(record._1), new Text(record._2));
   }
 }

 When I look at the S3-reported size for, say, one of the parts (part-0),
 it indicates the size is 156MB.

 I then bring up a spark-shell and load this part-0 and cache it.

 scala> val keyPair =
 sc.sequenceFile[String,String]("s3n://somebucket/part-0").cache()

 After executing an action on the above RDD to force the cache, I look at
 the storage (using the Application UI) and it shows that I'm using 297MB for
 this RDD (when it was only 156MB in S3).  I get that there could be some
 differences between the serialized storage format and what is then used in
 memory, but I'm curious as to whether I'm missing something and/or should
 be doing things differently.

 Thanks.

 Darin.



Re: confused about memory usage in spark

2014-10-22 Thread Sean Owen
One thing to remember is that Strings are composed of chars in Java,
which take 2 bytes each. The encoding of the text on disk on S3 is
probably something like UTF-8, which takes much closer to 1 byte per
character for English text. This might explain the factor of ~2
difference.
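
To put rough numbers on that, a small sketch (the sample string is
invented, and it ignores object and array header overhead, which pushes
the in-memory figure even higher):

object CharWidthDemo {
  def main(args: Array[String]): Unit = {
    val record   = "an ASCII-heavy sample record"  // hypothetical data
    val onDisk   = record.getBytes("UTF-8").length // ~1 byte per char for English text
    val inMemory = record.length * 2               // Java chars are UTF-16 code units, 2 bytes each
    println(s"UTF-8 bytes: $onDisk, in-memory char bytes: $inMemory")
  }
}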

