I have a JavaPairRDD<String, String> which I persist to S3 as a sequence file, using the following code:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());
aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class);

class ConvertToWritableTypes implements PairFunction<Tuple2<String, String>, Text, Text> {
  public Tuple2<Text, Text> call(Tuple2<String, String> record) {
    // Wrap each String in a Hadoop Text writable so it can be written to the sequence file
    return new Tuple2<>(new Text(record._1), new Text(record._2));
  }
}
When I look at the size S3 reports for one of the parts (say part-00000), it shows 156MB.

I then bring up a spark-shell, load this part-00000, and cache it:

scala> val keyPair = sc.sequenceFile[String, String]("s3n://somebucket/part-00000").cache()
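
To force the cache to materialize I run a simple action (count() here is just an example; any action would do):

scala> keyPair.count()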

After that action runs, I look at the Storage tab of the application UI, and it shows this RDD using 297MB in memory (when it was only 156MB in S3). I understand there can be some difference between the serialized on-disk format and the deserialized in-memory representation, but I'm curious whether I'm missing something and/or should be doing things differently.
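
For example, one thing I was considering trying is caching in serialized form instead of the default deserialized cache, on the theory that the in-memory size would then be much closer to the on-disk size (the bucket/path here is just a placeholder, same as above):

scala> import org.apache.spark.storage.StorageLevel
scala> val keyPair = sc.sequenceFile[String, String]("s3n://somebucket/part-00000").persist(StorageLevel.MEMORY_ONLY_SER)
scala> keyPair.count()  // action to materialize the serialized cache

I could also set spark.serializer to org.apache.spark.serializer.KryoSerializer, which I understand should shrink the serialized cache further. Does that sound like the right direction, or is the deserialized overhead just expected?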
Thanks.
Darin.
