One thing to remember is that Strings in Java are composed of chars, which
take 2 bytes each. The encoding of the text on disk in S3 is probably
something like UTF-8, which takes much closer to 1 byte per character for
English text. That might explain the factor of ~2 difference.
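A minimal sketch of that accounting for ASCII text (assuming the classic char[]-backed String of Java 8 and earlier; the sample sentence is just an illustration):

```java
import java.nio.charset.StandardCharsets;

public class StringSizeDemo {
    public static void main(String[] args) {
        String s = "The quick brown fox jumps over the lazy dog";
        // On disk as UTF-8: 1 byte per character for ASCII text
        int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
        // On the heap (pre-Java 9): 2 bytes per char, before any object overhead
        int charBytes = s.length() * 2;
        System.out.println("UTF-8 bytes: " + utf8Bytes); // 43
        System.out.println("char bytes:  " + charBytes); // 86
    }
}
```

On top of the raw char data, each String and its backing array carry per-object headers and fields, so the in-memory total can exceed even the 2x figure.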
On Wed, Oct 22, 2014 at 3:21 PM, Darin McBeath
<ddmcbe...@yahoo.com.invalid> wrote:
I have a PairRDD of type String,String which I persist to S3 (using the
following code).
JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new
ConvertToWritableTypes());
aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
SequenceFileOutputFormat.class);
class ConvertToWritableTypes implements PairFunction<Tuple2<String, String>,
Text, Text> {
  public Tuple2<Text, Text> call(Tuple2<String, String> record) {
    return new Tuple2<>(new Text(record._1), new Text(record._2));
  }
}
When I look at the S3-reported size for, say, one of the parts (part-0), it
indicates the size is 156MB.
I then bring up a spark-shell and load this part-0 and cache it.
scala> val keyPair =
sc.sequenceFile[String,String]("s3n://somebucket/part-0").cache()
After executing an action on the above RDD to force the cache, I look at
the storage (using the application UI) and it shows that I'm using 297MB for
this RDD (when it was only 156MB in S3). I get that there could be some
differences between the serialized storage format and what is then used in
memory, but I'm curious as to whether I'm missing something and/or should be
doing things differently.
Thanks.
Darin.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org