One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk on S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor of ~2 difference.
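As a rough illustration (plain Java, nothing Spark-specific), here is a minimal sketch of that ratio: for ASCII text, the UTF-8 bytes on disk are one per character, while the in-memory char data of a Java String is two bytes per character (ignoring object headers and any JVM-internal string optimizations). The string literal is just a made-up sample value.

```java
import java.nio.charset.StandardCharsets;

public class CharVsUtf8 {
    public static void main(String[] args) {
        // Hypothetical sample record value; any ASCII text behaves the same.
        String s = "a typical ASCII record value";

        // On-disk size of the UTF-8 encoding: 1 byte per ASCII character.
        int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;

        // In-memory char data: Java chars are UTF-16 code units, 2 bytes each
        // (object headers and JVM string-layout details ignored).
        int charBytes = s.length() * 2;

        System.out.println("UTF-8 bytes:  " + utf8Bytes);
        System.out.println("char bytes:   " + charBytes);
        System.out.println("ratio:        " + (double) charBytes / utf8Bytes);
    }
}
```

For pure ASCII the ratio is exactly 2; non-ASCII characters in the data would shrink it, and per-object overhead in the cached RDD would grow it, which is consistent with the ~2x observed below.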
On Wed, Oct 22, 2014 at 3:21 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
> I have a PairRDD of type <String, String> which I persist to S3 using the
> following code:
>
>   JavaPairRDD<Text, Text> aRDDWritable =
>       aRDD.mapToPair(new ConvertToWritableTypes());
>   aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
>       SequenceFileOutputFormat.class);
>
>   class ConvertToWritableTypes
>       implements PairFunction<Tuple2<String, String>, Text, Text> {
>     public Tuple2<Text, Text> call(Tuple2<String, String> record) {
>       return new Tuple2<>(new Text(record._1), new Text(record._2));
>     }
>   }
>
> When I look at the S3-reported size for one of the parts (part-00000), it
> indicates the size is 156MB.
>
> I then bring up a spark-shell, load this part-00000, and cache it:
>
>   scala> val keyPair =
>     sc.sequenceFile[String, String]("s3n://somebucket/part-00000").cache()
>
> After executing an action on the above RDD to force the cache, I look at
> the storage tab in the application UI, and it shows that I'm using 297MB
> for this RDD (when it was only 156MB in S3). I understand there could be
> some differences between the serialized storage format and what is then
> used in memory, but I'm curious whether I'm missing something and/or
> should be doing things differently.
>
> Thanks.
>
> Darin.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org