One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk on S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor of ~2 difference.
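As a rough illustration (plain Java, nothing Spark-specific), here is a minimal sketch of that ratio: for ASCII text, the UTF-8 bytes on disk are one per character, while the in-memory char data of a Java String is two bytes per character (ignoring object headers and any JVM-internal string optimizations). The string literal is just a made-up sample value.

```java
import java.nio.charset.StandardCharsets;

public class CharVsUtf8 {
    public static void main(String[] args) {
        // Hypothetical sample record value; any ASCII text behaves the same.
        String s = "a typical ASCII record value";

        // On-disk size of the UTF-8 encoding: 1 byte per ASCII character.
        int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;

        // In-memory char data: Java chars are UTF-16 code units, 2 bytes each
        // (object headers and JVM string-layout details ignored).
        int charBytes = s.length() * 2;

        System.out.println("UTF-8 bytes:  " + utf8Bytes);
        System.out.println("char bytes:   " + charBytes);
        System.out.println("ratio:        " + (double) charBytes / utf8Bytes);
    }
}
```

For pure ASCII the ratio is exactly 2; non-ASCII characters in the data would shrink it, and per-object overhead in the cached RDD would grow it, which is consistent with the ~2x observed below.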
On Wed, Oct 22, 2014 at 3:21 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
> I have a PairRDD of type <String, String> which I persist to S3 using the
> following code:
>
>   JavaPairRDD<Text, Text> aRDDWritable =
>       aRDD.mapToPair(new ConvertToWritableTypes());
>   aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
>       SequenceFileOutputFormat.class);
>
>   class ConvertToWritableTypes
>       implements PairFunction<Tuple2<String, String>, Text, Text> {
>     public Tuple2<Text, Text> call(Tuple2<String, String> record) {
>       return new Tuple2<>(new Text(record._1), new Text(record._2));
>     }
>   }
>
> When I look at the S3-reported size for one of the parts (part-00000), it
> indicates the size is 156MB.
>
> I then bring up a spark-shell, load this part-00000, and cache it:
>
>   scala> val keyPair =
>     sc.sequenceFile[String, String]("s3n://somebucket/part-00000").cache()
>
> After executing an action on the above RDD to force the cache, I look at
> the storage tab in the application UI, and it shows that I'm using 297MB
> for this RDD (when it was only 156MB in S3). I understand there could be
> some differences between the serialized storage format and what is then
> used in memory, but I'm curious whether I'm missing something and/or
> should be doing things differently.
>
> Thanks.
>
> Darin.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org