confused about memory usage in spark

2014-10-22 Thread Darin McBeath
I have a PairRDD of type <String, String> which I persist to S3 (using the following code). JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes()); aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class, SequenceFileOutputFormat.class); class

Re: confused about memory usage in spark

2014-10-22 Thread Akhil Das
You can enable RDD compression (spark.rdd.compress). You can also use MEMORY_ONLY_SER (sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER)) to reduce the RDD size in memory. Thanks Best Regards On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath

Re: confused about memory usage in spark

2014-10-22 Thread Sean Owen
One thing to remember is that Strings are composed of chars in Java, which take 2 bytes each. The encoding of the text on disk on S3 is probably something like UTF-8, which takes much closer to 1 byte per character for English text. This might explain the factor of ~2 difference. On Wed, Oct 22,
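
The factor of ~2 described above can be checked with a few lines of plain Java. This is only a rough sketch: the class name StringFootprint, its two helper methods, and the sample text are made up for illustration, and the estimate counts character data only. It ignores object headers and other per-String overhead, and it assumes each char occupies 2 bytes, which newer JVMs with compact strings can avoid for Latin-1 text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Spark or of the original thread.
public class StringFootprint {

    // Bytes needed to store the text as UTF-8 (the likely on-disk encoding).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes of character data if every char is a 2-byte UTF-16 code unit.
    // Assumption: no compact-string optimization, no object overhead counted.
    static int utf16CharBytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String sample = "confused about memory usage in spark";
        int onDisk = utf8Bytes(sample);
        int inMemory = utf16CharBytes(sample);
        System.out.println("UTF-8 bytes:  " + onDisk);
        System.out.println("UTF-16 bytes: " + inMemory);
        System.out.println("ratio:        " + ((double) inMemory / onDisk));
    }
}
```

For pure ASCII text the ratio comes out at exactly 2. Text with many non-ASCII characters needs more than 1 byte per character in UTF-8, so the ratio shrinks.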