Have you tried caching the RDD in memory with serialization? How are you measuring the in-memory size?
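For example, a minimal spark-shell sketch of what I mean (assuming the usual `sc` in scope; the path, tab delimiter, and `pairs` name are placeholders for your data):

    import org.apache.spark.storage.StorageLevel

    // Placeholder load mirroring the RDD[(String, String)] described below.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map { line => val f = line.split("\t", 2); (f(0), f(1)) }

    // MEMORY_ONLY_SER stores each cached partition as a single serialized
    // byte buffer rather than a graph of Java objects, which usually cuts
    // the in-memory footprint substantially (more so with Kryo enabled).
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)
    pairs.count()  // materialize the cache, then check the Storage tab in the web UI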
In general I could imagine a 2-3x blowup being expected for small rows, but 10x does seem excessive. (A rough sizing sketch follows below the quote.)

On Fri, Feb 7, 2014 at 12:38 PM, Vipul Pandey <vipan...@gmail.com> wrote:
> Hi,
>
> I have a very small dataset that I need to join with some bigger ones
> later. The data is around 75M in size on disk. When I load it and
> transform it a little, it generates an RDD[(String, String)] where the
> first string is on average 25 chars long and the second one is about 10.
>
> Now:
> - When I save this new RDD as a file on HDFS, the output file size is
> around 70M.
> - When I cache it on disk with Java serialization, the size in memory is
> around 55M.
> - But when I cache this RDD in memory without any serialization, the
> cached size is 700M (??).
>
> Any idea why it is bloating up by a factor of 10? What's a typical
> factor (for size) by which uncompressed input is represented in
> memory/cache?
>
> Vipul
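To see where the overhead can go, here is a rough sketch using SizeEstimator, the utility Spark itself uses to size cached blocks (note it was only exposed as a public developer API in later Spark releases, so treat this as illustrative):

    import org.apache.spark.util.SizeEstimator

    // A representative row: a ~25-char key and a ~10-char value, as above.
    val row = ("a" * 25, "b" * 10)

    // On disk this row is roughly 36 bytes of UTF-8. In memory, each String
    // carries an object header plus a separate backing char[] (2 bytes per
    // char on current JVMs), and the Tuple2 adds another header and two
    // references, so the estimate lands several times higher.
    println(SizeEstimator.estimate(row))

That per-object overhead, multiplied across many small rows, is the main reason deserialized caching of small strings blows up so much relative to the on-disk size.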