Have you tried caching the RDD in memory with serialization? How are you
measuring the in-memory size?
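
For example, a minimal sketch of what I mean, assuming an existing SparkContext `sc` and your pair RDD (called `pairs` here):

    import org.apache.spark.storage.StorageLevel

    // Cache as serialized bytes rather than deserialized Java objects.
    // MEMORY_ONLY_SER keeps one serialized buffer per partition, which is
    // usually far more compact than MEMORY_ONLY for small records.
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)
    pairs.count()  // force materialization so the cache gets populated

    // The in-memory size then shows up on the Storage tab of the web UI,
    // or programmatically:
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory")
    }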

In general I'd expect a blowup of 2-3x for small rows, but 10x does seem
excessive.
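
To make that intuition concrete: a 25-char plus 10-char record is ~35 bytes of text on disk, but on the JVM heap each String pays for an object header, a reference to its char[] array, that array's own header and length field, and 2 bytes per character (UTF-16), plus the Tuple2 wrapper and its two references. That fixed overhead dominates when the payload is this small. A rough sketch, assuming a Spark build where org.apache.spark.util.SizeEstimator is accessible:

    import org.apache.spark.util.SizeEstimator

    // One record shaped like the dataset described below:
    // ~25 chars in the key, ~10 chars in the value.
    val record = ("a" * 25, "b" * 10)

    // On the order of 200 bytes on the heap, versus ~35 bytes as raw text.
    println(SizeEstimator.estimate(record))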


On Fri, Feb 7, 2014 at 12:38 PM, Vipul Pandey <vipan...@gmail.com> wrote:

> Hi,
>
> I have a very small dataset that I need to join with some bigger ones
> later. The data is around 75M in size on disk. When I load it and
> transform it a little, it generates an RDD[(String,String)] where the
> first string is on average 25 chars long and the second one is about 10.
>
> Now:
> - When I save this new RDD as a file on HDFS, the output file size is
> around 70M.
> - When I cache it on disk with Java serialization, the reported cache
> size is around 55M.
> - But when I cache this RDD in memory without any serialization, the
> cached size is 700M (??)
>
> Any idea why it is bloating up by a factor of 10? What's a typical factor
> (for size) by which uncompressed input grows when cached in memory?
>
> Vipul
>
>
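
For reference, the comparison described above could be reproduced with something along these lines (the input path, the tab-split transformation, and `sc` are placeholders, not the actual job):

    import org.apache.spark.storage.StorageLevel

    // Placeholder for the real job: load the ~75M input and map it to
    // (String, String) pairs.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map { line => val cols = line.split('\t'); (cols(0), cols(1)) }

    // 1) Plain text on HDFS (reported ~70M):
    pairs.map { case (k, v) => k + "\t" + v }
      .saveAsTextFile("hdfs:///path/to/output")

    // 2) Serialized on disk via Java serialization (reported ~55M):
    pairs.persist(StorageLevel.DISK_ONLY)
    pairs.count()

    // 3) Deserialized Java objects in memory (reported ~700M). A persisted
    // RDD can't change storage level, so cache a separate copy of the
    // lineage:
    val inMemory = pairs.map(identity)
    inMemory.persist(StorageLevel.MEMORY_ONLY)
    inMemory.count()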
