I think the email was filtered out. Resending.

---------- Forwarded message ----------
From: Aniket Mokashi <aniket...@gmail.com>
Date: Wed, Feb 20, 2013 at 1:18 PM
Subject: Replicated join: is there a setting to make this better?
To: "d...@pig.apache.org" <d...@pig.apache.org>
Hi devs,

I was looking into the size/record limits of fragment replicated join (map join) in Pig. To test this, I loaded a fragment of longs into an alias and joined it with another alias that had a few other columns. With a 50 MB map file I saw GC overhead errors on the mappers.

I took a heap dump of a mapper to find what was causing the GC overhead, and it turned out to be the memory footprint of the fragment itself.

[image: heap dump showing the fragment's memory footprint]

Note that the hashmap was only able to load about 1.8 million records:

[image: hashmap holding ~1.8 million records]

The reason is that every map record carries an overhead of about 1.5 KB. Most of that is retained heap, but it still has to be garbage collected.

[image: per-record overhead breakdown]

So, from the above, the heap required by a map join is roughly:

  1.5 KB * number of records + size of input (uncompressed DataByteArray)

(assuming the key is a long). In other words, to run a replicated join you need to satisfy the following:

  1.5 KB * number of records + size of input (uncompressed) < estimated free memory in the mapper (total heap - io.sort.mb - some minor constants, etc.)

Is that the right conclusion? Is there a setting or some other way to make this better?

Thanks,
Aniket

--
"...:::Aniket:::... Quetzalco@tl"
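For anyone who wants to reproduce the per-entry cost outside Pig, here is a minimal Java sketch (my own code, not Pig internals; the class name and the 20-byte payload are illustrative assumptions). It measures the fixed overhead of a plain HashMap<Long, byte[]>, which will come out well below 1.5 KB because Pig additionally wraps each record in a Tuple of DataByteArrays and keeps a list of tuples per key, but the measurement approach is the same:

import java.util.HashMap;
import java.util.Map;

// Rough sketch, not Pig code: estimate the per-entry heap cost of the kind
// of hashmap a fragment-replicated join builds on the map side.
public class MapJoinOverheadSketch {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();  // advisory only; good enough for a rough estimate
        long before = rt.totalMemory() - rt.freeMemory();

        int n = 1000000;
        Map<Long, byte[]> fragment = new HashMap<Long, byte[]>(n);
        for (long i = 0; i < n; i++) {
            // 20-byte payload stands in for a small record
            fragment.put(i, new byte[20]);
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.printf("~%d bytes per entry (20 of them are payload)%n",
                (after - before) / n);
        // Keep the map reachable until after the measurement.
        System.out.println(fragment.size());
    }
}

Running it with and without -XX:+UseCompressedOops shows how much of the per-entry cost is plain object-header and pointer overhead; whatever remains of the ~1.5 KB seen in the heap dump above would have to come from Pig's Tuple/DataByteArray wrappers and the per-key tuple lists.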