I think the email was filtered out. Resending.

---------- Forwarded message ----------
From: Aniket Mokashi <aniket...@gmail.com>
Date: Wed, Feb 20, 2013 at 1:18 PM
Subject: Replicated join: is there a setting to make this better?
To: "d...@pig.apache.org" <d...@pig.apache.org>


Hi devs,

I was looking into the size/record limitations of fragment-replicated join
(map join) in Pig. To test them, I loaded a map (a.k.a. fragment) of longs
into one alias and joined it with another alias that had a few other columns.
With a map file of 50 MB, I saw GC overhead errors on the mappers. I took a
heap dump of a mapper to see what was causing the GC overhead and found that
the memory footprint of the fragment itself was high.
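For reference, the test script was along these lines (the paths and the
schema of the big relation here are illustrative, not the exact inputs):

    -- illustrative paths/schema, not the exact test inputs
    small = LOAD 'small_input' AS (key:long);  -- the 50 MB fragment of longs
    big = LOAD 'big_input' AS (key:long, c1:chararray, c2:int);
    -- 'replicated' builds an in-memory hashmap from the last relation (small)
    j = JOIN big BY key, small BY key USING 'replicated';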

[image: Inline image 1]

Note that the hashmap was only able to load about 1.8 million records:
[image: Inline image 2]
The reason is that every map record carries an overhead of about 1.5 KB. Most
of it is retained heap, but it still needs to be garbage collected.
[image: Inline image 3]

So, it turns out:

Heap required by a map join = 1.5 KB * number of records + size of input
(uncompressed DataByteArray), assuming the key is a long.

So, to run a replicated join, you need to satisfy the following criterion:

1.5 KB * number of records + size of input (uncompressed) < estimated free
memory in the mapper (total heap - io.sort.mb - some small constant, etc.)
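To put numbers on that (the ~8 bytes/record figure is an assumption, since
the keys are longs): a 50 MB fragment of longs holds roughly 6.5 million
records, so loading the whole thing would need about

    1.5 KB * 6.5M records + 50 MB ≈ 9.8 GB

of free heap, far beyond a typical mapper heap, which is consistent with
hitting GC overhead errors after only ~1.8 million records.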

Is that the right conclusion? Is there a setting or way to make this better?

Thanks,

Aniket

-- 
"...:::Aniket:::... Quetzalco@tl"
