[ https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402722#comment-13402722 ]
Thejas M Nair commented on PIG-2774:
------------------------------------

If the left-side relation's tuples for a given join key value are serialized to disk, then for every value of the join key in the right relation, it will hit the disk. That will perform very poorly.
It looks like what we need is something like a merge-skew join, i.e., similar to skew join: sample the left side, and partition the splits for map tasks based on the sampled information.

> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>
>                 Key: PIG-2774
>                 URL: https://issues.apache.org/jira/browse/PIG-2774
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Aneesh Sharma
>
> A merge join can throw an OOM error if the number of duplicate left tuples is
> large as it accumulates all of them in memory. There are two solutions around
> this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side
> index.
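
For illustration only, here is a minimal Python sketch of the second option from the issue description (emit join output periodically and re-seek on the right-hand side). It is not Pig's implementation: it assumes both inputs are in-memory lists of (key, value) pairs sorted by key, uses a binary search as a stand-in for the right-hand side index, and the names `chunked_merge_join` and `max_buffered` are hypothetical.

```python
from bisect import bisect_left


def chunked_merge_join(left, right, max_buffered=2):
    """Inner-join two key-sorted lists of (key, value) pairs.

    At most `max_buffered` left tuples are held in memory at a time.
    For each buffered chunk, the right side is re-sought to the first
    tuple with the same key and scanned again.
    """
    # Assumption: a plain list of keys stands in for the right-side index.
    right_keys = [k for k, _ in right]
    i = 0
    while i < len(left):
        key = left[i][0]
        # Buffer a bounded chunk of left values that share this key.
        chunk = []
        while i < len(left) and left[i][0] == key and len(chunk) < max_buffered:
            chunk.append(left[i][1])
            i += 1
        # Re-seek to the first right tuple with this key.
        j = bisect_left(right_keys, key)
        while j < len(right) and right[j][0] == key:
            for left_value in chunk:
                yield (key, left_value, right[j][1])
            j += 1


if __name__ == "__main__":
    left = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
    right = [(1, "x"), (1, "y"), (3, "z")]
    for row in chunked_merge_join(left, right, max_buffered=2):
        print(row)
```

The sketch bounds memory by the chunk size, at the cost of rescanning the matching right-hand tuples once per chunk of duplicate left keys.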