[ https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403480#comment-13403480 ]
Thejas M Nair commented on PIG-2774:
------------------------------------

bq. we might have other operations queued up after the join

In the second approach, operations within the map task don't complicate things. But to handle a reduce after the merge join, we would need to introduce another map task that does a union of the merge-join results. For example, if the merge join is followed by a group+agg, then the following transformation of the plan would be needed:

Map(merge-join + group+agg ops) + Reduce(group+agg ops)
=>
Map(merge-join wave 1 + group+agg ops) + Map(merge-join wave 2 + group+agg ops) + Map(union of first 2 maps) + Reduce(group+agg ops)

This transformation can't happen dynamically: we can't decide to skip the reduce while in the map phase.

To handle this case dynamically, it looks like the first approach is the one that would actually work!
The user or a metadata system could possibly identify the skew problem and recommend using a 'skew-merge' join the next time the query is run on similar data.

> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>
>                 Key: PIG-2774
>                 URL: https://issues.apache.org/jira/browse/PIG-2774
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Aneesh Sharma
>
> A merge join can throw an OOM error if the number of duplicate left tuples is
> large as it accumulates all of them in memory. There are two solutions around
> this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side
> index.
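For illustration, the first solution in the issue description (serializing the accumulated left tuples to disk once they exceed a certain size) might look roughly like the sketch below. This is not Pig's implementation. `SpillableBuffer`, `merge_join`, and `max_in_memory` are hypothetical names, the inputs are assumed to be iterables of (key, value) pairs already sorted by key, and the limit is counted in tuples rather than bytes to keep the example short.

```python
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


class SpillableBuffer:
    """Hypothetical buffer: keeps at most max_in_memory items in RAM,
    writing full batches to a temporary file."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._memory = []
        self._spill = None
        self.spilled_count = 0

    def add(self, item):
        self._memory.append(item)
        if len(self._memory) >= self.max_in_memory:
            self._flush()

    def _flush(self):
        if self._spill is None:
            self._spill = tempfile.TemporaryFile()
        self._spill.seek(0, 2)  # always append at the end of the file
        for item in self._memory:
            pickle.dump(item, self._spill)
        self.spilled_count += len(self._memory)
        self._memory = []

    def __iter__(self):
        # Replay spilled items first (insertion order), then in-memory ones.
        if self._spill is not None:
            self._spill.seek(0)
            for _ in range(self.spilled_count):
                yield pickle.load(self._spill)
        yield from self._memory

    def close(self):
        if self._spill is not None:
            self._spill.close()


def merge_join(left, right, max_in_memory=1000):
    """Inner merge join of two key-sorted iterables of (key, value) pairs.
    Duplicate left tuples for a key go into a SpillableBuffer."""
    right_iter = iter(right)
    r = next(right_iter, None)
    for key, group in groupby(left, key=itemgetter(0)):
        # Advance the right side past keys smaller than the current left key.
        while r is not None and r[0] < key:
            r = next(right_iter, None)
        if r is None:
            return
        if r[0] != key:
            continue
        buf = SpillableBuffer(max_in_memory)
        try:
            for l in group:
                buf.add(l)
            # Emit one output row per (left duplicate, matching right tuple).
            while r is not None and r[0] == key:
                for l in buf:
                    yield (key, l[1], r[1])
                r = next(right_iter, None)
        finally:
            buf.close()
```

Because the spill happens inside the buffer, the join loop is the same whether or not a key is skewed, so nothing in the surrounding plan has to change. That matches the point above about the first approach being the one that can be applied dynamically.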