[
https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403480#comment-13403480
]
Thejas M Nair commented on PIG-2774:
------------------------------------
bq. we might have other operations queued up after the join
In 2nd approach, the operations within map task don't complicate things. But to
handle a reduce after the merge-join, we would need to introduce another map
task that does a union of merge-join results. For example, if the merge-join is
followed by a group+agg , then the follow transformation to plan would be
needed.
Map(Merge-join + group+agg ops) + Reduce(group+agg ops)
=> Map (merge-join wave 1 + group+agg ops) + Map (merge-join wave 2 +
group+agg opps) + Map(union of 1st 2 maps) + Reduce(group+agg ops)
This transformation can't happen dynamically - we can't decide to skip the
reduce while in the map phase.
To handle this case dynamically, looks like the first approach is one that
actually would work! The user or a metadata system possibly identify the skew
problem and recommend using a 'skew-merge' join next time query is run on
similar data.
> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>
> Key: PIG-2774
> URL: https://issues.apache.org/jira/browse/PIG-2774
> Project: Pig
> Issue Type: Bug
> Reporter: Aneesh Sharma
>
> A merge join can throw an OOM error if the number of duplicate left tuples is
> large as it accumulates all of them in memory. There are two solutions around
> this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side
> index.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira