[ 
https://issues.apache.org/jira/browse/PIG-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403480#comment-13403480
 ] 

Thejas M Nair commented on PIG-2774:
------------------------------------

bq. we might have other operations queued up after the join 

In 2nd approach, the operations within map task don't complicate things. But to 
handle a reduce after the merge-join, we would need to introduce another map 
task that does a union of merge-join results. For example, if the merge-join is 
followed by a group+agg , then the follow transformation to plan would be 
needed. 
Map(Merge-join + group+agg ops) + Reduce(group+agg ops)  
     => Map (merge-join wave 1 + group+agg ops)  + Map (merge-join wave 2 + 
group+agg opps) + Map(union of 1st 2 maps) + Reduce(group+agg ops)

This transformation can't happen dynamically - we can't decide to skip the 
reduce while in the map phase. 


To handle this case dynamically, looks like the first approach is one that 
actually would work! The user or a metadata system possibly identify the skew 
problem and recommend using a 'skew-merge' join next time query is run on 
similar data.


                
> Fix merge join to work with many duplicate left keys
> ----------------------------------------------------
>
>                 Key: PIG-2774
>                 URL: https://issues.apache.org/jira/browse/PIG-2774
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Aneesh Sharma
>
> A merge join can throw an OOM error if the number of duplicate left tuples is 
> large as it accumulates all of them in memory. There are two solutions around 
> this problem:
> 1. Serialize the accumulated tuples to disk if they exceed a certain size.
> 2. Spit out join output periodically, and re-seek on the right hand side 
> index.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to