[ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-845:
---------------------------------

    Attachment: merge-join-1.patch

Specification: http://wiki.apache.org/pig/PigMergeJoin

Updated patch with the following enhancements:

Performance:
a) Eliminated POForEach entirely for generating joined output tuples.
b) Creating the output tuple at its required size and then calling set instead 
of append.
c) Caching of the key, as suggested by Pradeep in a previous comment.
d) Creating a new ArrayList for holding buffered left tuples instead of 
clearing it, thus avoiding resizing of the array.
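
For readers unfamiliar with the technique, items (b) through (d) can be 
illustrated with a minimal standalone sketch. This is not the code in the 
patch: the class MergeJoinSketch, its methods, and the use of Object[] as a 
stand-in for a Pig tuple are all assumptions made for illustration. It assumes 
both inputs are already sorted ascending on the join key in column 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Object[] stands in for a Pig tuple.
class MergeJoinSketch {

    // Compares two join keys; assumes keys are mutually Comparable.
    @SuppressWarnings("unchecked")
    static int compareKeys(Object a, Object b) {
        return ((Comparable<Object>) a).compareTo(b);
    }

    // Builds the joined row at its final size and fills it by index
    // ("set") rather than growing it one field at a time ("append").
    static Object[] joinRow(Object[] left, Object[] right) {
        Object[] out = new Object[left.length + right.length];
        for (int i = 0; i < left.length; i++) {
            out[i] = left[i];
        }
        for (int i = 0; i < right.length; i++) {
            out[left.length + i] = right[i];
        }
        return out;
    }

    // Joins two inputs that are both sorted ascending on column 0.
    static List<Object[]> mergeJoin(List<Object[]> left, List<Object[]> right) {
        List<Object[]> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = compareKeys(left.get(i)[0], right.get(j)[0]);
            if (cmp < 0) {
                i++;            // left key is smaller: advance left
            } else if (cmp > 0) {
                j++;            // right key is smaller: advance right
            } else {
                // Cache the current key once instead of re-reading it.
                Object key = left.get(i)[0];
                // Fresh buffer for the left rows that share this key.
                List<Object[]> buffered = new ArrayList<>();
                while (i < left.size() && compareKeys(left.get(i)[0], key) == 0) {
                    buffered.add(left.get(i++));
                }
                // Pair every matching right row with every buffered left row.
                while (j < right.size() && compareKeys(right.get(j)[0], key) == 0) {
                    Object[] r = right.get(j++);
                    for (Object[] l : buffered) {
                        result.add(joinRow(l, r));
                    }
                }
            }
        }
        return result;
    }
}
```

Only the left rows for the current key are held in memory, which is why the 
join depends on both inputs being sorted on the join key.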
  
Functionality:
a) Added typecasting for index keys, thus making join work when schemas are 
supplied.
b) Refactored visit(LOJoin loj) in LogToPhyTranslationVisitor to avoid 
duplicate code.

Error Handling:
a) Better error handling at various places.
b) Added validateMergeJoin() in LogToPhyTranslationVisitor to throw an 
exception where Merge Join can't be used.
c) Added more tests.

Limitations:
Merge Join doesn't work when there are splits, streaming, or order-by in its 
predecessors, or when streaming is present in its successors.
Some of these are related to an issue outlined here: 
https://issues.apache.org/jira/browse/PIG-858 and require work in MRCompiler.
Currently we detect these conditions in validateMergeJoin() and fail at compile 
time.
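
As a rough illustration of this kind of compile-time check, here is a 
standalone sketch. It is not the validateMergeJoin() from the patch and does 
not use Pig's plan classes: the Op enum, the class name, and the flat lists of 
predecessor and successor operators are hypothetical simplifications.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Pig's actual operator model.
class MergeJoinValidationSketch {

    enum Op { LOAD, FILTER, FOREACH, SPLIT, STREAM, ORDER_BY, STORE }

    // Operators that rule out Merge Join when they precede the join.
    private static final Set<Op> BAD_PREDECESSORS =
            EnumSet.of(Op.SPLIT, Op.STREAM, Op.ORDER_BY);

    // Operators that rule out Merge Join when they follow the join.
    private static final Set<Op> BAD_SUCCESSORS = EnumSet.of(Op.STREAM);

    // Throws before any job runs if the plan contains an unsupported operator.
    static void validate(List<Op> predecessors, List<Op> successors) {
        for (Op op : predecessors) {
            if (BAD_PREDECESSORS.contains(op)) {
                throw new IllegalStateException(
                        "Merge Join cannot be used after " + op);
            }
        }
        for (Op op : successors) {
            if (BAD_SUCCESSORS.contains(op)) {
                throw new IllegalStateException(
                        "Merge Join cannot be used before " + op);
            }
        }
    }
}
```

Failing here, rather than at run time, means an unsupported script is rejected 
before any MapReduce job is launched.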

> PERFORMANCE: Merge Join
> -----------------------
>
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> This join would work if the data for both tables is sorted on the join key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
