[
https://issues.apache.org/jira/browse/PIG-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641460#action_12641460
]
Pradeep Kamath commented on PIG-350:
------------------------------------
Submitted patch with modification to Join4.patch
- The last input is now streamed instead of the first input. This is because
tuples now arrive in the package ordered by input index, from input 0 up to
input n; all tuples for a given index (input) arrive before any tuple for the
next index.
- The last input is now sent to foreach one tuple at a time, rather than
wrapped in a bag containing a single tuple.
- A new optimized foreach has been introduced that assumes its input has
already been "attached" (as will be the case in the join optimization)
- In POForEach all data structures (arrays) are now created once and reused
across getNext() calls rather than being re-created on each call. Likewise,
the data structures in POJoinPackage are created once up front and reused
(see the sketch after this list).
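Below is a minimal, self-contained Java sketch of the idea in the bullets
above: buffer inputs 0..n-2 into bags (tuples for a key arrive grouped by
input index) and stream the last input one tuple at a time into an
already-attached foreach, reusing pre-allocated structures across keys. The
names here (JoinPackageSketch, processKey, attachAndEmit) are illustrative
placeholders, not the actual POJoinPackage/POForEach API.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch only: stand-in types, not Pig classes.
class JoinPackageSketch {

    // Placeholder for a reduce-side tuple tagged with its input index.
    static class IndexedTuple {
        final int index;
        final Object value;
        IndexedTuple(int index, Object value) {
            this.index = index;
            this.value = value;
        }
    }

    // Stand-in for the optimized foreach that assumes its input is already attached.
    interface ForEachSketch {
        void attachAndEmit(List<List<Object>> bufferedBags, Object streamedTuple);
    }

    private final int numInputs;
    // Bags for inputs 0..numInputs-2, allocated once and cleared per key
    // rather than re-created on every call (the reuse described above).
    private final List<List<Object>> bags;

    JoinPackageSketch(int numInputs) {
        this.numInputs = numInputs;
        this.bags = new ArrayList<>(numInputs - 1);
        for (int i = 0; i < numInputs - 1; i++) {
            bags.add(new ArrayList<>());
        }
    }

    // Process all tuples for one key. Tuples arrive ordered by input index,
    // so everything for input i is seen before anything for input i + 1.
    void processKey(Iterator<IndexedTuple> tuplesForKey, ForEachSketch forEach) {
        for (List<Object> bag : bags) {
            bag.clear();                             // reuse, don't reallocate
        }
        while (tuplesForKey.hasNext()) {
            IndexedTuple t = tuplesForKey.next();
            if (t.index < numInputs - 1) {
                bags.get(t.index).add(t.value);      // buffer the first n-1 inputs
            } else {
                // Last input: not buffered. Hand the buffered bags plus this
                // single tuple to the foreach, which crosses them and emits output.
                forEach.attachAndEmit(bags, t.value);
            }
        }
    }
}
{code}

For a three-way join, processing one key would look like
new JoinPackageSketch(3).processKey(iter, fe), where fe crosses the two
buffered bags with each streamed tuple from the third input.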
> PERFORMANCE: Join optimization for pipeline rework
> --------------------------------------------------
>
> Key: PIG-350
> URL: https://issues.apache.org/jira/browse/PIG-350
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Pradeep Kamath
> Fix For: types_branch
>
> Attachments: join.patch, join2.patch, join3.patch, join4.patch,
> PIG-350.patch
>
>
> Currently, joins in pig are done as groupings where each input is grouped on
> the join key. In the reduce phase, records from each input are collected
> into a bag for each key, and then a cross product is done on these bags.
> This can be optimized by selecting one (hopefully the largest) input and
> streaming through it rather than placing its records in a bag. This will
> result in better memory usage, fewer spills to disk due to bag overflow, and
> better performance. Ideally, the system would intelligently select which
> input to stream, based on a histogram of value distributions for the keys.
> Pig does not have that kind of metadata, so for now it is best to always
> pick the same input (first or last) so that the user can select which input
> to stream.
> Similarly, order by in pig is done the same way, with the grouping keys
> being the ordering keys, and only one input. In this case pig still
> currently collects all the records for a key into a bag, and then flattens
> the bag. This is a total waste, and in some cases causes significant
> performance degradation. The same optimization listed above can address this
> case, where the last bag (in this case the only bag) is streamed rather than
> collected.
> To do these operations, a new POJoinPackage will be needed. It will replace
> POPackage and the following POForEach in these types of scripts, pulling the
> records from hadoop and streaming them into the pig pipeline. A visitor will
> need to be added in the map-reduce compilation phase that detects this case
> and combines the POPackage and POForEach into the new POJoinPackage.
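As a rough illustration of the compile-time rewrite described in the last
paragraph of the quoted description, the sketch below walks an operator list
and fuses a package operator that is immediately followed by a foreach into a
single join-package operator. The types (Op, PackageOp, ForEachOp,
JoinPackageOp) and the flat-list traversal are hypothetical simplifications;
the actual change would be a visitor over the physical plan, not this
stand-in.

{code:java}
import java.util.List;

// Sketch only: placeholder operator types, not Pig's physical operators.
class JoinPackageRewriteSketch {

    interface Op {}
    static class PackageOp implements Op {}
    static class ForEachOp implements Op {}

    static class JoinPackageOp implements Op {
        JoinPackageOp(PackageOp pkg, ForEachOp fe) {
            // A real rewrite would copy the key/index information from the
            // package and the inner plans from the foreach into the fused operator.
        }
    }

    // Replace every (package, foreach) pair with a single fused operator.
    static void rewrite(List<Op> reducePlan) {
        for (int i = 0; i + 1 < reducePlan.size(); i++) {
            if (reducePlan.get(i) instanceof PackageOp
                    && reducePlan.get(i + 1) instanceof ForEachOp) {
                JoinPackageOp fused = new JoinPackageOp(
                        (PackageOp) reducePlan.get(i),
                        (ForEachOp) reducePlan.get(i + 1));
                reducePlan.set(i, fused);    // keep the fused operator in place
                reducePlan.remove(i + 1);    // drop the now-redundant foreach
            }
        }
    }
}
{code}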