[jira] Commented: (PIG-350) PERFORMANCE: Join optimization for pipeline rework

Alan Gates (JIRA) Fri, 31 Oct 2008 13:24:07 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644420#action_12644420
 ]


Alan Gates commented on PIG-350:
--------------------------------

In general, looks very good.  One question.

In POForEach you added arrays isToBeFlattenedArray and planLeafOps, which look 
like copies of the lists isToBeFlattened and planLeaves respectively.  In the 
comment on isToBeFlattenedArray you say you're doing this for performance.  I 
think that's fine, but why do you still have the lists?  Having two copies of 
the data seems dangerous, as you're likely to get out of sync.

> PERFORMANCE: Join optimization for pipeline rework
> --------------------------------------------------
>
>                 Key: PIG-350
>                 URL: https://issues.apache.org/jira/browse/PIG-350
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: join.patch, join2.patch, join3.patch, join4.patch, 
> PIG-350-2.patch, PIG-350.patch
>
>
> Currently, joins in pig are done as groupings where each input is grouped on 
> the join key.  In the reduce phase, records from each input are collected 
> into a bag for each key, and then a cross product done on these bags.  This 
> can be optimized by selecting one (hopefully the largest) input and streaming 
> through it rather than placing the results in a bag.  This will result in 
> better memory usage, less spills to disk due to bag overflow, and better 
> performance.  Ideally, the system would intelligently select which input to 
> stream, based on a histogram of value distributions for the keys.  Pig does 
> not have that kind of metadata.  So for now it is best to always pick the 
> same input (first or last) so that the user can select which input to stream.
> Similarly, order by in pig is done in this same way, with the grouping keys 
> being the ordering keys, and only one input.  In this case pig still 
> currently collects all the records for a key into a bag, and then flattens 
> the bag.  This is a total waste, and in some cases causes significant 
> performance degradation.  The same optimization listed above can address this 
> case, where the last bag (in this case the only bag) is streamed rather than 
> collected.
> To do these operations, a new POJoinPackage will be needed.  It will replace 
> POPackage and the following POForEach in these types of scripts, handling 
> pulling the records from hadoop and streaming them into the pig pipeline.  A 
> visitor will need to be added in the map reduce compilation phase that 
> detects this case and combines the POPackage with POForeach into this new 
> POJoinPackage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-350) PERFORMANCE: Join optimization for pipeline rework

Reply via email to