[ 
https://issues.apache.org/jira/browse/PIG-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3835:
------------------------------------

    Attachment: PIG-3835-Initial-1.patch

Changes done:
   - Changed POLocalRearrangeTez to POValueOutputTez.
   - Wrote a UnionOptimizer:
        - Got union followed by store working using VertexGroup.
        - Also implemented the case where union followed by group by or join
          uses only 3 vertices instead of 4 via VertexGroup. That case is not
          working yet because ConcatenatedMergedKeyValuesInput does not behave
          as expected: it does not group values from the two inputs together;
          values are grouped only within each input. Will file a Tez bug for
          that.
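
To make the grouping problem concrete, here is a minimal plain-Java sketch
of the two behaviours. It does not use the Tez API: UnionGroupingSketch, kv,
groupPerInput and groupAcrossInputs are hypothetical names made up for this
illustration, and the sample data is invented. groupPerInput mimics the
behaviour described above (values grouped only within each input), while
groupAcrossInputs shows what a group by or join after the union would need.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java collections stand in for the Tez inputs.
final class UnionGroupingSketch {

    // Hypothetical helper to build one key/value pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, Integer.valueOf(value));
    }

    // Desired behaviour: values for a key are collected from all inputs.
    static Map<String, List<Integer>> groupAcrossInputs(
            List<List<Map.Entry<String, Integer>>> inputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> input : inputs) {
            for (Map.Entry<String, Integer> e : input) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        return grouped;
    }

    // Observed behaviour: each input is grouped on its own, so the same
    // key can show up once per input.
    static List<Map<String, List<Integer>>> groupPerInput(
            List<List<Map.Entry<String, Integer>>> inputs) {
        List<Map<String, List<Integer>>> result = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> input : inputs) {
            result.add(groupAcrossInputs(Collections.singletonList(input)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> inputA = Arrays.asList(kv("a", 1), kv("b", 2));
        List<Map.Entry<String, Integer>> inputB = Arrays.asList(kv("a", 3), kv("c", 4));
        List<List<Map.Entry<String, Integer>>> inputs = Arrays.asList(inputA, inputB);

        System.out.println(groupPerInput(inputs));     // [{a=[1], b=[2]}, {a=[3], c=[4]}]
        System.out.println(groupAcrossInputs(inputs)); // {a=[1, 3], b=[2], c=[4]}
    }
}
```

In the per-input result, key "a" appears twice, once per input, which is
the symptom that blocks the 3-vertex plan.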

Putting up the patch to get the approach vetted. Unit and e2e test runs and 
some minor cleanup are still pending. 

Reviewboard link - https://reviews.apache.org/r/19836/

> Improve performance of union
> ----------------------------
>
>                 Key: PIG-3835
>                 URL: https://issues.apache.org/jira/browse/PIG-3835
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: tez-branch
>            Reporter: Cheolsoo Park
>            Assignee: Rohini Palaniswamy
>             Fix For: tez-branch
>
>         Attachments: PIG-3835-Initial-1.patch
>
>
> PIG-3743 implements union using VertexGroup. But there are a couple of 
> optimizations that we can apply to it.
> * Union followed by store
> Union is a blocking operator meaning that a new vertex is added for its 
> succeeding operators. But if there is only one store in the succeeding 
> vertex, MROutput could be directly attached to VertexGroup instead of adding 
> a new vertex for it. Then, each union source vertex will write directly to 
> the destination, and therefore, it will be faster.
> * Replace POLocalRearrangeTez with POValueOutputTez
> Union uses POLocalRearrange by setting the whole record as key. But since 
> union only needs to partition records evenly across tasks, it might make more 
> sense to use POValueOutputTez with RR partitioner instead.
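
As a rough illustration of the second point in the description, assuming
"RR" means round-robin, the sketch below contrasts round-robin assignment
with hash partitioning on the whole record. It is plain Java, not the Pig or
Tez partitioner interface; RoundRobinSketch and its methods are hypothetical
names used only here.

```java
import java.util.Arrays;

// Illustration only: not the Pig/Tez partitioner API.
final class RoundRobinSketch {
    private final int numPartitions;
    private int next = 0;

    RoundRobinSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Round-robin: ignores the record content and cycles through partitions.
    int nextPartition() {
        int p = next;
        next = (next + 1) % numPartitions;
        return p;
    }

    // Hash on the whole record: identical records always go to one partition.
    static int hashPartition(Object record, int numPartitions) {
        return (record.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts how many of `records` records land in each partition.
    static int[] countRoundRobin(int records, int numPartitions) {
        RoundRobinSketch rr = new RoundRobinSketch(numPartitions);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < records; i++) {
            counts[rr.nextPartition()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 records over 3 partitions differ by at most one record.
        System.out.println(Arrays.toString(countRoundRobin(10, 3))); // [4, 3, 3]
    }
}
```

Round-robin needs no key and no hashing or comparison of the record, and the
partition sizes differ by at most one regardless of how many duplicate
records the inputs contain.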



--
This message was sent by Atlassian JIRA
(v6.2#6252)
