[ 
https://issues.apache.org/jira/browse/PIG-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3835:
-------------------------------

    Description: 
PIG-3743 implements union using VertexGroup, but a couple of optimizations 
can still be applied to it.

* Union followed by store
Union is a blocking operator, meaning that a new vertex is added for its 
succeeding operators. But if the succeeding vertex contains only a single 
store, MROutput could be attached directly to the VertexGroup instead of adding 
a new vertex for it. Each union source vertex would then write directly to the 
destination, which should be faster because the extra vertex is avoided.

* Replace POLocalRearrangeTez with POValueOutputTez
Union currently uses POLocalRearrange with the whole record set as the key. But 
since union only needs to distribute records evenly across tasks, it might make 
more sense to use POValueOutputTez with a round-robin (RR) partitioner instead.
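
To illustrate the reasoning behind the second item, here is a minimal, 
standalone Java sketch. It is a toy model only and does not use the Pig or Tez 
APIs: the class RoundRobinSketch and its methods are hypothetical names made up 
for this example. It compares how records spread across partitions when the 
whole record is hashed as the key versus when partitions are assigned in 
round-robin order.

```java
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Pig/Tez code): compares two ways of
// assigning records to downstream partitions.
public class RoundRobinSketch {

    // Whole record used as the key: the partition is derived from its hash,
    // so identical records always land in the same partition.
    public static int hashPartition(String record, int numPartitions) {
        return (record.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Number of records each partition receives under hash partitioning.
    public static int[] hashCounts(List<String> records, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String record : records) {
            counts[hashPartition(record, numPartitions)]++;
        }
        return counts;
    }

    // Number of records each partition receives under round-robin
    // assignment: the record content is ignored, only arrival order matters.
    public static int[] roundRobinCounts(List<String> records, int numPartitions) {
        int[] counts = new int[numPartitions];
        int next = 0;
        for (int i = 0; i < records.size(); i++) {
            counts[next]++;
            next = (next + 1) % numPartitions;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four identical records plus two distinct ones.
        List<String> records =
            Arrays.asList("a,1", "a,1", "a,1", "a,1", "b,2", "c,3");
        System.out.println("hash:        "
            + Arrays.toString(hashCounts(records, 3)));
        System.out.println("round-robin: "
            + Arrays.toString(roundRobinCounts(records, 3)));
    }
}
```

In this toy input, hashing sends all four identical records to one partition, 
while round-robin gives each of the three partitions two records. Round-robin 
also avoids computing a hash over the whole record, which is the kind of saving 
the proposal is after.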

  was:
PIG-3743 implements union using VertexGroup. Currently, union is a blocking 
operator meaning that a new vertex is added for its succeeding operators.

But if there is only one store in the succeeding vertex, MROutput could be 
directly attached to VertexGroup instead of adding a new vertex for it. Then, 
each union source vertex will write directly to the destination, and therefore, 
it will be faster.

        Summary: Improve performance of union (was: Optimize union followed by store)

> Improve performance of union
> ----------------------------
>
>                 Key: PIG-3835
>                 URL: https://issues.apache.org/jira/browse/PIG-3835
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: tez-branch
>            Reporter: Cheolsoo Park
>             Fix For: tez-branch
>
>
> PIG-3743 implements union using VertexGroup, but a couple of optimizations 
> can still be applied to it.
> * Union followed by store
> Union is a blocking operator, meaning that a new vertex is added for its 
> succeeding operators. But if the succeeding vertex contains only a single 
> store, MROutput could be attached directly to the VertexGroup instead of 
> adding a new vertex for it. Each union source vertex would then write 
> directly to the destination, which should be faster because the extra vertex 
> is avoided.
> * Replace POLocalRearrangeTez with POValueOutputTez
> Union currently uses POLocalRearrange with the whole record set as the key. 
> But since union only needs to distribute records evenly across tasks, it 
> might make more sense to use POValueOutputTez with a round-robin (RR) 
> partitioner instead.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
