[ 
https://issues.apache.org/jira/browse/PIG-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3856:
------------------------------------

    Attachment: PIG-3856-1.patch

The attached patch has the required changes mentioned in the description,
except for optimizing further with a Tez shared edge. Cheolsoo ran one of the
Netflix production scripts with the patch, but found that performance degraded
a bit. This is most likely due to writing the same replicated join table
multiple times to different outputs. So I have just uploaded the patch for now.
I will make the required changes once shared edges are available and then have
this committed.

I also realized that vertex caching is applicable to only one vertex. In this
case the same replicated join table could be cached for more than one vertex.
That is a candidate for another feature request in Tez.

> UnionOptimizer in Tez should optimize the case of replicated join
> -----------------------------------------------------------------
>
>                 Key: PIG-3856
>                 URL: https://issues.apache.org/jira/browse/PIG-3856
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Rohini Palaniswamy
>             Fix For: tez-branch
>
>         Attachments: PIG-3856-1.patch
>
>
> The replicated join input that was broadcast to the union vertex now needs to
> be broadcast to all the union predecessors. So we need to:
>     - Create edges from the replicated join input to all the union predecessors
>     - Change the replicated join input to write to multiple outputs.
> This can be further optimized by using a shared edge, which is yet to be
> implemented in Tez (TEZ-391).
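
The edge rewiring described above can be sketched on a toy plan. This is only
an illustration under stated assumptions: the plan is modeled as a plain
adjacency map, and the class name, method names, and vertex names ("load1",
"small", "union", ...) are hypothetical. None of them are Pig or Tez APIs, and
this is not the code in PIG-3856-1.patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy plan: vertex name -> names of the vertices it writes to.
// These are illustrative types only, not Pig or Tez classes.
class UnionReplicateSketch {

    // Adds an edge from -> to, creating the adjacency set on demand.
    static void addEdge(Map<String, Set<String>> plan, String from, String to) {
        plan.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    // Returns every vertex that has an edge into target.
    static List<String> predecessors(Map<String, Set<String>> plan, String target) {
        List<String> preds = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : plan.entrySet()) {
            if (e.getValue().contains(target)) {
                preds.add(e.getKey());
            }
        }
        return preds;
    }

    // Replaces the single edge replicated -> union with one edge from the
    // replicated input to each of the union's other predecessors, so the
    // replicated input ends up writing to multiple outputs.
    static void rewire(Map<String, Set<String>> plan, String replicated, String union) {
        List<String> preds = predecessors(plan, union);
        preds.remove(replicated); // the replicated input does not feed itself
        Set<String> outputs = plan.get(replicated);
        if (outputs == null || !outputs.remove(union)) {
            return; // no replicated -> union edge, nothing to rewire
        }
        outputs.addAll(preds);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> plan = new LinkedHashMap<>();
        addEdge(plan, "load1", "union");
        addEdge(plan, "load2", "union");
        addEdge(plan, "small", "union"); // replicated join input
        addEdge(plan, "union", "store");
        rewire(plan, "small", "union");
        System.out.println(plan.get("small"));
    }
}
```

In this model the replicated input writes once per union predecessor, which is
the duplicated write suspected above as the cause of the slowdown. A shared
edge would presumably let those outputs be written once.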



--
This message was sent by Atlassian JIRA
(v6.2#6252)
