[ 
https://issues.apache.org/jira/browse/PIG-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482215#comment-14482215
 ] 

Rohini Palaniswamy commented on PIG-4495:
-----------------------------------------

This patch basically removes the need for the ask in TEZ-1190 (Allow multiple 
edges between two vertexes). 

Changes done:
   1) Case of Self join/cross/cogroup
        - Multiple sub-plans of split write to the same output. 
POShuffleTezLoad is now capable of splitting the input into the correct bags 
based on the index in the key.
        - Do not allow cases like self-replicate/self-skewed join
   2) Case of union
        - Multiple sub-plans of split write to the same output and connect to 
the vertex group. If only sub-plans of the split are members of the union, then 
no vertex group is created and split is directly connected to union successors. 
        - For cases like nightly.conf Union_16.pig (moved to multiquery.conf 
now) which has multiple levels of union all from same split, even the vertex 
group created is removed and all the split sub-plans write directly to the 
successor.
   3) Other optimizations done
        - If there was a union followed by replicate join it was not optimized 
(PIG-3856). But if the union is within the same split we now broadcast the 
replicate join once to the split operator.
   4) Refactored code in UnionOptimizer into methods for easy readability.
   5) Not very related, but cleaned up TestMultiQueryLocal, as I had to search 
the logs for the exception logged while testing this patch instead of being 
able to look at the junit test failure stacktrace.
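
As a rough illustration of item 1 above (splitting a single shared input into 
bags based on the index in the key), here is a toy Python model. It is not the 
POShuffleTezLoad code; the record layout (key, index, value) and the function 
name split_into_bags are assumptions made only for this sketch.

```python
from collections import defaultdict


def split_into_bags(records, num_inputs):
    """Toy model: group (key, index, value) records by key and place each
    value in the bag that corresponds to its index.

    The (key, index, value) layout is an assumption for illustration only.
    """
    grouped = defaultdict(lambda: [[] for _ in range(num_inputs)])
    for key, index, value in records:
        grouped[key][index].append(value)
    return dict(grouped)


if __name__ == "__main__":
    # Both sides of a self join arrive on one input, tagged with index 0 or 1.
    records = [("a", 0, "left-1"), ("a", 1, "right-1"), ("b", 0, "left-2")]
    for key, bags in split_into_bags(records, 2).items():
        print(key, bags)
```

Each key ends up with one bag per logical input, even though all records 
arrived over a single edge.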

For one of the pig scripts, which had 72 vertices and vertex groups (due to 
lots of splits and unions), the new plan has just 18 vertices and vertex 
groups. The performance difference was small, improving by only a minute (from 
9 mins to 8 mins). I expected a bigger improvement and need to investigate. 
The job used far fewer resources (from 3561 tasks to 1011 tasks; MR uses 999 
tasks in total), which is good, and there was also a good difference in the 
file bytes read (2.7G less) and written (fG less) counters for the 64G input 
data, as more vertices were merged into the Split vertex. Writing to the same 
output without using a vertex group to combine it later also helps, as 
io.sort.mb is not divided between multiple outputs, making sorting faster.
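
To put the numbers above side by side, the relative reductions can be 
computed with a few lines of Python. The helper name percent_reduction is 
made up for this sketch; the before/after figures are the ones quoted in this 
comment.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


# Figures quoted in the comment: (before the patch, after the patch).
figures = {
    "vertices and vertex groups": (72, 18),
    "tasks": (3561, 1011),
    "runtime in minutes": (9, 8),
}

if __name__ == "__main__":
    for name, (before, after) in figures.items():
        pct = percent_reduction(before, after)
        print(f"{name}: {before} -> {after} ({pct:.1f}% lower)")
```

The plan size and task count drop by roughly three quarters, while the 
runtime drops by only about a ninth, which is the gap that still needs 
investigating.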

> Better multi-query planning in case of multiple edges
> -----------------------------------------------------
>
>                 Key: PIG-4495
>                 URL: https://issues.apache.org/jira/browse/PIG-4495
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: 0.14.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.15.0
>
>         Attachments: PIG-4495-1.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/TEZ-1190?focusedCommentId=14393033&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14393033
> People split the data, perform some foreach transformations/filter, union 
> them and then do some operation like group by or join with other data. In 
> those cases it creates multiple edges from the same Split, so we do not merge 
> them but write out the data to another dummy vertex to avoid multiple edges, 
> and this 
> adds overhead and affects performance. Vertex groups accept multiple edges 
> from same vertex. So if the multiple edges end up in a vertex group (and not 
> a vertex which is the case in self join) we can avoid the dummy vertex.
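
The rule in the description above (multiple edges are acceptable when they end 
in a vertex group, but not when they end in a plain vertex) can be sketched as 
a small check. This is a toy model, not Pig's Tez planner; the edge 
representation and the name pairs_needing_dummy_vertex are assumptions for 
illustration.

```python
from collections import Counter


def pairs_needing_dummy_vertex(edges, vertex_groups):
    """Return the (source, target) pairs that have more than one edge and
    whose target is a plain vertex rather than a vertex group.

    `edges` is a list of (source, target) names and `vertex_groups` is a set
    of target names that are vertex groups; both are hypothetical inputs.
    """
    counts = Counter(edges)
    return sorted(
        pair
        for pair, n in counts.items()
        if n > 1 and pair[1] not in vertex_groups
    )


if __name__ == "__main__":
    edges = [
        ("split", "union_group"), ("split", "union_group"),  # split -> union
        ("split", "join"), ("split", "join"),                # self join
        ("load", "join"),
    ]
    print(pairs_needing_dummy_vertex(edges, {"union_group"}))
```

Under the rule as described, only the self-join pair would still be flagged; 
the comment above explains how the patch handles that case as well by having 
the sub-plans write to the same output.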



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
