[ https://issues.apache.org/jira/browse/PIG-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy reopened PIG-3835:
-------------------------------------

> Improve performance of union
> ----------------------------
>
>                 Key: PIG-3835
>                 URL: https://issues.apache.org/jira/browse/PIG-3835
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: tez-branch
>            Reporter: Cheolsoo Park
>            Assignee: Rohini Palaniswamy
>             Fix For: tez-branch
>
>         Attachments: PIG-3835-2.patch, PIG-3835-3.patch,
>                      PIG-3835-Initial-1.patch, PIG-3835-addendum-1.patch
>
>
> PIG-3743 implements union using VertexGroup. But there are a couple of
> optimizations that we can apply to it.
>
> * Union followed by store
> Union is a blocking operator meaning that a new vertex is added for its
> succeeding operators. But if there is only one store in the succeeding
> vertex, MROutput could be directly attached to VertexGroup instead of adding
> a new vertex for it. Then, each union source vertex will write directly to
> the destination, and therefore, it will be faster.
>
> * Replace POLocalRearrangeTez with POValueOutputTez
> Union uses POLocalRearrange by setting the whole record as key. But since
> union only needs to partition records evenly across tasks, it might make more
> sense to use POValueOutputTez with RR partitioner instead.
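
To illustrate the second point, here is a minimal, self-contained sketch of why round-robin assignment may suit union better than keying on the whole record. It does not use Pig or Tez classes. The class name RoundRobinSketch and both methods are hypothetical, and "RR" is assumed to mean round-robin. Hashing the whole record sends identical records to the same partition, so repeated values can skew the load. Round-robin ignores the record contents and spreads records evenly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Pig or Tez code.
public class RoundRobinSketch {

    // Whole record used as the key: the partition is derived from the record's hash.
    public static int[] hashCounts(List<String> records, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String record : records) {
            counts[Math.floorMod(record.hashCode(), numPartitions)]++;
        }
        return counts;
    }

    // Round-robin: the partition depends only on arrival order, not on contents.
    public static int[] roundRobinCounts(List<String> records, int numPartitions) {
        int[] counts = new int[numPartitions];
        int next = 0;
        for (int i = 0; i < records.size(); i++) {
            counts[next]++;
            next = (next + 1) % numPartitions;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four duplicate records plus two distinct ones.
        List<String> records = Arrays.asList("a", "a", "a", "a", "b", "c");
        System.out.println(Arrays.toString(hashCounts(records, 3)));       // [1, 4, 1]
        System.out.println(Arrays.toString(roundRobinCounts(records, 3))); // [2, 2, 2]
    }
}
```

Round-robin also avoids computing a hash over every record, which is presumably part of the expected saving. The actual gain in Pig on Tez would need to be measured.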