Ming Ma created TEZ-3500:
----------------------------

             Summary: Support for multiple source vertices
                 Key: TEZ-3500
                 URL: https://issues.apache.org/jira/browse/TEZ-3500
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Ming Ma


For the fair_parallelism policy, where multiple destination tasks process the 
data from different source tasks of the same partition, the current 
implementation supports only one source vertex.
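
To make the single-vertex case concrete, here is a minimal sketch of how the 
source tasks of one large partition might be divided among several destination 
tasks. It is not the Tez implementation: the function name split_source_tasks, 
the max_bytes limit, and the greedy grouping of consecutive source tasks are 
all assumptions made for illustration.

```python
from typing import List


def split_source_tasks(sizes: List[int], max_bytes: int) -> List[List[int]]:
    """Group consecutive source task indices for one partition.

    sizes[i] is the number of bytes source task i produced for the partition.
    Each returned group is meant to be read by one destination task and holds
    at most max_bytes, unless a single source task alone exceeds that limit.
    Both the limit and the greedy strategy are hypothetical.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(sizes):
        # Close the current group when adding this task would overflow it.
        if current and current_bytes + size > max_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

With four source tasks of 40 bytes each and a limit of 100, the sketch yields 
two groups, so two destination tasks would share the partition.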

Support for multiple source vertices will enable skewed shuffle join, as 
mentioned in 
https://issues.apache.org/jira/browse/TEZ-3209?focusedCommentId=15385449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15385449.

Some rough ideas:

* For a large partition, if the volume comes mostly from one source vertex, 
apply fair routing on that primary source vertex and have the other vertices 
broadcast their output to the destination tasks processing that partition.
* If more than one source vertex contributes a large volume to the partition, 
we will need something like a cartesian product to join the sub-partition 
data from the different vertices.
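
The two ideas above can be read as a per-partition decision. The following is 
only a rough sketch of that decision, not a proposed Tez API: the names 
RoutingPlan and plan_partition, the strategy labels, and the per-vertex 
large_bytes threshold used to decide what counts as "large" are all 
hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class RoutingPlan(NamedTuple):
    strategy: str            # "regular", "fair_plus_broadcast" or "cartesian_product"
    primary: Optional[str]   # vertex that gets fair routing, if any


def plan_partition(volumes: Dict[str, int], large_bytes: int) -> RoutingPlan:
    """Pick a routing strategy for one partition.

    volumes maps a source vertex name to the bytes it produced for this
    partition. large_bytes is an assumed threshold above which a vertex's
    contribution is treated as large.
    """
    large = sorted(name for name, size in volumes.items() if size >= large_bytes)
    if not large:
        # No skew from any vertex: keep the regular shuffle routing.
        return RoutingPlan("regular", None)
    if len(large) == 1:
        # One dominant vertex: fair routing on it, broadcast from the others.
        return RoutingPlan("fair_plus_broadcast", large[0])
    # Several large vertices: join sub-partitions pairwise.
    return RoutingPlan("cartesian_product", None)
```

For example, plan_partition({"A": 500, "B": 5}, 100) selects fair routing on 
vertex A with B broadcast, while {"A": 500, "B": 300} falls into the 
cartesian product case.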




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
