Ming Ma created TEZ-3500: ---------------------------- Summary: Support for multiple source vertices Key: TEZ-3500 URL: https://issues.apache.org/jira/browse/TEZ-3500 Project: Apache Tez Issue Type: Sub-task Reporter: Ming Ma
For fair_parallelism policy where multiple destination tasks process the data from different source tasks of the same partition, current implementation only supports one source vertex. Support for multiple source vertices will enable skewed shuffle join as mentioned in https://issues.apache.org/jira/browse/TEZ-3209?focusedCommentId=15385449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15385449. Some rough ideas: * For a large partition, if the volume comes mostly from one source vertex, apply fair routing on that primary source vertex and have other vertices broadcast their output to those destination tasks processing that partition. * If the large partition volume is big from more than one source vertex, then we will need something like cartesian product to do the join of different sub-partition data from multiple vertices. -- This message was sent by Atlassian JIRA (v6.3.4#6332)