[ https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321182#comment-14321182 ]
Gopal V edited comment on TEZ-2104 at 2/14/15 2:47 AM:
-------------------------------------------------------

The cross-product edge has special affinity scheduling optimizations for one of the edges (the one which moves the most amount of data) to avoid re-merging the input streams during node/container co-located runs.

The built-in edge opens up a need for evolving affinity optimizations related to temporal locality in the scheduler.

was (Author: gopalv):
The cross-product edge has special affinity scheduling optimizations for one of the edges (the one which moves the most amount of data) to avoid re-merging the input streams during node/container co-located runs.

The built-in edge opens up a need to evolving affinity optimizations related to temporal locality in the scheduler.

> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>
>                 Key: TEZ-2104
>                 URL: https://issues.apache.org/jira/browse/TEZ-2104
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal V
>              Labels: gsoc, gsoc2015, hadoop, hive, java, tez
>
> Instead of producing duplicate data for the synthetic cross-product, to fit
> into partitions, the amount of net IO can be vastly reduced by a special
> purpose cross-product data movement edge.
> The Shuffle edge routes each partition's output to a single reducer, while
> the cross-product edge routes it into a matrix of reducers without actually
> duplicating the disk data.
> A partitioning scheme with 3 partitions on the lhs and rhs of a join
> operation can be routed into 9 reducers by performing a cross-product similar
> to
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single task cross-product model into a distributed cross product.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
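
For readers of the archive, the (1,2,3) x (a,b,c) routing described in the issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row-major numbering of reducers and every function name below are hypothetical and are not taken from Tez or from any patch on TEZ-2104.

```python
from itertools import product


def reducer_index(lhs_partition, rhs_partition, rhs_count):
    """Position of the (lhs, rhs) partition pair in the reducer matrix.

    Assumption: reducers are numbered row-major, one row per lhs partition.
    """
    return lhs_partition * rhs_count + rhs_partition


def reducers_for_lhs(lhs_partition, rhs_count):
    """All reducers that read one lhs partition (one row of the matrix)."""
    return [reducer_index(lhs_partition, r, rhs_count) for r in range(rhs_count)]


def reducers_for_rhs(rhs_partition, lhs_count, rhs_count):
    """All reducers that read one rhs partition (one column of the matrix)."""
    return [reducer_index(l, rhs_partition, rhs_count) for l in range(lhs_count)]


def cross_product(lhs, rhs):
    """Partition pairs in the same order as the reducer numbering above."""
    return list(product(lhs, rhs))


if __name__ == "__main__":
    pairs = cross_product([1, 2, 3], ["a", "b", "c"])
    print(len(pairs), pairs[:5])
    print(reducers_for_lhs(1, 3))     # reducers fed by the second lhs partition
    print(reducers_for_rhs(1, 3, 3))  # reducers fed by the second rhs partition
```

In this sketch each partition is written once and read by a whole row or column of reducers, which is the property the issue relies on to avoid duplicating the data on disk.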