[jira] [Comment Edited] (TEZ-2104) A CrossProductEdge which produces synthetic cross-product parallelism

Aurijoy Majumdar (JIRA) Thu, 26 Mar 2015 17:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382998#comment-14382998
 ]


Aurijoy Majumdar edited comment on TEZ-2104 at 3/27/15 12:08 AM:
-----------------------------------------------------------------

Had a few doubts regarding setting up the cross join matrix or cross product 
based matrix to route the Mapper messages to the destination reducer( 
partitioning still necessary? other wise no concept of destination reducer). 
Here we actually send it to all reducers int the matrix of reducers in order to 
eliminate writing to file (I/O redundancy). But this leads to redundancy in 
sending the same message to non destination reducers. I guess the way out of 
that is sending them over to specific reducers by setting up a scheduling layer 
in between which will have affinity based scheduling to route it to the 
specific destination reducers over the edge setup for that specific reducer. Is 
that correct? 

Or is the concept of partitioning unnecessary here hence we can route the 
messages to any number of reducers ( no concept of destination reducer) that is 
to say that just route them over to the reducers according to some affinty 
based metric where the affinity is calculated on the basis of locality and load 
of the reducer and finally according to this calculation the reducer 
destination is chosen? 

Thanks



was (Author: aurijoy):
Had a few doubts regarding setting up the cross join matrix or cross product 
based matrix to route the Mapper messages to the destination reducer( 
partitioning still necessary? other wise no concept of destination reducer). 
Here we actually send it to all reducers int the matrix of reducers in order to 
eliminate writing to file (I/O redundancy). But this leads to redundancy in 
sending the same message to non destination reducers. I guess the way out of 
that is sending them over to specific reducers by setting up a scheduling layer 
in between which will have affinity based scheduling to route it to the 
specific destination reducers over the edge setup for that specific reducer. Is 
that correct? 

Or is the concept of partitioning not necessary here hence we can get route the 
messages to any number of reducers and just route them over to the reducers 
according to some affinty based metric which is to say the affinity is 
calculated on the basis of locality and load of the reducers available. 

Thanks


> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>
>                 Key: TEZ-2104
>                 URL: https://issues.apache.org/jira/browse/TEZ-2104
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal V
>              Labels: gsoc, gsoc2015, hadoop, hive, java, tez
>
> Instead of producing duplicate data for the synthetic cross-product, to fit 
> into partitions, the amount of net IO can be vastly reduced by a special 
> purpose cross-product data movement edge.
> The Shuffle edge routes each partition's output to a single reducer, while 
> the cross-product edge routes it into a matrix of reducers without actually 
> duplicating the disk data.
> A partitioning scheme with 3 partitions on the lhs and rhs of a join 
> operation can be routed into 9 reducers by performing a cross-product similar 
> to 
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single task cross-product model into a distributed cross product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TEZ-2104) A CrossProductEdge which produces synthetic cross-product parallelism

Reply via email to