[ 
https://issues.apache.org/jira/browse/TEZ-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-3209:
--------------------------------
    Target Version/s: 0.9.0

> Support for fair custom data routing
> ------------------------------------
>
>                 Key: TEZ-3209
>                 URL: https://issues.apache.org/jira/browse/TEZ-3209
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: TEZ-3209.patch, Tez-based demuxer for highly skewed 
> category data.pdf
>
>
> This is based on offline discussion with [~gopalv], [~hitesh], 
> [~jrottinghuis] and [~lohit] w.r.t. the support for efficient processing of 
> highly skewed unordered partitioned mapper output. Our use case is to demux 
> highly skewed unordered category data partitioned by category name. Gopal and 
> Hitesh mentioned dynamically shuffled join scenario.
> One option we discussed is to leverage auto-parallelism feature with upfront 
> over-partitioning. That means possible overhead to support large number 
> partitions and unnecessary data movement as each reducer needs to get data 
> from all mappers. 
> Another alternative is to use custom {{DataMovementType}} which doesn't 
> require each reducer to fetch data from all mappers. In that way, a large 
> partition will be processed by several reducers, each of which will fetch 
> data from a portion of mappers.
> For example, say there are 100 mappers each of which has 10 partitions (P1, 
> ..., P10). Each mapper generates 100MB for its P10 and 1MB for each of its 
> (P1, ... P9). The default SCATTER_GATHER routing means the reducer for P10 
> has to process 10GB of input and becomes the bottleneck of the job. With the 
> fair custom data routing, The P10 belonging to the first 10 mappers will be 
> processed by one reducer with 1GB input data. The P10 belonging to the second 
> 10 mappers will be processed by another reducer, etc.
> For further optimization, we can allocate the reducer on the same nodes as 
> the mappers that it fetches data from.
> To support this, we need TEZ-3206 as well as customized data routing based on 
> {{VertexManagerPlugin}} and {{EdgeManagerPluginOnDemand}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to