[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490468#comment-14490468
 ] 

Bikas Saha edited comment on TEZ-145 at 4/10/15 10:30 PM:
----------------------------------------------------------

Taking a step back, let's figure out the scenarios for this.
Do we agree on the following?
1) Small jobs (small data) - this is not going to be helpful because we would
be adding the latency of an extra stage for small combiner benefits.
2) Large jobs (large data) with no data reduction in the map-side combiner -
this is not going to be helpful because the extra combiner will not reduce the
data further.
3) Large jobs (large data) with high data reduction in the map-side combiner -
this is going to be useful because the extra combiner will reduce the data
further and also decrease the number of data shards by aggregating small
outputs from the map tasks into a smaller number of combiner tasks.
4) Large jobs (large data) with a lot of filtering (no combiner) - this may be
useful, not because there is a combine operation, but because the combiner
tasks consolidate the large number of small outputs produced by the map tasks
into a smaller number of shards.

For 3 and 4, this may be useful if we can run the aggregating combiner tasks at
the rack level, since coalescing data within a rack is cheap compared to
pulling that data across racks in the final reducer. Even in these cases, given
better networks, we need to understand the trade-off between pulling the data
across racks to the final reducer and the cost of running the extra combiner
stage. Essentially, what is the killer scenario for this?
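
To make the trade-off concrete, here is a rough back-of-envelope sketch. It is
not taken from any patch on this issue. Every name and number in it is a
hypothetical assumption: map output is assumed to be spread evenly across
racks, the final reducer is assumed to sit on one of those racks, and the extra
stage is modelled as a single fixed latency. It covers only the bytes pulled
across racks (scenarios 1-3); the shard-count benefit in scenario 4 is not
modelled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are illustrative assumptions, not Tez configuration.
    map_output_bytes: float  # total bytes left after the map-side combiner
    racks: int               # racks holding map output, assumed evenly spread
    rack_reduction: float    # rack combiner output/input ratio (1.0 = no reduction)
    cross_rack_bw: float     # bytes/sec the final reducer can pull across racks
    stage_latency_s: float   # fixed cost of running the extra combiner stage


def cross_rack_seconds(job: Job, with_rack_combiner: bool) -> float:
    """Estimated time to get remote data to the final reducer."""
    # Data on the reducer's own rack never crosses a rack boundary.
    remote_fraction = (job.racks - 1) / job.racks
    ratio = job.rack_reduction if with_rack_combiner else 1.0
    transfer = job.map_output_bytes * remote_fraction * ratio / job.cross_rack_bw
    extra = job.stage_latency_s if with_rack_combiner else 0.0
    return transfer + extra


def combiner_helps(job: Job) -> bool:
    """True when the rack-level combiner lowers the estimated time."""
    return cross_rack_seconds(job, True) < cross_rack_seconds(job, False)


if __name__ == "__main__":
    scenarios = {
        "1) small job": Job(1e8, 4, 0.5, 1e9, 10.0),
        "2) large job, no reduction": Job(1e12, 4, 1.0, 1e9, 10.0),
        "3) large job, high reduction": Job(1e12, 4, 0.1, 1e9, 10.0),
    }
    for name, job in scenarios.items():
        print(name, combiner_helps(job))
```

With these made-up inputs the model only favours the extra stage in scenario 3,
and raising cross_rack_bw (a better network) shrinks that advantage, which is
the concern above.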


was (Author: bikassaha):
Taking a step back, let's figure out the scenarios for this.
Do we agree that for small jobs (small data) - this is not going to be helpful
because we will be adding an extra stage latency for small combiner benefits.
Large job (large data) with no data reduction in the map side combiner - this
is not going to be helpful because the extra combiner will not reduce the data
further.
Large job (large data) with high data reduction in the map side combiner - this
is going to be useful because the extra combiner will reduce the data further
and also decrease the number of data shards by aggregating small outputs from
the map tasks into a smaller number of combiner tasks.
Large job (large data) with a lot of filtering (no combiner) - this may be
useful, not because there is a combine operation, but to reduce the large
number of small outputs produced by the map tasks into a smaller number of
shards due to the combiner tasks.

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi Ozawa
>         Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit from running in multi-level trees,
> support for running a combiner in a non-local mode would allow performance
> gains from running the combiner at the rack level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
