[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490535#comment-14490535
 ] 

Gopal V commented on TEZ-145:
-----------------------------

Combiner have only made sense in case of 3/4. 

#3 is the true use case for this, because combiners are written only for those 
scenarios.

The old MRv2 model did a re-merge + combine() only if there were > 3 spills per 
task.

So tuning it to have no extra spills produced bad shuffle performance, which is 
what the Tez approach is not vulnerable to, since it is meant to combine 
host-local data (plus skip merges via pipelining).

The original scenario where I discovered a need for this was when I was trying 
to find the first/last transaction of sessions across a time window, to look 
for overlapped session-ids for the same user to detect multiple device usage or 
stolen tokens.

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi Ozawa
>         Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees, 
> support of being able to run a combiner in a non-local mode would allow 
> performance efficiencies to be gained by running a combiner at a rack-level. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to