[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490535#comment-14490535 ]
Gopal V commented on TEZ-145: ----------------------------- Combiner have only made sense in case of 3/4. #3 is the true use case for this, because combiners are written only for those scenarios. The old MRv2 model did a re-merge + combine() only if there were > 3 spills per task. So tuning it to have no extra spills produced bad shuffle performance, which is what the Tez approach is not vulnerable to, since it is meant to combine host-local data (plus skip merges via pipelining). The original scenario where I discovered a need for this was when I was trying to find the first/last transaction of sessions across a time window, to look for overlapped session-ids for the same user to detect multiple device usage or stolen tokens. > Support a combiner processor that can run non-local to map/reduce nodes > ----------------------------------------------------------------------- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug > Reporter: Hitesh Shah > Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)