[ https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901605#comment-14901605 ]
Matt McCline commented on HIVE-11794:
-------------------------------------

+1

> GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
> -------------------------------------------------------------------------
>
>                 Key: HIVE-11794
>                 URL: https://issues.apache.org/jira/browse/HIVE-11794
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-11794.01.patch, HIVE-11794.patch
>
>
> The code in Vectorizer is as such:
> {noformat}
>     boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
> {noformat}
> then, if it's reduce side:
> {noformat}
> if (isMergePartial) {
>   // Reduce Merge-Partial GROUP BY.
>   // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
>   // first (or root) operator for its reduce task.
>   ....
> } else {
>   // Reduce Hash GROUP BY or global aggregation.
>   ...
> {noformat}
> In fact, this logic is missing the COMPLETE mode. Both from the comment:
> {noformat}
>  COMPLETE: complete 1-phase aggregation: iterate, terminate
>  ...
>  HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
>  ...
>  PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
> {noformat}
> and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
> {noformat}
> Map Operator Tree:
> ...
>             Group By Operator
>               keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
>               mode: hash
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>               Reduce Output Operator
> ...
> feeding into
>       Reduce Operator Tree:
>         Group By Operator
>           keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
>           mode: mergepartial
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>           Group By Operator
>             aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
>             keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
>             mode: complete
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
> {noformat}
> it seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
> So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if. For map-side, it doesn't check mode at all as far as I can see.
> Not sure if additional code changes are necessary after that, it may just work.
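
To make the mode check in the description concrete, here is a minimal standalone sketch. It is not Hive code and not the attached patch: the local Mode enum only mirrors the constants of GroupByDesc.Mode, and the class GroupByModeSketch and both helper methods are hypothetical names introduced for illustration. The second helper assumes that the only change needed is to route COMPLETE to the else-path, which the description itself leaves open.

```java
/**
 * Standalone sketch of the mode check discussed in this issue.
 * Nothing here is Hive code: Mode is a local stand-in for GroupByDesc.Mode,
 * and both methods are hypothetical helpers written for illustration.
 */
public class GroupByModeSketch {

    /** Local stand-in that mirrors the constants of GroupByDesc.Mode. */
    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    /** The check quoted in the description: every mode except HASH counts as merge-partial. */
    static boolean isMergePartialAsQuoted(Mode mode) {
        return mode != Mode.HASH;
    }

    /**
     * One possible adjustment (an assumption, not the attached patch):
     * COMPLETE is also excluded, so it takes the else-path.
     */
    static boolean isMergePartialAssumedFix(Mode mode) {
        return mode != Mode.HASH && mode != Mode.COMPLETE;
    }

    public static void main(String[] args) {
        // Print how each mode is classified by the two variants.
        for (Mode mode : Mode.values()) {
            System.out.println(mode
                + " quoted=" + isMergePartialAsQuoted(mode)
                + " assumedFix=" + isMergePartialAssumedFix(mode));
        }
    }
}
```

Under the quoted check, COMPLETE lands on the merge-partial branch together with MERGEPARTIAL; under the assumed adjustment it lands on the else-path with HASH, while the remaining modes are classified the same way by both variants.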