[
https://issues.apache.org/jira/browse/HIVE-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925001#comment-13925001
]
Remus Rusanu commented on HIVE-6222:
------------------------------------
The 1.patch refactors the VectorGroupByOperator to delegate the algorithm used
to a nested processingMode object. Three processing modes are provided:
- global aggregate. This is the trivial mode when there are no keys. All
values are aggregated into a single row of aggregation buffers and the values
are emitted at operator closeOp()
- hash aggregate. This is all the previous VGBy operator logic, with the hash
table and the memory pressure flushes
- streaming aggregate. This mode aggregates intermediate values as keys change
in the input and flushes at each key value change. It relies on MR shuffle and
row-mode GBy reduce phase to merge the intermediate values. Due to the way
aggregators operate on batches, the logic of flushing is not strictly 'on new
key' but 'for all new keys in a batch, except last'. Identical keys in a batch
are not aggregated together unless they form a contiguous run.
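To illustrate the streaming mode, here is a minimal sketch of the idea. It is
not code from the patch: the class StreamingAggregateSketch and its methods are
hypothetical names, the only aggregate is a sum, and it walks rows one at a
time rather than using vectorized aggregators. It assumes the last key of a
batch stays open so that a run continuing into the next batch keeps
accumulating.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the VectorGroupByOperator code.
final class StreamingAggregateSketch {
    // Intermediate rows sent downstream, rendered as "key=partialSum".
    private final List<String> emitted = new ArrayList<>();

    private String currentKey;   // key of the run still being aggregated
    private long currentSum;     // partial sum for that run
    private boolean hasCurrent;  // false until the first row arrives

    // Process one batch given as parallel key/value arrays.
    void processBatch(String[] keys, long[] values) {
        for (int i = 0; i < keys.length; i++) {
            // A key change ends the current run: flush its partial result.
            if (hasCurrent && !keys[i].equals(currentKey)) {
                flush();
            }
            if (!hasCurrent) {
                currentKey = keys[i];
                currentSum = 0;
                hasCurrent = true;
            }
            currentSum += values[i];
        }
        // The last key of the batch is deliberately not flushed here,
        // because the same key may continue in the next batch.
    }

    // Called once at the end of input to emit the run still open.
    void close() {
        if (hasCurrent) {
            flush();
        }
    }

    List<String> emitted() {
        return emitted;
    }

    private void flush() {
        emitted.add(currentKey + "=" + currentSum);
        hasCurrent = false;
    }
}
```

Feeding the batches (a,a,b,a) and (a,c) through this sketch yields two separate
intermediate rows for key a, because its occurrences are split by b. Merging
those partial rows is the job left to the shuffle and the reduce-side GBy.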
This patch will conflict with HIVE-6518 because the relevant code is moved into
the new nested ProcessingModeHashAggregate class. Porting the fix is trivial. I
will rebase either this or HIVE-6518, depending on which gets committed first.
> Make Vector Group By operator abandon grouping if too many distinct keys
> ------------------------------------------------------------------------
>
> Key: HIVE-6222
> URL: https://issues.apache.org/jira/browse/HIVE-6222
> Project: Hive
> Issue Type: Sub-task
> Reporter: Remus Rusanu
> Assignee: Remus Rusanu
> Priority: Minor
> Attachments: HIVE-6222.1.patch
>
>
> Row mode GBY is becoming a pass-through if not enough aggregation occurs on
> the map side, relying on the shuffle+reduce side to do the work. Have VGBY do
> the same.