[ https://issues.apache.org/jira/browse/HIVE-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt McCline updated HIVE-7405:
-------------------------------
    Description:
Vectorize the basic case that does not have any count distinct aggregation.

Add a 4th processing mode in VectorGroupByOperator for reduce where each input VectorizedRowBatch has only values for one key at a time. Thus, the values in the batch can be aggregated quickly.

  was:
Take advantage of the fact that in most plans a reduce-side GroupBy will get the group keys in sorted order so aggregation can be done "streaming" and not require large buffering of intermediate aggregation in memory/storage.

Push any case requiring large buffering -- e.g. COUNT(DISTINCT(..)) -- to part 2 of Vectorize Reduce-Side GroupBy.

In theory, if there is only one COUNT(DISTINCT(..)) the optimizer could arrange for sorting on the distinct column(s) as subordinate sort key and do the count of each distinct column(s) as a "streaming" operation. Then, only multiple COUNT(DISTINCT(..)) would require large buffering.

        Summary: Vectorize GROUP BY on the Reduce-Side (Part 1 – Basic)  (was: Vectorize Reduce-Side GroupBy)

> Vectorize GROUP BY on the Reduce-Side (Part 1 – Basic)
> ------------------------------------------------------
>
>                 Key: HIVE-7405
>                 URL: https://issues.apache.org/jira/browse/HIVE-7405
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>
> Vectorize the basic case that does not have any count distinct aggregation.
> Add a 4th processing mode in VectorGroupByOperator for reduce where each
> input VectorizedRowBatch has only values for one key at a time. Thus, the
> values in the batch can be aggregated quickly.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
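
For readers unfamiliar with the idea, the following is a minimal sketch of the "one key per batch" streaming aggregation that the description refers to. It is not Hive code: KeyedBatch, GroupResult, and StreamingGroupBySketch are hypothetical names invented for illustration, and the sketch assumes that batches arrive grouped by key (as on the reduce side after the shuffle sort) and that a single key may span several consecutive batches. Only one accumulator is live at a time, so nothing needs to be buffered per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative only: these types are NOT part of Hive's API.
final class StreamingGroupBySketch {

    /** Hypothetical stand-in for a batch whose rows all share one group key. */
    static final class KeyedBatch {
        final String key;
        final long[] values;

        KeyedBatch(String key, long[] values) {
            this.key = key;
            this.values = values;
        }
    }

    /** One finished group: the key with its SUM and COUNT. */
    static final class GroupResult {
        final String key;
        final long sum;
        final long count;

        GroupResult(String key, long sum, long count) {
            this.key = key;
            this.sum = sum;
            this.count = count;
        }
    }

    /**
     * Aggregates batches that arrive grouped by key. A group is emitted as
     * soon as the key changes, so only one accumulator is held in memory.
     */
    static List<GroupResult> aggregate(Iterable<KeyedBatch> batches) {
        List<GroupResult> out = new ArrayList<>();
        String currentKey = null;
        boolean haveKey = false;
        long sum = 0;
        long count = 0;

        for (KeyedBatch batch : batches) {
            // Key changed: flush the finished group and reset the accumulator.
            if (haveKey && !Objects.equals(currentKey, batch.key)) {
                out.add(new GroupResult(currentKey, sum, count));
                sum = 0;
                count = 0;
            }
            currentKey = batch.key;
            haveKey = true;

            // Every row in the batch belongs to the same key, so the inner
            // loop is a tight pass over the value array with no key lookup.
            for (long v : batch.values) {
                sum += v;
            }
            count += batch.values.length;
        }

        // Flush the last group, if any input was seen.
        if (haveKey) {
            out.add(new GroupResult(currentKey, sum, count));
        }
        return out;
    }
}
```

Feeding this sketch the batches ("a", [1, 2]), ("a", [3]), ("b", [10]) yields two groups: "a" with sum 6 and count 3, then "b" with sum 10 and count 1. COUNT(DISTINCT(..)) does not fit this pattern without extra sorting or buffering, which matches the issue's decision to leave that case out of Part 1.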