Allow multiple group bys with the same input data and spray keys to be run on 
the same reducer.
-----------------------------------------------------------------------------------------------

                 Key: HIVE-2621
                 URL: https://issues.apache.org/jira/browse/HIVE-2621
             Project: Hive
          Issue Type: New Feature
            Reporter: Kevin Wilfong
            Assignee: Kevin Wilfong


Currently, when a user runs a query, such as a multi-insert, where each 
insertion subclause consists of a simple query followed by a group by, the 
group by for each clause is run on a separate reducer.  This requires writing 
the data for each group by clause to an intermediate file and then reading it 
back, which accounts for a significant share of the total CPU consumed by what 
is otherwise a simple query.

If the subclauses are grouped by their distinct expressions and group by keys, 
and all of the group bys in a group of subclauses are run on a single reducer, 
this would reduce the amount of reading/writing to intermediate files for some 
queries.
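
As a rough illustration of the grouping step, the sketch below buckets 
subclauses that share the same group by keys and distinct expressions.  It is 
not Hive code: the dictionary layout, the subclause names and the function 
group_subclauses are all hypothetical, and treating key order as irrelevant is 
an assumption made only for this example.

```python
from collections import defaultdict


def group_subclauses(subclauses):
    """Bucket subclauses that share group by keys and distinct expressions."""
    groups = defaultdict(list)
    for sc in subclauses:
        # Assumption of this sketch: the order of keys does not matter,
        # so each side of the signature is an unordered set.
        signature = (frozenset(sc["group_by"]), frozenset(sc["distinct"]))
        groups[signature].append(sc["name"])
    # Each inner list is a set of subclauses that could share one reducer.
    return list(groups.values())


# Hypothetical subclauses of a multi-insert query.
subclauses = [
    {"name": "insert_1", "group_by": ["key"], "distinct": []},
    {"name": "insert_2", "group_by": ["key"], "distinct": []},
    {"name": "insert_3", "group_by": ["value"], "distinct": ["key"]},
]
shared = group_subclauses(subclauses)
```

Here insert_1 and insert_2 land in the same bucket, while insert_3 stays on its 
own because its keys and distinct expressions differ.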

To do this, for each group of subclauses, the mapper would execute the filters 
for each subclause 'or'd together (provided each subclause has a filter), 
followed by a reduce sink.  In the reducer, the child operators would be each 
subclause's filter followed by the group by and any subsequent operations.
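
The proposed plan can be sketched as a small simulation.  This is only a toy 
model under stated assumptions, not Hive's operator tree: map_phase, 
reduce_phase, the sample rows and the two filters are hypothetical, and a 
simple row count stands in for whatever aggregation each group by performs.

```python
from collections import defaultdict


def map_phase(rows, filters, key_fn):
    """Emit (key, row) once for each row passing at least one filter."""
    for row in rows:
        if any(f(row) for f in filters):  # the subclause filters 'or'd together
            yield key_fn(row), row        # reduce sink: spray on the group by key


def reduce_phase(shuffled, filters):
    """Re-apply each subclause's own filter, then aggregate per key."""
    results = [defaultdict(int) for _ in filters]
    for key, row in shuffled:
        for i, f in enumerate(filters):
            if f(row):
                results[i][key] += 1  # stand-in for count(*)
    return [dict(r) for r in results]


# Hypothetical input shared by two subclauses that group by the same key.
rows = [
    {"key": "a", "value": 1},
    {"key": "a", "value": 5},
    {"key": "b", "value": 7},
    {"key": "b", "value": 3},
    {"key": "c", "value": 4},
]
filters = [lambda r: r["value"] > 4, lambda r: r["value"] < 2]

# Sorting by key imitates the shuffle between mapper and reducer.
shuffled = sorted(map_phase(rows, filters, lambda r: r["key"]),
                  key=lambda kv: kv[0])
counts = reduce_phase(shuffled, filters)
```

Each row crosses the shuffle once even when several subclauses need it, and the 
mapper forwards raw rows without any partial aggregation, since each subclause's 
filter is only applied on the reducer side.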

Note that this would require turning off map aggregation, so we would need to 
make using this type of plan configurable.


        
