[ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174945#comment-13174945
 ] 

Kevin Wilfong commented on HIVE-2621:
-------------------------------------

There are two ways of getting common distincts.  The existing code checks 
that all distinct expressions in the subqueries are the same.  My new code 
doesn't depend on this; it tries to construct subsets of the subqueries such 
that this is true for each subset.

The advantage of doing it in the form
if (optimizeMultiGroupBy) {
  ...
} else {
  <group queries by common distinct and group by expressions>
  for each group:
    if (size of group > 1 && etc.) {
      <new code>
    } else {
      <old code>
    }
}

is that the block of code inside the optimizeMultiGroupBy if statement can 
produce 2 map reduce jobs in cases where the new code might produce many.
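
As a rough illustration of the grouping step in the else branch, here is a 
minimal Python sketch (not Hive's actual code).  The names Subquery, 
group_subqueries, and plan are hypothetical, and the "etc." conditions from 
the outline are reduced to the group-size check.

```python
from collections import defaultdict
from typing import NamedTuple


class Subquery(NamedTuple):
    # Hypothetical stand-in for one insert subclause.
    name: str
    distinct_exprs: frozenset
    group_by_keys: tuple


def group_subqueries(subqueries):
    """Partition subqueries so that every group shares the same distinct
    expressions and group by keys."""
    groups = defaultdict(list)
    for sq in subqueries:
        groups[(sq.distinct_exprs, sq.group_by_keys)].append(sq.name)
    return list(groups.values())


def plan(subqueries):
    """Pick a code path per group, mirroring the else branch of the outline."""
    return [("new" if len(group) > 1 else "old", group)
            for group in group_subqueries(subqueries)]


subqueries = [
    Subquery("q1", frozenset({"uid"}), ("country",)),
    Subquery("q2", frozenset({"uid"}), ("country",)),
    Subquery("q3", frozenset({"sid"}), ("country",)),
]
print(plan(subqueries))
```

Here q1 and q2 share their distinct expressions and group by keys, so they 
form one group that takes the new path, while q3 falls back to the old path.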

After looking at it more carefully, I can get rid of the singlemrMultiGroupBy 
if statement and the code within its block, because it produces the same 
result as my new code would, except that the new code can also handle filters.

After removing that code, the only remaining code above the if statement will 
be the poorly named getCommonDistinctExprs (as it only returns the common 
distinct expressions provided a lot of conditions are met including a 
requirement that all the distinct expressions are common), which I should be 
able to modify to use my new code.
                
> Allow multiple group bys with the same input data and spray keys to be run on 
> the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, 
> HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each 
> insertion subclause consists of a simple query followed by a group by, the 
> group bys for each clause are run on a separate reducer.  This requires 
> writing the data for each group by clause to an intermediate file, and then 
> reading it back.  This uses a significant amount of the total CPU consumed by 
> the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by 
> keys, with all of the group by expressions for a group of subclauses run on a 
> single reducer, this would reduce the amount of reading/writing to 
> intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute 
> the filters for each subclause 'or'd together (provided each subclause has a 
> filter) followed by a reduce sink.  In the reducer, the child operators would 
> be each subclause's filter followed by the group by and any subsequent 
> operations.
> Note that this would require turning off map aggregation, so we would need to 
> make using this type of plan configurable.
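
The plan described in the quoted issue can be simulated in a few lines.  This 
is a minimal Python sketch of the idea, not Hive code: mapper_filter, 
run_group, and the sample rows are hypothetical, and the sketch assumes every 
subclause in the group shares one group by key and computes a simple count.

```python
from collections import Counter


def mapper_filter(filters):
    """OR the subclause filters together. If any subclause has no filter,
    every row must be forwarded to the reducer."""
    if any(f is None for f in filters.values()):
        return lambda row: True
    return lambda row: any(f(row) for f in filters.values())


def run_group(rows, filters, key):
    """Simulate one shared reduce stage for a group of subclauses."""
    keep = mapper_filter(filters)
    # Mapper side: apply the combined filter, then hand rows to the reduce sink.
    shuffled = [r for r in rows if keep(r)]
    results = {}
    # Reducer side: each subclause re-applies its own filter, then groups.
    for name, f in filters.items():
        matching = (r for r in shuffled if f is None or f(r))
        results[name] = Counter(r[key] for r in matching)
    return results


rows = [
    {"country": "US", "age": 17},
    {"country": "US", "age": 30},
    {"country": "FR", "age": 70},
    {"country": "FR", "age": 40},
]
filters = {
    "minors": lambda r: r["age"] < 18,
    "seniors": lambda r: r["age"] >= 65,
}
results = run_group(rows, filters, "country")
print(results)
```

All aggregation in this sketch happens on the reducer side, which matches the 
note in the issue that this kind of plan requires turning off map aggregation.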
