[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Phabricator (Updated) (JIRA) Thu, 01 Dec 2011 18:37:07 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Phabricator updated HIVE-2621:
------------------------------

    Attachment: HIVE-2621.D567.1.patch

kevinwilfong requested code review of "HIVE-2621 [jira] Allow multiple group 
bys with the same input data and spray keys to be run on the same reducer.".
Reviewers: JIRA

  The meaningful changes are all in how the plan is generated.

  If the conf variable has been set, the subclauses are first grouped by their 
group by keys and distinct keys.  To facilitate this I added a wrapper class 
for ExprNodeDesc whose equals method behaves like the isSame method.
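
  A minimal sketch of the grouping idea, not the code from the patch.  Expr, 
ExprWrapper and GroupByKeys are hypothetical stand-ins (Expr is not Hive's 
ExprNodeDesc), and comparing key expressions as a set is an assumption made 
for illustration.  The point is that a wrapper whose equals delegates to an 
isSame-style comparison can serve as a hash key:

```java
import java.util.*;

// Hypothetical stand-in for an expression node; NOT Hive's ExprNodeDesc.
final class Expr {
    final String text;
    Expr(String text) { this.text = text; }

    // Structural comparison, playing the role of an isSame method.
    boolean isSame(Object o) {
        return o instanceof Expr && ((Expr) o).text.equals(text);
    }
}

// Wrapper whose equals delegates to isSame, so it can be used as a hash key.
final class ExprWrapper {
    final Expr expr;
    ExprWrapper(Expr expr) { this.expr = expr; }

    @Override public boolean equals(Object o) {
        return o instanceof ExprWrapper && expr.isSame(((ExprWrapper) o).expr);
    }

    // Must agree with equals: expressions that are "the same" hash alike.
    @Override public int hashCode() { return expr.text.hashCode(); }
}

final class GroupByKeys {
    // Groups subclause names by the set of their wrapped key expressions.
    static Map<Set<ExprWrapper>, List<String>> group(Map<String, List<Expr>> keysByClause) {
        Map<Set<ExprWrapper>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, List<Expr>> e : keysByClause.entrySet()) {
            Set<ExprWrapper> key = new HashSet<>();
            for (Expr x : e.getValue()) {
                key.add(new ExprWrapper(x));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```

  Subclauses whose key expressions compare as the same end up in one list, and 
each list corresponds to one candidate group for a shared reducer.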

  If the conf variable is not set, I create a single group containing all of 
the subclauses.

  Then, provided certain conditions are met (e.g. the conf variable is set, 
there is a group by, there are aggregations, and the skew conf variable has not 
been set), I create the new plan for each group; otherwise the old plan is 
produced.
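
  The decision above can be sketched as a single predicate.  This is only an 
illustration: the class, method and parameter names are hypothetical, and the 
conditions are the ones given as examples above, so the actual check in the 
patch may include more.

```java
final class PlanChoice {
    // Hypothetical helper: true when the new single-reducer plan would be used.
    // Only the example conditions from the description are modelled here.
    static boolean useSingleReducerPlan(boolean confSet,
                                        boolean hasGroupBy,
                                        boolean hasAggregations,
                                        boolean skewConfSet) {
        return confSet && hasGroupBy && hasAggregations && !skewConfSet;
    }
}
```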

  To start I generate the common filter by 'or'ing the group's clauses' 
filters.  This goes into a select operator, which goes into a new reduce 
operator.  The reduce operator is like the typical 1 MR group by reduce 
operator, except that to generate the reduce values it loops over each of the 
group's subclauses' aggregations and the columns used in the where clauses.
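
  A rough sketch of 'or'ing the filters, using java.util.function.Predicate 
in place of Hive's expression trees.  CommonFilter and orAll are hypothetical 
names.  Treating a missing filter (null) as "no common filter" is an 
assumption, based on the issue description's note that the filters are 
combined provided each subclause has one.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class CommonFilter {
    // Returns the disjunction of all subclause filters.
    // Returns Optional.empty() if the list is empty or any subclause has no
    // filter (null), because that subclause needs every row.
    static <T> Optional<Predicate<T>> orAll(List<Predicate<T>> filters) {
        Predicate<T> combined = null;
        for (Predicate<T> f : filters) {
            if (f == null) {
                return Optional.empty();
            }
            combined = (combined == null) ? f : combined.or(f);
        }
        return Optional.ofNullable(combined);
    }
}
```

  A row that satisfies any one subclause's filter passes the common filter; 
each subclause's own filter is still applied again on the reducer side.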

  This goes into a forward operator.  For each subclause, the forward operator 
has a child filter operator (only if the subclause has a filter) followed by a 
group by operator.  Each group by operator is followed by the operators which 
would normally follow it in a plan.
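
  The reducer-side shape can be pictured with a small sketch that lists one 
operator chain per subclause.  PlanSketch and the operator labels are 
hypothetical and only mirror the description above; they are not Hive's 
operator classes.

```java
import java.util.ArrayList;
import java.util.List;

final class PlanSketch {
    // hasFilter.get(i) says whether subclause i has a filter.
    // Every chain starts at the shared forward operator.
    static List<String> reducerBranches(List<Boolean> hasFilter) {
        List<String> branches = new ArrayList<>();
        for (int i = 0; i < hasFilter.size(); i++) {
            StringBuilder b = new StringBuilder("Forward");
            if (hasFilter.get(i)) {
                b.append(" -> Filter[").append(i).append("]");
            }
            // "..." stands for whatever operators normally follow the group by.
            b.append(" -> GroupBy[").append(i).append("] -> ...");
            branches.add(b.toString());
        }
        return branches;
    }
}
```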

TEST PLAN
  I added some unit tests.

  I verified these unit tests and the old unit tests all passed.

  I created a sample query consisting of a multi-insert from a table with 
1,000,000 rows into 6 tables.  Each subclause consisted of a group by and a 
count distinct, as well as some other aggregations and having clauses.  The 
subclauses were constructed such that they could be grouped into two reducers 
using the new plan.  I also ensured that the data was such that map aggregation 
was turned off early using the existing plan.  I verified that this query saw a 
significant improvement in its CPU usage.

REVISION DETAIL
  https://reviews.facebook.net/D567

AFFECTED FILES
  conf/hive-default.xml
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
  ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
  ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

                
> Allow multiple group bys with the same input data and spray keys to be run on 
> the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each 
> insertion subclause consists of a simple query followed by a group by, the 
> group bys for each clause are run on a separate reducer.  This requires 
> writing the data for each group by clause to an intermediate file, and then 
> reading it back.  This uses a significant amount of the total CPU consumed by 
> the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by 
> keys, with all of the group by expressions for a group of subclauses run on a 
> single reducer, this would reduce the amount of reading/writing to 
> intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute 
> the filters for each subclause 'or'd together (provided each subclause has a 
> filter) followed by a reduce sink.  In the reducer, the child operators would 
> be each subclause's filter followed by the group by and any subsequent 
> operations.
> Note that this would require turning off map aggregation, so we would need to 
> make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
