[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-08-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475379&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475379
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 27/Aug/20 15:58
Start Date: 27/Aug/20 15:58
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk opened a new pull request #1439:
URL: https://github.com/apache/hive/pull/1439


   
   
   ### What changes were proposed in this pull request?
   
   * changes to cost estimations - fixed a few bugs
   * changed Aggregate estimation to a different on HiveOnTezCostmodel
   * in HiveAggregateJoinTransposeRule relaxed the comparision with the new 
plan to use <= instead < 
   
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475379)
Remaining Estimate: 0h
Time Spent: 10m

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475953
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 19:14
Start Date: 28/Aug/20 19:14
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479474959



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);

Review comment:
   We could rely on `mq.areColumnsUnique`.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
   Not sure about this change. If the algorithm is sort-based, you will 
still sort the complete input, right?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);
+for (ImmutableBitSet u : uKeys) {
+  if (groups.contains(u)) {
+return true;
+  }
+}
+if (input instanceof Join) {
+  Join join = (Join) input;
+  RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+  SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), 
rexBuilder);
+
+  if (cond.valid) {
+ImmutableBitSet newGroup = 
groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+RelNode l = join.getLeft();
+RelNode r = join.getRight();
+
+int joinFieldCount = join.getRowType().getFieldCount();
+int lFieldCount = l.getRowType().getFieldCount();
+
+ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+ImmutableBitSet groupR = newGroup.get(lFieldCount, 
joinFieldCount).shift(-lFieldCount);
+
+if (isGroupingUnique(l, groupL)) {

Review comment:
   This could call `mq.areColumnsUnique` instead of making the recursive 
call.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
   RelNode r = relBuilder.build();
   RelOptCost afterCost = mq.getCumulativeCost(r);
   RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-  if (afterCost.isLt(beforeCost)) {

Review comment:
   I think you suggested changing this... Maybe `isLe` if we do not 
introduce an additional aggregate on top?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
   RelNode r = relBuilder.build();
   RelOptCost afterCost = mq.getCumulativeCost(r);
   RelOptCost beforeCost = mq.getCumulativeCost(aggrega

[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475964
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 20:13
Start Date: 28/Aug/20 20:13
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479508618



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);

Review comment:
   Can we change `rowCount` to `outputRowCount`? This will make the change 
more readable.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();
   // 3. IO cost = cost of writing intermediary results to local FS +
   //  cost of reading from local FS for transferring to GBy +
   //  cost of transferring map outputs to GBy operator
   final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput());
   if (rAverageSize == null) {
 return null;
   }
-  final double ioCost = algoUtils.computeSortIOCost(new 
Pair(rCount,rAverageSize));
+  final double ioCost = algoUtils.computeSortIOCost(new Pair(rowCount, rAverageSize));

Review comment:
   `rAverageSize` is based on input row count but `rowCount` is output row 
count. Is this intended or should average row size be computed based on output 
row count?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);

Review comment:
   If the purpose of this method is to determine that given a set of 
columns are unique or not you can use `areColumnsUnique` as @jcamachor  
suggested.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475964)
Time Spent: 0.5h  (was: 20m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477882&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477882
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 13:58
Start Date: 02/Sep/20 13:58
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482090289



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);
+for (ImmutableBitSet u : uKeys) {
+  if (groups.contains(u)) {
+return true;
+  }
+}
+if (input instanceof Join) {
+  Join join = (Join) input;
+  RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+  SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), 
rexBuilder);
+
+  if (cond.valid) {
+ImmutableBitSet newGroup = 
groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+RelNode l = join.getLeft();
+RelNode r = join.getRight();
+
+int joinFieldCount = join.getRowType().getFieldCount();
+int lFieldCount = l.getRowType().getFieldCount();
+
+ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+ImmutableBitSet groupR = newGroup.get(lFieldCount, 
joinFieldCount).shift(-lFieldCount);
+
+if (isGroupingUnique(l, groupL)) {

Review comment:
   this method does a bit different thing - honestly I feeled like I'm in 
trouble when I've given this name to it :)
   
   this method checks if the given columns contain an unique column somewhere 
in the covered joins; (this still sound fuzzy) so let's take an example
   
   consider:
   ```
   select c_id, sum(i_prize) from customer c join item i on(i.c_id=c.c_id)
   ```
   
   * do an aggregate grouping by the column C_ID  ; and sum up something 
   * below is a join which joins by C_ID
   * asking wether C_ID  is a unique column on top of the join is false; but 
there is subtree in which C_ID is unique => so if we push the aggregate on that 
branch the aggregation will be a no-op
   
   I think this case is not handled by `areColumnsUnique`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 477882)
Time Spent: 40m  (was: 0.5h)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477884
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 14:00
Start Date: 02/Sep/20 14:00
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482092255



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
   RelNode r = relBuilder.build();
   RelOptCost afterCost = mq.getCumulativeCost(r);
   RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-  if (afterCost.isLt(beforeCost)) {

Review comment:
   yes; if we use `isLe` the current cost model which only takes rowcount 
into account will prefer the pushing aggregates further
   
   there is another alternative to the `force` based approach: the rule can be 
configured to use the more advanced cost system - so that it could take cpu/io 
cost into account





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 477884)
Time Spent: 50m  (was: 40m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477900&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477900
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 14:35
Start Date: 02/Sep/20 14:35
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482119043



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
   maybe...I'm trying to catch the case when `inputRowCount >> 
outputRowCount`; we are also grouping - so it will not be a full sort at all ; 
I was using the above to achieve:
   ```
   log(outputRowCount)*outputRowCount + inputRowCount*COST
   ```
   the rational behind this is that it needs to really sort `oRC` and read 
`iRC` rows - this could be an underestimation...but `log(iRC)*iRC` was highly 
overestimating the cost
   
   one alternative for the above could be to use:
   ```
   log(outputRowCount) * inputRowCount
   ```
   
   the rational behind this:
   we will need to find the place for every input row; but we also know that 
the output will be at most `outputRowCount` - so it shouldn't take more time to 
find the place for the actual row than `log(outputRowCount)`
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 477900)
Time Spent: 1h  (was: 50m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477907&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477907
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 14:41
Start Date: 02/Sep/20 14:41
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482123540



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();
   // 3. IO cost = cost of writing intermediary results to local FS +
   //  cost of reading from local FS for transferring to GBy +
   //  cost of transferring map outputs to GBy operator
   final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput());
   if (rAverageSize == null) {
 return null;
   }
-  final double ioCost = algoUtils.computeSortIOCost(new 
Pair(rCount,rAverageSize));
+  final double ioCost = algoUtils.computeSortIOCost(new Pair(rowCount, rAverageSize));

Review comment:
   if we will be doing a 2 phase groupby: every mapper will do some 
grouping before it starts emitting; in case `iRC >> oRC` the mappers could 
eliminate a lot of rows ; and they will most likely utilize `O(oRC)` io
   
   this is an underestimation ; I wanted to multiply it with the number of 
mappers - but I don't think that's known at this pointI can add a config 
key for a fixed multiplier.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 477907)
Time Spent: 1h 10m  (was: 1h)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477992&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477992
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 17:27
Start Date: 02/Sep/20 17:27
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482243376



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);
+for (ImmutableBitSet u : uKeys) {
+  if (groups.contains(u)) {
+return true;
+  }
+}
+if (input instanceof Join) {
+  Join join = (Join) input;
+  RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+  SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), 
rexBuilder);
+
+  if (cond.valid) {
+ImmutableBitSet newGroup = 
groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+RelNode l = join.getLeft();
+RelNode r = join.getRight();
+
+int joinFieldCount = join.getRowType().getFieldCount();
+int lFieldCount = l.getRowType().getFieldCount();
+
+ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+ImmutableBitSet groupR = newGroup.get(lFieldCount, 
joinFieldCount).shift(-lFieldCount);
+
+if (isGroupingUnique(l, groupL)) {

Review comment:
   Could you execute `areColumnsUnique` on the join input then? Wouldn't 
that simplify this logic?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 477992)
Time Spent: 1h 20m  (was: 1h 10m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=478065&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478065
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 02/Sep/20 18:49
Start Date: 02/Sep/20 18:49
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482301409



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
   I think the problem is that we are trying to encapsulate here the 
algorithm selection too: The fact that we are grouping in each node before 
sorting the data (I think this is also somehow reflected in the `isLe` 
discussion above). However, that is not represented with precision by current 
model, since output rows is supposed to be the output of the final step in the 
aggregation.
   Wrt read, there is also the IO part of the cost, I am trying to understand 
whether some of the cost representation that you are talking about is IO.
   There is some more info about the original formulas that were used to 
compute this here: 
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive
   Can we split this into two patches and have the changes to the cost model on 
their own? This should also help to discuss this in more detail.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478065)
Time Spent: 1.5h  (was: 1h 20m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479629&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479629
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 07/Sep/20 13:03
Start Date: 07/Sep/20 13:03
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484419776



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the 
results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping 
which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set uKeys = mq.getUniqueKeys(input);
+for (ImmutableBitSet u : uKeys) {
+  if (groups.contains(u)) {
+return true;
+  }
+}
+if (input instanceof Join) {
+  Join join = (Join) input;
+  RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+  SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), 
rexBuilder);
+
+  if (cond.valid) {
+ImmutableBitSet newGroup = 
groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+RelNode l = join.getLeft();
+RelNode r = join.getRight();
+
+int joinFieldCount = join.getRowType().getFieldCount();
+int lFieldCount = l.getRowType().getFieldCount();
+
+ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+ImmutableBitSet groupR = newGroup.get(lFieldCount, 
joinFieldCount).shift(-lFieldCount);
+
+if (isGroupingUnique(l, groupL)) {

Review comment:
   this method recursively checks that the above condition is satisfied or 
not - that's why it needs to call itself





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 479629)
Time Spent: 1h 40m  (was: 1.5h)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479641
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 07/Sep/20 13:33
Start Date: 07/Sep/20 13:33
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484434865



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
   sure; I'll open a separate ticket for the cost model changes





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 479641)
Time Spent: 1h 50m  (was: 1h 40m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-09-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479644
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 07/Sep/20 13:42
Start Date: 07/Sep/20 13:42
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484439075



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
   RelNode r = relBuilder.build();
   RelOptCost afterCost = mq.getCumulativeCost(r);
   RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-  if (afterCost.isLt(beforeCost)) {
+  boolean shouldForceTransform = isGroupingUnique(join, 
aggregate.getGroupSet());

Review comment:
   I've added a config: `hive.transpose.aggr.join.unique` to enable/disable 
this feature





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 479644)
Time Spent: 2h  (was: 1h 50m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)