[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475379&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475379 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 27/Aug/20 15:58 Start Date: 27/Aug/20 15:58 Worklog Time Spent: 10m Work Description: kgyrtkirk opened a new pull request #1439: URL: https://github.com/apache/hive/pull/1439 ### What changes were proposed in this pull request? * changes to cost estimations - fixed a few bugs * changed Aggregate estimation to a different on HiveOnTezCostmodel * in HiveAggregateJoinTransposeRule relaxed the comparision with the new plan to use <= instead < ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 475379) Remaining Estimate: 0h Time Spent: 10m > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475953 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 28/Aug/20 19:14 Start Date: 28/Aug/20 19:14 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r479474959 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); Review comment: We could rely on `mq.areColumnsUnique`. ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); Review comment: Not sure about this change. If the algorithm is sort-based, you will still sort the complete input, right? ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); +for (ImmutableBitSet u : uKeys) { + if (groups.contains(u)) { +return true; + } +} +if (input instanceof Join) { + Join join = (Join) input; + RexBuilder rexBuilder = input.getCluster().getRexBuilder(); + SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder); + + if (cond.valid) { +ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields)); +RelNode l = join.getLeft(); +RelNode r = join.getRight(); + +int joinFieldCount = join.getRowType().getFieldCount(); +int lFieldCount = l.getRowType().getFieldCount(); + +ImmutableBitSet groupL = newGroup.get(0, lFieldCount); +ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount); + +if (isGroupingUnique(l, groupL)) { Review comment: This could call `mq.areColumnsUnique` instead of making the recursive call. ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) { RelNode r = relBuilder.build(); RelOptCost afterCost = mq.getCumulativeCost(r); RelOptCost beforeCost = mq.getCumulativeCost(aggregate); - if (afterCost.isLt(beforeCost)) { Review comment: I think you suggested changing this... Maybe `isLe` if we do not introduce an additional aggregate on top? ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) { RelNode r = relBuilder.build(); RelOptCost afterCost = mq.getCumulativeCost(r); RelOptCost beforeCost = mq.getCumulativeCost(aggrega
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475964 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 28/Aug/20 20:13 Start Date: 28/Aug/20 20:13 Worklog Time Spent: 10m Work Description: vineetgarg02 commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r479508618 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); Review comment: Can we change `rowCount` to `outputRowCount`? This will make the change more readable. ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); // 3. IO cost = cost of writing intermediary results to local FS + // cost of reading from local FS for transferring to GBy + // cost of transferring map outputs to GBy operator final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput()); if (rAverageSize == null) { return null; } - final double ioCost = algoUtils.computeSortIOCost(new Pair(rCount,rAverageSize)); + final double ioCost = algoUtils.computeSortIOCost(new Pair(rowCount, rAverageSize)); Review comment: `rAverageSize` is based on input row count but `rowCount` is output row count. Is this intended or should average row size be computed based on output row count? ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); Review comment: If the purpose of this method is to determine that given a set of columns are unique or not you can use `areColumnsUnique` as @jcamachor suggested. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 475964) Time Spent: 0.5h (was: 20m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477882&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477882 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 13:58 Start Date: 02/Sep/20 13:58 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482090289 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); +for (ImmutableBitSet u : uKeys) { + if (groups.contains(u)) { +return true; + } +} +if (input instanceof Join) { + Join join = (Join) input; + RexBuilder rexBuilder = input.getCluster().getRexBuilder(); + SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder); + + if (cond.valid) { +ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields)); +RelNode l = join.getLeft(); +RelNode r = join.getRight(); + +int joinFieldCount = join.getRowType().getFieldCount(); +int lFieldCount = l.getRowType().getFieldCount(); + +ImmutableBitSet groupL = newGroup.get(0, lFieldCount); +ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount); + +if (isGroupingUnique(l, groupL)) { Review comment: this method does a bit different thing - honestly I feeled like I'm in trouble when I've given this name to it :) this method checks if the given columns contain an unique column somewhere in the covered joins; (this still sound fuzzy) so let's take an example consider: ``` select c_id, sum(i_prize) from customer c join item i on(i.c_id=c.c_id) ``` * do an aggregate grouping by the column C_ID ; and sum up something * below is a join which joins by C_ID * asking wether C_ID is a unique column on top of the join is false; but there is subtree in which C_ID is unique => so if we push the aggregate on that branch the aggregation will be a no-op I think this case is not handled by `areColumnsUnique` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 477882) Time Spent: 40m (was: 0.5h) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477884 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 14:00 Start Date: 02/Sep/20 14:00 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482092255 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) { RelNode r = relBuilder.build(); RelOptCost afterCost = mq.getCumulativeCost(r); RelOptCost beforeCost = mq.getCumulativeCost(aggregate); - if (afterCost.isLt(beforeCost)) { Review comment: yes; if we use `isLe` the current cost model which only takes rowcount into account will prefer the pushing aggregates further there is another alternative to the `force` based approach: the rule can be configured to use the more advanced cost system - so that it could take cpu/io cost into account This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 477884) Time Spent: 50m (was: 40m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477900&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477900 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 14:35 Start Date: 02/Sep/20 14:35 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482119043 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); Review comment: maybe...I'm trying to catch the case when `inputRowCount >> outputRowCount`; we are also grouping - so it will not be a full sort at all ; I was using the above to achieve: ``` log(outputRowCount)*outputRowCount + inputRowCount*COST ``` the rational behind this is that it needs to really sort `oRC` and read `iRC` rows - this could be an underestimation...but `log(iRC)*iRC` was highly overestimating the cost one alternative for the above could be to use: ``` log(outputRowCount) * inputRowCount ``` the rational behind this: we will need to find the place for every input row; but we also know that the output will be at most `outputRowCount` - so it shouldn't take more time to find the place for the actual row than `log(outputRowCount)` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 477900) Time Spent: 1h (was: 50m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477907&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477907 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 14:41 Start Date: 02/Sep/20 14:41 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482123540 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); // 3. IO cost = cost of writing intermediary results to local FS + // cost of reading from local FS for transferring to GBy + // cost of transferring map outputs to GBy operator final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput()); if (rAverageSize == null) { return null; } - final double ioCost = algoUtils.computeSortIOCost(new Pair(rCount,rAverageSize)); + final double ioCost = algoUtils.computeSortIOCost(new Pair(rowCount, rAverageSize)); Review comment: if we will be doing a 2 phase groupby: every mapper will do some grouping before it starts emitting; in case `iRC >> oRC` the mappers could eliminate a lot of rows ; and they will most likely utilize `O(oRC)` io this is an underestimation ; I wanted to multiply it with the number of mappers - but I don't think that's known at this pointI can add a config key for a fixed multiplier. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 477907) Time Spent: 1h 10m (was: 1h) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=477992&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-477992 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 17:27 Start Date: 02/Sep/20 17:27 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482243376 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); +for (ImmutableBitSet u : uKeys) { + if (groups.contains(u)) { +return true; + } +} +if (input instanceof Join) { + Join join = (Join) input; + RexBuilder rexBuilder = input.getCluster().getRexBuilder(); + SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder); + + if (cond.valid) { +ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields)); +RelNode l = join.getLeft(); +RelNode r = join.getRight(); + +int joinFieldCount = join.getRowType().getFieldCount(); +int lFieldCount = l.getRowType().getFieldCount(); + +ImmutableBitSet groupL = newGroup.get(0, lFieldCount); +ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount); + +if (isGroupingUnique(l, groupL)) { Review comment: Could you execute `areColumnsUnique` on the join input then? Wouldn't that simplify this logic? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 477992) Time Spent: 1h 20m (was: 1h 10m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=478065&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478065 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 02/Sep/20 18:49 Start Date: 02/Sep/20 18:49 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r482301409 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); Review comment: I think the problem is that we are trying to encapsulate here the algorithm selection too: The fact that we are grouping in each node before sorting the data (I think this is also somehow reflected in the `isLe` discussion above). However, that is not represented with precision by current model, since output rows is supposed to be the output of the final step in the aggregation. Wrt read, there is also the IO part of the cost, I am trying to understand whether some of the cost representation that you are talking about is IO. There is some more info about the original formulas that were used to compute this here: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive Can we split this into two patches and have the changes to the cost model on their own? This should also help to discuss this in more detail. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478065) Time Spent: 1.5h (was: 1h 20m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479629&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479629 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 07/Sep/20 13:03 Start Date: 07/Sep/20 13:03 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r484419776 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) { } } + /** + * Determines weather the give grouping is unique. + * + * Consider a join which might produce non-unique rows; but later the results are aggregated again. + * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s). + */ + private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) { +if (groups.isEmpty()) { + return false; +} +RelMetadataQuery mq = input.getCluster().getMetadataQuery(); +Set uKeys = mq.getUniqueKeys(input); +for (ImmutableBitSet u : uKeys) { + if (groups.contains(u)) { +return true; + } +} +if (input instanceof Join) { + Join join = (Join) input; + RexBuilder rexBuilder = input.getCluster().getRexBuilder(); + SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder); + + if (cond.valid) { +ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields)); +RelNode l = join.getLeft(); +RelNode r = join.getRight(); + +int joinFieldCount = join.getRowType().getFieldCount(); +int lFieldCount = l.getRowType().getFieldCount(); + +ImmutableBitSet groupL = newGroup.get(0, lFieldCount); +ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount); + +if (isGroupingUnique(l, groupL)) { Review comment: this method recursively checks that the above condition is satisfied or not - that's why it needs to call itself This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 479629) Time Spent: 1h 40m (was: 1.5h) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479641 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 07/Sep/20 13:33 Start Date: 07/Sep/20 13:33 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r484434865 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java ## @@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) { } else { final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery(); // 1. Sum of input cardinalities - final Double rCount = mq.getRowCount(aggregate.getInput()); - if (rCount == null) { + final Double inputRowCount = mq.getRowCount(aggregate.getInput()); + final Double rowCount = mq.getRowCount(aggregate); + if (inputRowCount == null || rowCount == null) { return null; } // 2. CPU cost = sorting cost - final double cpuCost = algoUtils.computeSortCPUCost(rCount); + final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost(); Review comment: sure; I'll open a separate ticket for the cost model changes This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 479641) Time Spent: 1h 50m (was: 1h 40m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates
[ https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=479644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-479644 ] ASF GitHub Bot logged work on HIVE-24084: - Author: ASF GitHub Bot Created on: 07/Sep/20 13:42 Start Date: 07/Sep/20 13:42 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1439: URL: https://github.com/apache/hive/pull/1439#discussion_r484439075 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java ## @@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) { RelNode r = relBuilder.build(); RelOptCost afterCost = mq.getCumulativeCost(r); RelOptCost beforeCost = mq.getCumulativeCost(aggregate); - if (afterCost.isLt(beforeCost)) { + boolean shouldForceTransform = isGroupingUnique(join, aggregate.getGroupSet()); Review comment: I've added a config: `hive.transpose.aggr.join.unique` to enable/disable this feature This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 479644) Time Spent: 2h (was: 1h 50m) > Enhance cost model to push down more Aggregates > --- > > Key: HIVE-24084 > URL: https://issues.apache.org/jira/browse/HIVE-24084 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)