Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16242 )
Change subject: IMPALA-9979: part 2: partitioned top-n ...................................................................... Patch Set 30: (21 comments) http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@49 PS30, Line 49: The > nit: 'This' Done http://gerrit.cloudera.org:8080/#/c/16242/30//COMMIT_MSG@57 PS30, Line 57: both : the partitioning > nit: ..and non-partitioned case ? Done http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.h@133 PS30, Line 133: partition are returned > Would be good to add how the returned partitions are ordered among themselv It seems worth being clear about the invariant, so added it here. http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@53 PS30, Line 53: the the > nit: repetition 'the' D'oh http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@59 PS30, Line 59: the the > nit: repetition 'the' D'oh http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@398 PS30, Line 398: RETURN_IF_CANCELLED(state); > I was wondering if the inner while loop (line 418) should also have a cance The inner loop is bounded by the batch size so should be fine (the general rule has been to check for cancellation at O(batch size) intervals). http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@559 PS30, Line 559: // We evict heaps starting with the heaps that were least effective at filtering > The eviction policy is interesting. Considering that a rank()/rownum() can Maybe I didn't quite get the point, but we should be able to start filtering as soon as we have N elements in a heap for the partition - after that point, each new row is either in the top N we've seen so far (in which case we discard a previous row) or it's not (in which case we discard the new row). http://gerrit.cloudera.org:8080/#/c/16242/30/be/src/exec/topn-node.cc@666 PS30, Line 666: // The key references memory in 'tuple_pool_'. Replace it with a rematerialized tuple. > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/analysis/SortInfo.java@279 PS30, Line 279: public long estimateTopNMaterializedSize(long totalRows) { > Should this estimate account for any per-partition overhead (additional met We could probably add 100 bytes or so to account for the size of the heap, but I feel like it probably doesn't really affect estimates all that much and adds a bit of complexity. I did realise that the name of this method is a bit misleading, so renamed it to reflect that it's just estimating the size of that many rows. http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@646 PS30, Line 646: limit on > nit: incomplete sentence Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java File fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@453 PS30, Line 453: if (limit.limit > 1) { > Maybe simplify this if-else to planNodeLimit = Math.max(limit.limit, 1) Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@461 PS30, Line 461: if (partitionByExprs.isEmpty()) { > There was a recent issue with constant partition-by and order-by exprs (htt I modified this to check for whether they are all literals and added a test. There's a bit of a distinction between "constant", which means assumed constant within a query and literals that are guaranteed constant, so I used the latter. http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@885 PS30, Line 885: /// Whether the source predicate was a simple < or <= inequality. > This sounded confusing at first ..I interpreted it to mean true for < (the Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java@928 PS30, Line 928: to . > nit: incomplete Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@1158 PS30, Line 1158: returns > nit: 'returns ties' Done http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/SelectNode.java File fe/src/main/java/org/apache/impala/planner/SelectNode.java: http://gerrit.cloudera.org:8080/#/c/16242/30/fe/src/main/java/org/apache/impala/planner/SelectNode.java@81 PS30, Line 81: selectNode.conjuncts_.addAll(conjuncts); > We should check if the child's conjuncts being added are unique compared to That's a good point. I don't understand exactly when duplicate conjuncts would or wouldn't be created, but it makes sense to handle it here. http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test File testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test: http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test@55 PS30, Line 55: rnk > nit: This extra rnk is not needed since doing a select * from the subquery Done http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test@755 PS30, Line 755: # Analytic predicate can be migrated into inline view, but not through the analytic > Is this true even if the rank predicate is on the top level subquery, not i I updated this comment to make it clearer which predicates it was referring to. int_col refers to the first_value() function but can't be pushed down. http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/analytic-rank-pushdown.test@1741 PS30, Line 1741: # Other predicates referencing rank() and row_number() can't be > For the cases where the optimization is not applied maybe the DISTRIBUTEDPL That's a great point. http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test: http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test@76 PS30, Line 76: # row_number() predicate on equality instead of range. > In the previous patch (IMPALA-10296) equality predicate was disallowed exce The comment needed updating. The limit from the final TOP-N is not in fact pushed down here because it isn't safe. http://gerrit.cloudera.org:8080/#/c/16242/30/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-partitioned-top-n.test@777 PS30, Line 777: # Limit pushdown should not be applied. > But partitioned top-n is created. It would be good to clarify if (a) no op Done -- To view, visit http://gerrit.cloudera.org:8080/16242 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic638af9495981d889a4cb7455a71e8be0eb1a8e5 Gerrit-Change-Number: 16242 Gerrit-PatchSet: 30 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Feb 2021 04:11:59 +0000 Gerrit-HasComments: Yes