Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 )
Change subject: IMPALA-10314: Optimize planning time for simple limits
......................................................................


Patch Set 3:

(4 comments)

Like it!

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h@226
PS3, Line 226: OPTIMIZE_SIMPLE_LIMIT
OPTIMIZE_SIMPLE_LIMIT_QUERY is probably more specific.


http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
For Parquet/ORC, it sounds like we could sample a few data files and read the number of rows from the metadata section of each. That would probably help speed up the pruning process.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
It sounds like we should allow the ORDER BY clause, since it does not increase or decrease the number of rows.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
             :   numRows += p.getNumFileDescriptors();
             :   prunedPartitions.add(p);
             :   if (numRows >= analyzer.getSimpleLimitStatus().second) {
             :     // here we only limit the partitions; later in HdfsScanNode we will
             :     // limit the file descriptors within a partition
             :     break;
             :   }
I wonder whether the prunedPartitions returned here are the only partitions that will be scanned at run time.
If so, I think that for this optimization to work, we should not allow any WHERE clause. The other point is that we may want to consider randomly picking a small subset of partitions, to reduce the chance of contention from multiple small-limit queries.


-- 
To view, visit http://gerrit.cloudera.org:8080/16723

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Nov 2020 17:35:46 +0000
Gerrit-HasComments: Yes
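
For reference, a rough sketch of the random-subset idea from the HdfsPartitionPruner.java comment. This is not code from the patch. PartitionInfo is a hypothetical stand-in for FeFsPartition, RandomLimitPruner and pickPartitions are made-up names, and the "at least one row per file" estimate is taken from the quoted loop. It shuffles a copy of the partition list and then applies the same cumulative file-count cutoff, so concurrent small-limit queries would tend to land on different partitions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for FeFsPartition; only the file count matters here.
final class PartitionInfo {
  final String name;
  final int numFileDescriptors;

  PartitionInfo(String name, int numFileDescriptors) {
    this.name = name;
    this.numFileDescriptors = numFileDescriptors;
  }
}

// Hypothetical helper, not part of Impala.
final class RandomLimitPruner {
  private RandomLimitPruner() {}

  // Returns a random subset of partitions whose total file count reaches
  // 'limit', assuming (as the quoted loop does) at least one row per file.
  // If all partitions together hold fewer files than 'limit', every
  // partition is returned.
  static List<PartitionInfo> pickPartitions(
      List<PartitionInfo> partitions, long limit, Random rng) {
    // Shuffle a copy so the caller's list is left untouched.
    List<PartitionInfo> shuffled = new ArrayList<>(partitions);
    Collections.shuffle(shuffled, rng);

    List<PartitionInfo> picked = new ArrayList<>();
    long numRows = 0;
    for (PartitionInfo p : shuffled) {
      numRows += p.numFileDescriptors;
      picked.add(p);
      if (numRows >= limit) break;
    }
    return picked;
  }
}
```

Passing the Random in keeps the selection reproducible in tests. Whether a partition subset chosen this way stays correct when a WHERE clause is present is the same open question raised in the comment above.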