[Impala-ASF-CR] IMPALA-10314: Optimize planning time for simple limits

Qifan Chen (Code Review) Tue, 17 Nov 2020 09:36:07 -0800

Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16723 )


Change subject: IMPALA-10314: Optimize planning time for simple limits
......................................................................


Patch Set 3:

(4 comments)

Like it!

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/16723/3/be/src/service/query-options.h@226
PS3, Line 226: OPTIMIZE_SIMPLE_LIMIT
OPTIMIZE_SIMPLE_LIMIT_QUERY would probably be a more specific name.


http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/16723/3/common/thrift/ImpalaService.thrift@601
PS3, Line 601: (1 row per file)
It sounds like, for Parquet/ORC, we could sample a few data files and read the 
number of rows from the metadata section of each. This could probably help 
speed up the pruning process.
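
A minimal sketch of the sampling idea above, not code from the patch. The
class RowCountSamplingSketch, the method estimateFilesNeeded, and the
footerRowCount callback are all hypothetical; the callback stands in for
whatever would read the row count from a Parquet/ORC footer. For simplicity
it samples the first few files rather than a random subset.

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper: estimates how many files are needed to satisfy a LIMIT
// by sampling per-file row counts from file metadata.
class RowCountSamplingSketch {
  /**
   * @param files          paths of the candidate data files
   * @param sampleSize     how many files to read row counts from
   * @param limit          the LIMIT value of the query
   * @param footerRowCount assumed callback returning the row count stored in
   *                       a file's metadata section
   * @return estimated number of files needed; never more than files.size()
   */
  static int estimateFilesNeeded(List<String> files, int sampleSize, long limit,
      ToLongFunction<String> footerRowCount) {
    if (files.isEmpty() || limit <= 0) return 0;
    int n = Math.min(sampleSize, files.size());
    if (n <= 0) return files.size();  // no sample taken, so do not prune

    long sampledRows = 0;
    for (int i = 0; i < n; i++) {
      sampledRows += footerRowCount.applyAsLong(files.get(i));
    }
    if (sampledRows == 0) return files.size();  // sample was empty, do not prune

    // Average rows per sampled file, then round the file count up.
    double avgRowsPerFile = (double) sampledRows / n;
    long needed = (long) Math.ceil(limit / avgRowsPerFile);
    return (int) Math.min(needed, files.size());
  }
}
```

The result is only an estimate, so in this sketch the scan would still have to
enforce the limit itself.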


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@180
PS3, Line 180: order-by
It sounds like we should allow the ORDER BY clause, since it does not 
increase or decrease the number of rows.


http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16723/3/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS3, Line 209: for (FeFsPartition p : partitions) {
             :         numRows += p.getNumFileDescriptors();
             :         prunedPartitions.add(p);
             :         if (numRows >= analyzer.getSimpleLimitStatus().second) {
             :           // here we only limit the partitions; later in HdfsScanNode we will
             :           // limit the file descriptors within a partition
             :           break;
             :         }
I wonder whether the prunedPartitions returned here are the only ones scanned 
at run time. If so, I think that for this optimization to work, we should not 
allow any WHERE clause.

The other point is that we might consider randomly picking a small subset of 
partitions, to reduce the chance of contention from multiple small-limit queries.
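
A rough sketch of the random-pick suggestion, under stated assumptions: the
Partition class and pickPartitions are hypothetical stand-ins, not Impala's
FeFsPartition or HdfsPartitionPruner. As in the loop quoted above, each file
is counted as one row, and partitions are added until the limit is reached;
the only change is an optional shuffle before the loop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SimpleLimitPruningSketch {
  /** Hypothetical stand-in for a partition; only the file count matters here. */
  static final class Partition {
    final String name;
    final int numFiles;

    Partition(String name, int numFiles) {
      this.name = name;
      this.numFiles = numFiles;
    }
  }

  /**
   * Picks partitions until their combined file count reaches the limit,
   * assuming at least one row per file. If rng is non-null the candidates are
   * shuffled first, so concurrent small-limit queries are less likely to pick
   * the same partitions. The input list is not modified.
   */
  static List<Partition> pickPartitions(List<Partition> partitions, long limit,
      Random rng) {
    List<Partition> candidates = new ArrayList<>(partitions);
    if (rng != null) Collections.shuffle(candidates, rng);

    List<Partition> picked = new ArrayList<>();
    long numRows = 0;
    for (Partition p : candidates) {
      numRows += p.numFiles;
      picked.add(p);
      if (numRows >= limit) break;  // enough files to cover the limit
    }
    return picked;
  }
}
```

Passing a null rng keeps the original deterministic order, which may be easier
to test; whether a shuffle is worth the less predictable plans is a judgment
call for the patch author.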



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 17 Nov 2020 17:35:46 +0000
Gerrit-HasComments: Yes
