Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16792 )

Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................


Patch Set 3:

(5 comments)

Thanks a lot for the explanation. Two additional questions follow :-).

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java
File fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java@79
PS2, Line 79:  if ((op == Operator.AND &&
            :           (Expr.IS_ALWAYS_TRUE_PREDICATE.apply(e1) &&
            :               Expr.IS_ALWAYS_TRUE_P
> Makes sense to handle OR for completeness.  I have added it.  One thing tha
Done


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/Expr.java@520
PS2, Line 520: if (conjuncts.size() > 1) {
> For the single conjunct case it should have already been set on the line 51
Done


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@223
PS2, Line 223: if (getTableRefs().size() == 1)
> Since this is a qualifying check to decide whether this query block is elig
By looking at the following view DDL, I have the impression that the 
convert_limit_to_sample hint is only for table alltypes_date_partition_2, not 
for table alltypessmall. It is about right?

---- DATASET
functional
---- BASE_TABLE_NAME
alltypes_dp_2_view_2
---- CREATE
DROP VIEW IF EXISTS {db_name}{db_suffix}.{table_name};
-- view which references a table with hint and a WHERE clause with hint.
-- WHERE clause has a compound predicate.
CREATE VIEW {db_name}{db_suffix}.{table_name}
AS SELECT * FROM {db_name}{db_suffix}.alltypes_date_partition_2 
[convert_limit_to_sample]
where [always_true] date_col = cast(timestamp_col as date) and int_col in 
(select bigint_col from functional.alltypessmall);
---- LOAD

Looks like making it a table level hint helps provide extra level of safety.


http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
File fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/16792/2/fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java@209
PS2, Line 209: estimatedTotalRows
> Actually, the getTable().getNumRows() is the estimated row count.  If stats
Okay. It seems getTable().getNumRows() returns the raw row count as recorded in 
HMS for the table. In that case, your code ignores a table if the stats is 
missing or corrupt, which is good.

311   public void setTableStats(org.apache.hadoop.hive.metastore.api.Table 
msTbl) {
312     tableStats_ = new 
TTableStats(FeCatalogUtils.getRowCount(msTbl.getParameters()));
313     
tableStats_.setTotal_file_bytes(FeCatalogUtils.getTotalSize(msTbl.getParameters()));
314   }                                                                         
               
315
catalog/Table.java

The current approach computes the sampling percentage automatically and has a 
lower bound of 1%. Since simple limit optimization reduces # of partitions to 
be scanned, I wonder if the sampling rate would still hold on the surviving 
partitions. For example, there are 4 partitions, p1, p2, p3, p4. The amount of 
data of the total in each partition: 20% (p1), 20% (p2), 50% (p3) and 10% (p4). 
Assume p4 is surviving. The for a limit that is close to #rows in p4, I assume 
we need almost 100% sample rate.


http://gerrit.cloudera.org:8080/#/c/16792/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test:

http://gerrit.cloudera.org:8080/#/c/16792/2/testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test@275
PS2, Line 275: select * from functional.alltypes_dp_2_view_2 limit 10;
> Yup..makes sense. I added a query at the end that is against the same base
Done



--
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Dec 2020 15:47:38 +0000
Gerrit-HasComments: Yes

Reply via email to