Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
......................................................................


Patch Set 27:

(3 comments)

Was getting ready to +2 this but just had couple of additional questions and a 
suggestion.

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> On paper, yes we should be able to show the min/max. On the other hand, the
Did we arrive at a consensus on this ?  Adding new query options means we have 
to document them.  It seems option 2 could be deprecated in the future as soon 
as tests are updated so I would vote for not documenting it for now. I am also 
not clear why many tests have to be updated.. if non-parquet formats don't have 
this stats stored in HMS, why would it affect the display ? We could just skip 
the last 2 columns.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:       for (FeFsPartition partition : allPartitions) {
it's unfortunate that we have to iterate through all partitions (could be tens 
of thousands) to check this..and I am not sure if at this point the how many 
rpcs the loadAllPartitions() will make to the catalog server. In a vast 
majority of cases all partitions will be of the same format.  My question is 
..suppose you find at least 1 parquet format and return true for computeMinMax 
..doesn't it just mean that whatever min/max stats is computed will be off 
(similar to stale stats that we discussed earlier) .. so in theory it will 
reduce the effectiveness of the filter but not affect correctness.
I am thinking of use cases with lots of partitions where users have COMPUTE 
STATS in their ETL scripts and suppose they enable compute_min_max_stats, this 
will slow down the compute stats noticeably.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279
PS27, Line 279:   public boolean isParquetBased() {
nit: There's an identical method in HdfsScanNode.java which is currently 
private (see below) but it would be good to either have that method call this 
one or have a public static isParquetBased(HdfsFileFormat) defined somewhere.

https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L387

Also, I suppose HUDI_PARQUET min/max stats should work with this patch ?  I 
don't think a test has been added for that. The fact that the other method in 
HdfsScanNode treats both somewhat equivalent leads me to think it should work.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Mar 2021 06:14:54 +0000
Gerrit-HasComments: Yes

Reply via email to