Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17075 )
Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters ...................................................................... Patch Set 27: (3 comments) Was getting ready to +2 this but just had couple of additional questions and a suggestion. http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19 PS19, Line 19: show_column_minmax_stats > On paper, yes we should be able to show the min/max. On the other hand, the Did we arrive at a consensus on this ? Adding new query options means we have to document them. It seems option 2 could be deprecated in the future as soon as tests are updated so I would vote for not documenting it for now. I am also not clear why many tests have to be updated.. if non-parquet formats don't have this stats stored in HMS, why would it affect the display ? We could just skip the last 2 columns. http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260 PS27, Line 260: for (FeFsPartition partition : allPartitions) { it's unfortunate that we have to iterate through all partitions (could be tens of thousands) to check this..and I am not sure if at this point the how many rpcs the loadAllPartitions() will make to the catalog server. In a vast majority of cases all partitions will be of the same format. My question is ..suppose you find at least 1 parquet format and return true for computeMinMax ..doesn't it just mean that whatever min/max stats is computed will be off (similar to stale stats that we discussed earlier) .. so in theory it will reduce the effectiveness of the filter but not affect correctness. I am thinking of use cases with lots of partitions where users have COMPUTE STATS in their ETL scripts and suppose they enable compute_min_max_stats, this will slow down the compute stats noticeably. http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java: http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279 PS27, Line 279: public boolean isParquetBased() { nit: There's an identical method in HdfsScanNode.java which is currently private (see below) but it would be good to either have that method call this one or have a public static isParquetBased(HdfsFileFormat) defined somewhere. https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L387 Also, I suppose HUDI_PARQUET min/max stats should work with this patch ? I don't think a test has been added for that. The fact that the other method in HdfsScanNode treats both somewhat equivalent leads me to think it should work. -- To view, visit http://gerrit.cloudera.org:8080/17075 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df Gerrit-Change-Number: 17075 Gerrit-PatchSet: 27 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 25 Mar 2021 06:14:54 +0000 Gerrit-HasComments: Yes