Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/17478 )
Change subject: IMPALA-10709: Min/max filters should be enabled for joins into sorted columns in Parquet tables ...................................................................... Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@7 PS7, Line 7: enabled for joins into sorted nit: maybe say 'enabled only for joins on sorted..' http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@10 PS7, Line 10: addvantage nit: spelling http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@10 PS7, Line 10: the nit: extra 'the' http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2776 PS7, Line 2776: sort_by_columns_ = For non-parquet tables or parquet tables that have not been created with the SORT BY clause, the getParameters().get("sort.columns") would return NULL, so suggest adding a null check for that. Also, getParameters() itself may return NULL so both checks would be needed. http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@750 PS7, Line 750: if (table != null && table instanceof HdfsTable) { I think this should also check for ((HdfsTable) table).isParquetTable() . A more general thought on this is when we talk about the sort by column, here we are limiting it to the case when the table itself was created with the SORT BY clause. As opposed to the ColumnOrder property contained in the Parquet footer: optional list<ColumnOrder> column_orders; <-- from the FileMetadata portion of the footer So if an external software (e.g Spark) populated the above in the footer, we won't end up using it. When we talk about min-max filters for Parquet, we always think about the Parquet footer so this would be worth clarifying in the commit message. In the future we could be smarter about this e.g by looking at the footers of a small sample of files. -- To view, visit http://gerrit.cloudera.org:8080/17478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963 Gerrit-Change-Number: 17478 Gerrit-PatchSet: 7 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Comment-Date: Fri, 21 May 2021 23:43:10 +0000 Gerrit-HasComments: Yes