[Impala-ASF-CR] IMPALA-10709: Min/max filters should be enabled for joins into sorted columns in Parquet tables

Aman Sinha (Code Review) Fri, 21 May 2021 16:43:16 -0700

Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17478 )


Change subject: IMPALA-10709: Min/max filters should be enabled for joins into 
sorted columns in Parquet tables
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@7
PS7, Line 7: enabled for joins into sorted
nit: maybe say 'enabled only for joins on sorted..'


http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@10
PS7, Line 10: addvantage
nit: spelling


http://gerrit.cloudera.org:8080/#/c/17478/7//COMMIT_MSG@10
PS7, Line 10: the
nit: extra 'the'


http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2776
PS7, Line 2776:       sort_by_columns_ =
For non-parquet tables or parquet tables that have not been created with the 
SORT BY clause, the getParameters().get("sort.columns") would return NULL, so 
suggest adding a null check for that.  Also, getParameters() itself may return 
NULL so both checks would be needed.


http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17478/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@750
PS7, Line 750:         if (table != null && table instanceof HdfsTable) {
I think this should also check for ((HdfsTable) table).isParquetTable() .

A more general thought on this is when we talk about the sort by column, here 
we are limiting it to the case when the table itself was created with the SORT 
BY clause.  As opposed to the ColumnOrder property contained in the Parquet 
footer:
   optional list<ColumnOrder> column_orders;     <-- from the FileMetadata 
portion of the footer

So if an external software (e.g Spark) populated the above in the footer, we 
won't end up using it. When we talk about min-max filters for Parquet, we 
always think about the Parquet footer so this would be worth clarifying in the 
commit message. In the future we could be smarter about this e.g by looking at 
the footers of a small sample of files.



--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 7
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 May 2021 23:43:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10709: Min/max filters should be enabled for joins into sorted columns in Parquet tables

Reply via email to