[ https://issues.apache.org/jira/browse/HIVE-21599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635377#comment-17635377 ]
Stamatis Zampetakis commented on HIVE-21599:
--------------------------------------------

The solution based on {{ReadContext#getRequestedSchema}} creates some other problems when the schema of the table evolves. My assumption was that {{getRequestedSchema}} always returns a subset of the columns of the original schema (i.e., {{FileMetaData#getSchema}}). This is not true, since {{getRequestedSchema}} is also used to handle schema evolution (and not only column pruning).

+Example:+
{code:sql}
create table person (id int, fname string, lname string, age int) stored as parquet;
{code}
+FileMetaData#getSchema+
{noformat}
message hive_schema {
  optional int32 id;
  optional binary fname (STRING);
  optional binary lname (STRING);
  optional int32 age;
}
{noformat}
{code:sql}
select fname from person where age >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
  optional binary fname (STRING);
  optional int32 age;
}
{noformat}
{code:sql}
ALTER TABLE person CHANGE COLUMN age years_from_birth int;
select fname from person where years_from_birth >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
  optional binary fname (STRING);
  optional binary years_from_birth;
}
{noformat}
Observe that after renaming the column, the result of {{getRequestedSchema}} is not a subset of {{FileMetaData#getSchema}}, and the {{years_from_birth}} column does not appear in the file.

Creating a Parquet filter predicate for a column that does not actually exist in the file can cause various problems. For instance, Parquet [tries to determine which blocks are matching the filter|https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L103], and if the filter column does not appear in the block, it can wrongly derive that the block does not have data for the filtering predicate.
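
To make the broken subset assumption concrete, here is a minimal plain-Java sketch. It is not Hive or Parquet code: the class {{RequestedSchemaCheck}}, its methods, and the modelling of a schema as a map from column name to primitive type name are all hypothetical and only mirror the three schemas shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive or Parquet. A schema is modelled as
// an ordered map from column name to Parquet primitive type name.
class RequestedSchemaCheck {

    // Schema stored in the file (FileMetaData#getSchema in the example).
    static Map<String, String> fileSchema() {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int32");
        schema.put("fname", "binary");
        schema.put("lname", "binary");
        schema.put("age", "int32");
        return schema;
    }

    // Requested schema for the query issued before the rename.
    static Map<String, String> requestedBeforeRename() {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("fname", "binary");
        schema.put("age", "int32");
        return schema;
    }

    // Requested schema for the query issued after the rename.
    static Map<String, String> requestedAfterRename() {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("fname", "binary");
        schema.put("years_from_birth", "binary");
        return schema;
    }

    // Returns the requested columns that are absent from the file schema
    // or present there with a different primitive type.
    static List<String> columnsNotInFile(Map<String, String> file,
                                         Map<String, String> requested) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> column : requested.entrySet()) {
            String typeInFile = file.get(column.getKey()); // null if absent
            if (!column.getValue().equals(typeInFile)) {
                result.add(column.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Before the rename the requested schema is a subset: prints []
        System.out.println(columnsNotInFile(fileSchema(), requestedBeforeRename()));
        // After the rename it is not: prints [years_from_birth]
        System.out.println(columnsNotInFile(fileSchema(), requestedAfterRename()));
    }
}
```

A non-empty result marks the columns for which a filter predicate built from the requested schema would refer to something the file does not contain.
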
Moreover, after the rename the Parquet types are not retained (int32 vs binary), which can cause problems as well when creating the filter predicate.

All in all, relying on {{getRequestedSchema}} to build the filter predicate is not possible at this stage.

> Parquet predicate pushdown on partition columns may cause wrong result if
> files contain partition columns
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21599
>                 URL: https://issues.apache.org/jira/browse/HIVE-21599
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>            Reporter: Vineet Garg
>            Assignee: Soumyakanti Das
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21599.1.patch
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Filter predicates are pushed to Table Scan (to be pushed to and used by
> storage handler/input format). Such predicates could consist of partition
> columns which are of no use to storage handler or input formats. Therefore
> it should be removed from TS filter expression.