[jira] [Created] (PARQUET-2210) Add FilteredPageReader to filter rows based on page statistics

fatemah (Jira) Mon, 31 Oct 2022 09:41:07 -0700

fatemah created PARQUET-2210:
--------------------------------

             Summary: Add FilteredPageReader to filter rows based on page 
statistics
                 Key: PARQUET-2210
                 URL: https://issues.apache.org/jira/browse/PARQUET-2210
             Project: Parquet
          Issue Type: New Feature
            Reporter: fatemah



Currently, we do not use the statistics stored in the page headers to prune 
the rows that we read. Row group pruning is very coarse-grained and in many 
cases does not prune the row group. I propose adding a FilteredPageReader that 
accepts a filter and does not return pages whose statistics show that they 
cannot match the filter.

The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
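
To make the pruning rule concrete, here is a minimal sketch of how the three 
initial filters could be evaluated against page statistics. It is only an 
illustration, not the proposed implementation: PageStats, FilterOp and 
page_may_match are hypothetical names, and the sketch assumes the page header 
exposes a value count (including nulls), an optional null count, and optional 
min/max values. A page is dropped only when the statistics prove that no row 
can match; missing statistics always keep the page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class FilterOp(Enum):
    EQUALS = "equals"
    IS_NULL = "is_null"
    IS_NOT_NULL = "is_not_null"


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics in a data page header."""
    num_values: int                    # assumed to include nulls
    null_count: Optional[int] = None   # None means "not recorded"
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None


def page_may_match(stats: Optional[PageStats], op: FilterOp, value: Any = None) -> bool:
    """Return False only when the statistics prove no row in the page matches."""
    if stats is None:
        return True  # no statistics: the page cannot be pruned
    if op is FilterOp.IS_NULL:
        # Prune only if the page is known to contain no nulls.
        return stats.null_count is None or stats.null_count > 0
    if op is FilterOp.IS_NOT_NULL:
        # Prune only if the page is known to contain nothing but nulls.
        return stats.null_count is None or stats.null_count < stats.num_values
    if op is FilterOp.EQUALS:
        if stats.null_count is not None and stats.null_count >= stats.num_values:
            return False  # all values are null, nothing can equal `value`
        if stats.min_value is None or stats.max_value is None:
            return True   # no min/max recorded: keep the page
        return stats.min_value <= value <= stats.max_value
    raise ValueError(f"unsupported filter: {op}")
```

For example, page_may_match(PageStats(10, 0, 1, 4), FilterOp.EQUALS, 5) is 
False because 5 lies outside [1, 4], so that page would not be returned.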

Also, the FilteredPageReader will keep track of which row ranges matched and 
which did not. We could use this to skip reading the non-matching rows from the 
rest of the columns. Note that the SkipRecords API was recently added to the 
Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
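
The row-range bookkeeping could look roughly like the sketch below. It is an 
assumption-laden illustration: matched_row_ranges and skip_read_plan are 
hypothetical helpers, the column is assumed to be flat (one value per row), and 
the per-page row counts are assumed to be known. Ranges are expressed in row 
indices rather than page indices because page boundaries generally differ 
between columns.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open [start, end) row indices


def matched_row_ranges(page_row_counts: List[int],
                       page_matches: List[bool]) -> List[RowRange]:
    """Merge the rows of consecutive matching pages into row ranges."""
    ranges: List[RowRange] = []
    row = 0
    for count, matched in zip(page_row_counts, page_matches):
        start, end = row, row + count
        row = end
        if not matched or count == 0:
            continue
        if ranges and ranges[-1][1] == start:
            # Adjacent to the previous matching range: extend it.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


def skip_read_plan(ranges: List[RowRange], total_rows: int) -> List[Tuple[str, int]]:
    """Turn matching row ranges into ("skip", n) / ("read", n) steps."""
    plan: List[Tuple[str, int]] = []
    pos = 0
    for start, end in ranges:
        if start > pos:
            plan.append(("skip", start - pos))
        plan.append(("read", end - start))
        pos = end
    if pos < total_rows:
        plan.append(("skip", total_rows - pos))
    return plan


ranges = matched_row_ranges([100, 50, 50, 200], [True, False, True, True])
plan = skip_read_plan(ranges, total_rows=400)
```

Here ranges is [(0, 100), (150, 400)] and plan is [("read", 100), ("skip", 50), 
("read", 250)]. Each "skip" step is where a reader for another column could 
call a record-skipping API such as the SkipRecords one mentioned above.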



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
