kosiew opened a new pull request, #19545:
URL: https://github.com/apache/datafusion/pull/19545

   ## Which issue does this PR close?
   
   * Closes #18560.
   
   ## Rationale for this change
   
   DataFusion’s Parquet row-level filter pushdown previously rejected all 
nested Arrow types (lists/structs), which prevented common and 
performance-sensitive filters on list columns (for example `array_has`, 
`array_has_all`, `array_has_any`) from being evaluated during Parquet decoding.
   
   Enabling safe pushdown for a small, well-defined set of list-aware 
predicates allows Parquet decoding to apply these filters earlier, reducing 
materialization work and improving scan performance, while still keeping 
unsupported nested projections (notably structs) evaluated after batches are 
materialized.
   
   ## What changes are included in this PR?
   
   * Allow a registry of list-aware predicates to be considered 
pushdown-compatible:
   
     * `array_has`, `array_has_all`, `array_has_any`
     * `IS NULL` / `IS NOT NULL`
   * Introduce a `supported_predicates` module that detects whether an 
expression tree contains supported list predicates.
   * Update Parquet filter candidate selection to:
   
     * Accept list columns only when the predicate semantics are supported.
     * Continue rejecting struct columns (and other unsupported nested types).
   * Switch Parquet projection mask construction from root indices to **leaf 
indices** (`ProjectionMask::leaves`) so nested list filters project the correct 
leaf columns for decoding-time evaluation.
   * Expand root column indices to leaf indices for nested columns using the 
Parquet `SchemaDescriptor`.
   * Add unit tests verifying:
   
     * List columns are accepted for pushdown when used by supported predicates.
     * Struct columns (and mixed struct+primitive predicates) prevent pushdown.
     * `array_has`, `array_has_all`, `array_has_any` filter rows during 
decoding, verified against a temporary Parquet file.
   * Add sqllogictest coverage proving both correctness and plan behavior:
   
     * Queries return expected results.
     * `EXPLAIN` shows predicates pushed into `DataSourceExec` for Parquet.
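
The candidate-selection rules in the list above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `Expr` enum, `ColumnKind`, `contains_supported_list_predicate`, and `is_pushdown_candidate` are hypothetical names invented here, and the real code works on DataFusion's physical expressions and Arrow data types rather than on this simplified tree.

```rust
/// Hypothetical, simplified expression tree (an assumption for illustration;
/// this is not DataFusion's expression type).
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(String),
    ScalarFunction { name: String, args: Vec<Expr> },
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Function names treated as list-aware, as listed in the PR description.
const SUPPORTED_LIST_FUNCTIONS: [&str; 3] = ["array_has", "array_has_all", "array_has_any"];

/// Returns true if any node in the tree is one of the supported predicates.
pub fn contains_supported_list_predicate(expr: &Expr) -> bool {
    match expr {
        Expr::IsNull(_) | Expr::IsNotNull(_) => true,
        Expr::ScalarFunction { name, args } => {
            SUPPORTED_LIST_FUNCTIONS.contains(&name.as_str())
                || args.iter().any(contains_supported_list_predicate)
        }
        Expr::And(left, right) | Expr::Or(left, right) => {
            contains_supported_list_predicate(left) || contains_supported_list_predicate(right)
        }
        Expr::Column(_) | Expr::Literal(_) => false,
    }
}

/// Hypothetical classification of the columns a predicate references.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnKind {
    Primitive,
    List,
    Struct,
}

/// Decides pushdown eligibility from the kinds of the referenced columns:
/// any struct column rejects the predicate, list columns are accepted only
/// with a supported predicate, and purely primitive predicates are accepted.
pub fn is_pushdown_candidate(expr: &Expr, referenced: &[ColumnKind]) -> bool {
    if referenced.contains(&ColumnKind::Struct) {
        return false;
    }
    if referenced.contains(&ColumnKind::List) {
        return contains_supported_list_predicate(expr);
    }
    true
}
```

Under these assumptions, an `array_has` call over a list column is accepted, while the same predicate combined with a struct column is rejected, which mirrors the unit-test cases described above.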
   
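
The root-to-leaf expansion mentioned above can be illustrated with a small standalone sketch. It assumes the leaf-to-root mapping is already available as a slice (in the PR that information comes from the Parquet `SchemaDescriptor`); the function name `expand_roots_to_leaves` and the example schema are hypothetical.

```rust
/// Expands root (top-level) column indices into leaf column indices.
///
/// `leaf_to_root[i]` is the index of the root column that owns leaf `i`.
/// This slice is an assumption standing in for the schema descriptor.
pub fn expand_roots_to_leaves(root_indices: &[usize], leaf_to_root: &[usize]) -> Vec<usize> {
    leaf_to_root
        .iter()
        .enumerate()
        // Keep every leaf whose owning root was requested.
        .filter(|&(_, root)| root_indices.contains(root))
        .map(|(leaf, _)| leaf)
        .collect()
}
```

For a hypothetical schema `id: int64, items: list<struct<x, y>>, name: utf8`, the leaves are `id`, `items.x`, `items.y`, and `name`, so `leaf_to_root` is `[0, 1, 1, 3 - 1]`, that is `[0, 1, 1, 2]`. Requesting root `1` then yields leaves `[1, 2]`, which is why a mask built from leaf indices can differ from one built from root indices once nested columns are involved.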
   ## Are these changes tested?
   
   Yes.
   
   * Rust unit tests in `datafusion/datasource-parquet/src/row_filter.rs`:
   
     * Validate pushdown eligibility for list vs struct predicates.
     * Create a temp Parquet file and confirm list predicates prune/match the 
expected rows via Parquet decoding row filtering.
   * SQL logic tests in 
`datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt`:
   
     * Add end-to-end coverage for `array_has`, `array_has_all`, 
`array_has_any` and combinations (OR / AND with other predicates).
     * Confirm pushdown appears in the physical plan (`DataSourceExec ... 
predicate=...`).
   
   ## Are there any user-facing changes?
   
   Yes.
   
   * Parquet filter pushdown now supports list columns for the following 
predicates:
   
     * `array_has`, `array_has_all`, `array_has_any`
     * `IS NULL`, `IS NOT NULL`
   
   This can improve query performance for workloads that filter on array/list 
columns.
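
As a rough model of what these predicates select, the sketch below applies them as row filters over an in-memory list column. It is a simplification under stated assumptions: lists are plain `Vec<i64>` values with no nulls, `filter_rows` is a hypothetical helper, and none of this reflects how DataFusion or the Parquet reader actually evaluates the predicates.

```rust
/// True if `haystack` contains `needle`.
pub fn array_has(haystack: &[i64], needle: i64) -> bool {
    haystack.contains(&needle)
}

/// True if every element of `needles` appears in `haystack`.
pub fn array_has_all(haystack: &[i64], needles: &[i64]) -> bool {
    needles.iter().all(|n| haystack.contains(n))
}

/// True if at least one element of `needles` appears in `haystack`.
pub fn array_has_any(haystack: &[i64], needles: &[i64]) -> bool {
    needles.iter().any(|n| haystack.contains(n))
}

/// Returns the indices of the rows whose list value satisfies `predicate`.
/// With pushdown, rows failing the predicate are dropped during decoding
/// instead of after the full batch has been materialized.
pub fn filter_rows<F>(rows: &[Vec<i64>], predicate: F) -> Vec<usize>
where
    F: Fn(&[i64]) -> bool,
{
    rows.iter()
        .enumerate()
        .filter(|&(_, row)| predicate(row.as_slice()))
        .map(|(index, _)| index)
        .collect()
}
```

For example, with rows `[1, 2, 3]`, `[4, 5]`, `[]`, and `[2, 5]`, filtering with `array_has(row, 2)` keeps rows 0 and 3, and `array_has_all(row, &[2, 5])` keeps only row 3.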
   
   No breaking changes are introduced; unsupported nested types (for example 
structs) continue to be evaluated after decoding.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

