mbutrovich opened a new issue, #22795:
URL: https://github.com/apache/datafusion/issues/22795
## Is your feature request related to a problem or challenge?
The Parquet opener loads the page index (ColumnIndex plus OffsetIndex) for
any file whose scan has a page-pruning predicate, before it knows whether the
page index can prune anything. For predicates that row-group statistics already
resolve, this is pure I/O and parsing overhead that prunes zero pages.
The clearest case is `IS NOT NULL` on a column that has no nulls. In
`datafusion/pruning`, `IS NOT NULL` pruning rewrites to `null_count !=
row_count`, so a container is pruned only when it is entirely null. On a
non-null column no page is ever all-null, so the page index is loaded and
prunes nothing. On a wide fact table scanned with `IS NOT NULL` filters on
non-null join keys, this adds roughly 280 KB of page index per file. Across
tens of thousands of files that is gigabytes of wasted reads.
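The rewrite can be illustrated with a minimal sketch. The helper name `may_contain_non_null` is hypothetical and is not DataFusion's API. It only mirrors the `null_count != row_count` rule as applied to one container (a row group or a page), under the assumption that an unknown null count means nothing can be proven.

```rust
/// Hypothetical helper mirroring the `null_count != row_count` rewrite.
/// Returns `true` when the container must be kept (it may hold a non-null
/// value) and `false` when it is provably all-null and can be pruned.
fn may_contain_non_null(null_count: Option<u64>, row_count: u64) -> bool {
    match null_count {
        // Known null count: prune only when every row is null.
        Some(n) => n != row_count,
        // Unknown null count: nothing is provable, so keep the container.
        None => true,
    }
}
```

For a column with no nulls, every page has a null count of zero, so this check keeps every page and the page index contributes nothing.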
This surfaced downstream in DataFusion Comet (apache/datafusion-comet#3978):
a TPC-DS q88 scan loads about 2.8 GB of page index for `IS NOT NULL` filters on
non-null foreign keys, pruning nothing.
## Describe the solution you'd like
Gate the page index load on whether row-group statistics leave any work for
it to do.
Row-group pruning sorts each row group into one of three buckets:
1. **Pruned**: RG statistics prove no row matches. The whole row group is
dropped and the page index is irrelevant.
2. **Fully matched**: RG statistics prove every row matches. The page index
cannot prune anything (justified below).
3. **Inconclusive**: RG statistics prove neither. Some rows might match and
some might not.
The page index can only prune in bucket 3. Page-index pruning removes a page
if and only if the predicate is provably false for every row on that page. A
page is a subset of the row group's rows. In bucket 2 the predicate is provably
true for every row in the row group and therefore for every row on every page.
No page can consist entirely of non-matching rows, so no page is prunable and
there is nothing left to refine. In bucket 3 some rows may not match, and they
may be concentrated on pages that the page index can isolate, so the page index
can still prune and must be loaded.
So the rule is: **skip the page index load only when every surviving row
group is in bucket 2 (fully matched). A single bucket 3 row group forces the
load.** Note that "row group could not be pruned" is the wrong condition,
because it merges buckets 2 and 3.
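The rule above could be sketched as follows. The `RowGroupVerdict` enum and the `should_load_page_index` function are hypothetical names introduced for illustration; they are not existing DataFusion types.

```rust
/// Outcome of row-group statistics pruning for one row group
/// (hypothetical type, named after the three buckets above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowGroupVerdict {
    /// Bucket 1: statistics prove no row matches.
    Pruned,
    /// Bucket 2: statistics prove every row matches.
    FullyMatched,
    /// Bucket 3: statistics prove neither.
    Inconclusive,
}

/// Returns `true` when the page index should be loaded: at least one
/// row group is inconclusive, so page-level pruning may still help.
fn should_load_page_index(verdicts: &[RowGroupVerdict]) -> bool {
    verdicts
        .iter()
        .any(|v| *v == RowGroupVerdict::Inconclusive)
}
```

Testing for "any inconclusive" rather than "any not pruned" is what keeps buckets 2 and 3 separate.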
DataFusion already computes the relevant signal. PR #21637 added "fully
matched" detection and uses it to skip page-index pruning work for
fully-matched row groups. For `IS NOT NULL`, a row group with `null_count == 0`
is fully matched.
The gap is ordering. The opener state machine
(`datafusion/datasource-parquet/src/opener/mod.rs`) runs:
```
LoadMetadata (footer, PageIndexPolicy::Skip)
-> PrepareFilters
-> LoadPageIndex // page index I/O happens here
-> PruneWithStatistics // row-group stats pruning / fully-matched decided here
-> ...
```
`LoadPageIndex` runs before `PruneWithStatistics`, so the fully-matched
determination that would prove the page index useless happens after the bytes
are already fetched. The existing optimization saves CPU (skips page-index
pruning work) but not I/O.
Proposed change: make the fully-matched determination available before the
page index load, and skip `load_page_index` when every surviving row group is
fully matched by the page-pruning predicate using row-group statistics alone.
Row-group statistics are present in the footer already loaded under
`PageIndexPolicy::Skip`, so no extra I/O is required to make this decision.
Concretely for the `IS NOT NULL` case: skip the load when, for every
referenced column, the row-group statistics report `null_count == Some(0)`.
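A minimal sketch of that `IS NOT NULL` condition, assuming the row-group null counts have already been read from the footer. The function name and the nested-slice input shape are hypothetical and chosen only for illustration.

```rust
/// Hypothetical gate for `IS NOT NULL` predicates.
/// Each inner vector holds, for one surviving row group, the row-group
/// `null_count` of every column the predicate references. `None` means the
/// statistic is absent from the footer.
fn can_skip_page_index_for_is_not_null(row_groups: &[Vec<Option<u64>>]) -> bool {
    row_groups
        .iter()
        // Every referenced column in every row group must be provably
        // null-free; a missing or positive count forces the load.
        .all(|cols| cols.iter().all(|nc| *nc == Some(0)))
}
```

Comparing against `Some(0)` makes a missing statistic fail the check, so the gate falls back to loading the page index when nothing can be proven.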
## Describe alternatives you've considered
- Classify the page-pruning predicate by which statistics it uses
(`StatisticsType` in the pruning predicate's `RequiredColumns`) and skip the
load when it references only `NullCount` / `RowCount` and never `Min` / `Max`.
This is narrower than the fully-matched approach and still needs the row-group
null-count gate. The fully-matched route is preferred because it already
exists and covers more predicate shapes.
- Cache the full metadata including the page index so repeated opens of the
same file pay the load only once. This helps when the page index is actually
useful but does not help the non-selective case, where the cheapest fix is to
not load it at all.
## Additional context
Correctness notes for the gate:
- **Fully matched must be null-aware.** For a predicate that rejects nulls,
such as `x > 50`, fully matched requires `min_value > 50` and `null_count ==
0`. If the null count is positive, an all-null page would be pruned by `x >
50`, so the page index still has value and the load must not be skipped. The
gate is only as correct as the underlying fully-matched computation's null
handling, so it must depend on the null-aware definition. This should be
verified in the #21637 logic before relying on it.
- **Missing statistics fall back to loading.** `Statistics.null_count` is
`optional` in the Parquet thrift spec, and a column chunk may carry no
`Statistics` at all. Treat a missing `null_count` (or missing statistics) as
"not provably zero" and load the page index. The `IS NOT NULL` skip condition
is therefore "statistics present and `null_count == Some(0)` for all referenced
columns," conservatively false otherwise. Modern writers emit row-group
`null_count` in practice, so the common case still benefits.
- **The fully-matched determination must use row-group statistics only**,
never the page index, since the whole point is to decide whether to load the
page index.
- **The change is a reorder of the opener state machine** so that
row-group-stats pruning / fully-matched runs before the page index load. The
staged structs (`FiltersPreparedParquetOpen`, `RowGroupsPrunedParquetOpen`, and
related) need rewiring, and the bloom-filter stage should be checked for any
dependence on the current ordering.
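The first two notes above can be combined into a small sketch of a null-aware fully-matched check for a predicate of the form `x > threshold`. The `ColumnStats` struct and `fully_matched_gt` function are hypothetical, and the sketch assumes an `i64` column. It is not the #21637 logic, whose null handling still needs to be verified separately.

```rust
/// Hypothetical row-group statistics for one `i64` column.
/// `None` means the statistic is missing from the footer.
struct ColumnStats {
    min: Option<i64>,
    null_count: Option<u64>,
}

/// Null-aware "fully matched" check for `x > threshold`.
/// Returns `true` only when statistics prove every row matches:
/// the minimum exceeds the threshold and there are provably no nulls.
fn fully_matched_gt(stats: &ColumnStats, threshold: i64) -> bool {
    // A missing minimum proves nothing, so it counts as not matched.
    let min_above = matches!(stats.min, Some(m) if m > threshold);
    // A missing null count is treated as "not provably zero".
    let no_nulls = stats.null_count == Some(0);
    min_above && no_nulls
}
```

With `min = 60` and a positive null count, the check returns `false`, so an all-null page can still be pruned by the page index.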
Relevant code:
- Opener state machine and stages:
`datafusion/datasource-parquet/src/opener/mod.rs`
- Page index load helper (the `missing_column_index || missing_offset_index`
guard): `load_page_index` in the same file
- Fully-matched page pruning: `PagePruningAccessPlanFilter` in
`datafusion/datasource-parquet/src/page_filter.rs`
- `IS NOT NULL` rewrite to `null_count != row_count`:
`datafusion/pruning/src/pruning_predicate.rs`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]