adriangb opened a new pull request, #22000:
URL: https://github.com/apache/datafusion/pull/22000
## Which issue does this PR close?
- Relates to #21624 (`datafusion.execution.collect_statistics` on wide
  tables).
This PR is independent of (and lands cleanly without) the API
proposal in #21996 — they're orthogonal building blocks.
## Rationale for this change
DataFusion has the *machinery* for fine-grained parquet sampling
(`ParquetAccessPlan` with `Skip` / `Scan` / `Selection(RowSelection)`)
but no public way to ask for a sample without constructing the access
plan by hand and stuffing it into `PartitionedFile.extensions`. That
works for one-off code but is awkward for:
* `TABLESAMPLE` SQL (any future implementation can lower to these
  primitives instead of duplicating the row-group / page selection
  logic).
* Ad-hoc data exploration: "give me 1% of this dataset" via the
  CLI / DataFrame API.
* Layered helpers that want to compute approximate stats over a
  bounded slice of data without scanning everything (e.g. an
  optimizer-fed sampled-stats helper; see the linked POC).
* `EXPLAIN ANALYZE`-driven debug runs against a representative
  slice instead of the full table.
This PR adds two opt-in sampling builders on `ParquetSource`, plus a
setter for the window size used by row-level sampling. Existing scans
are unchanged when none of them is called.
## What changes are included in this PR?
### Public API on `ParquetSource`
```rust
ParquetSource::new(schema)
    .with_row_group_sampling(0.1) // keep ~10% of row groups per file
    .with_row_fraction(0.05)      // within each kept row group, keep ~5% of rows
    .with_row_cluster_size(8192); // controls window granularity (default 32768)
```
`with_row_group_sampling(fraction)`:
* Selection is deferred until the opener has loaded the parquet
  footer, so we sample by real row-group index.
* Deterministic per `(file_name, row_group_count, fraction)`, so
  re-runs match.
* Always keeps at least one row group
  (target = `max(1, ceil(N * fraction))`).
* No-op when `fraction >= 1.0`.
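
The selection rule above can be sketched roughly as follows. This is a minimal, std-only illustration, not the PR's `apply_row_group_sampling`: the function names are hypothetical, and the hand-rolled FNV-1a seed and SplitMix64 generator are assumptions standing in for the `rand` / `small_rng` machinery the PR actually uses.

```rust
/// Hypothetical seed: FNV-1a over (file_name, row_group_count, fraction).
/// Assumption: the PR derives its seed from the same inputs in some way.
fn seed_for(file_name: &str, row_group_count: usize, fraction: f64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    let bytes = file_name
        .bytes()
        .chain((row_group_count as u64).to_le_bytes())
        .chain(fraction.to_bits().to_le_bytes());
    for b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// SplitMix64 step: a small deterministic generator (stand-in for `SmallRng`).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9e37_79b9_7f4a_7c15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// target = max(1, ceil(N * fraction)), capped at N; no-op when fraction >= 1.0.
fn row_group_target(row_group_count: usize, fraction: f64) -> usize {
    if row_group_count == 0 {
        return 0;
    }
    if fraction >= 1.0 {
        return row_group_count;
    }
    let target = (row_group_count as f64 * fraction).ceil() as usize;
    target.clamp(1, row_group_count)
}

/// Returns the sorted indices of the row groups to keep.
fn sample_row_groups(file_name: &str, row_group_count: usize, fraction: f64) -> Vec<usize> {
    let target = row_group_target(row_group_count, fraction);
    let mut indices: Vec<usize> = (0..row_group_count).collect();
    if target == row_group_count {
        return indices;
    }
    let mut state = seed_for(file_name, row_group_count, fraction);
    // Partial Fisher-Yates shuffle: only the first `target` slots are drawn.
    // The modulo introduces a negligible bias, which is fine for a sketch.
    for i in 0..target {
        let remaining = (row_group_count - i) as u64;
        let j = i + (splitmix64(&mut state) % remaining) as usize;
        indices.swap(i, j);
    }
    indices.truncate(target);
    indices.sort_unstable();
    indices
}
```

Because the seed depends only on the file name, the row-group count, and the fraction, calling `sample_row_groups` twice with the same arguments yields the same indices, which is the property the determinism bullet describes.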
`with_row_fraction(fraction)`:
* Translates the per-row-group target into K contiguous windows
  spread evenly across the row group, each placed at a random
  offset within its stride. Window count =
  `ceil(target / cluster_size)`.
* Materializes a `RowSelection` per kept row group; the parquet
  reader uses the page index to read only the data pages covering
  the selected rows. This gives "page-level" IO savings without
  requiring per-column page alignment (which doesn't exist in
  parquet).
* Falls back gracefully when the page index is missing: the reader
  still returns the right rows, the IO win just disappears.
* Deterministic per `(file_name, row_group_index, fraction, cluster_size)`.
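
As a rough illustration of the window placement described above, the sketch below turns a row count, a fraction, and a cluster size into alternating skip/select runs, which is the shape a `RowSelection` is built from. Everything here is an assumption made for illustration: the `Run` enum and `window_runs` are hypothetical names, the target is taken to be `ceil(rows * fraction)`, each window is capped at its stride, and SplitMix64 stands in for the PR's `rand`-based generator.

```rust
/// Hypothetical stand-in for a skip/select run of a row selection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// SplitMix64 step: a small deterministic generator (stand-in for `SmallRng`).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9e37_79b9_7f4a_7c15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Places ceil(target / cluster_size) windows, one per stride, each at a
/// pseudo-random offset inside its stride, and returns skip/select runs.
fn window_runs(row_count: usize, fraction: f64, cluster_size: usize, seed: u64) -> Vec<Run> {
    if row_count == 0 {
        return Vec::new();
    }
    if fraction >= 1.0 {
        return vec![Run::Select(row_count)];
    }
    // Assumed target: ceil(rows * fraction), capped at the row count.
    let target = ((row_count as f64 * fraction).ceil() as usize).min(row_count);
    if target == 0 {
        return vec![Run::Skip(row_count)];
    }
    let cluster = cluster_size.max(1);
    let windows = 1 + (target - 1) / cluster; // ceil(target / cluster)
    let stride = row_count / windows; // >= 1 because windows <= row_count
    let window_len = cluster.min(stride);
    let slack = (stride - window_len) as u64;

    let mut state = seed;
    let mut runs = Vec::new();
    let mut cursor = 0usize;
    for w in 0..windows {
        let offset = (splitmix64(&mut state) % (slack + 1)) as usize;
        let start = w * stride + offset;
        if start > cursor {
            runs.push(Run::Skip(start - cursor));
        }
        // Merge with the previous window when the two happen to be adjacent.
        if let Some(Run::Select(n)) = runs.last_mut() {
            *n += window_len;
        } else {
            runs.push(Run::Select(window_len));
        }
        cursor = start + window_len;
    }
    if cursor < row_count {
        runs.push(Run::Skip(row_count - cursor));
    }
    runs
}
```

Under these assumptions, 100 rows with `fraction = 0.1` and `cluster_size = 4` give a target of 10 rows, hence 3 windows of 4 rows each. Whole windows are selected, so the sample can slightly overshoot the target (12 rows here).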
The two layers compose: `row_group_fraction = 0.1` × `row_fraction =
0.1` keeps ~10% of the rows in ~10% of the row groups, i.e. ~1% of the
rows overall, with windows spread out so the sample isn't clustered at
one end of each row group.
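
To make the composition arithmetic concrete, here is a small back-of-the-envelope helper. It is a hypothetical function, not part of the PR; it assumes equally sized row groups and ignores the rounding to whole windows that row-level sampling introduces.

```rust
/// Approximate number of rows returned when both sampling layers are active.
/// Assumes every row group holds `rows_per_group` rows.
fn expected_sampled_rows(
    row_groups: usize,
    rows_per_group: usize,
    row_group_fraction: f64,
    row_fraction: f64,
) -> f64 {
    if row_groups == 0 {
        return 0.0;
    }
    // Row-group layer: max(1, ceil(N * fraction)), capped at N.
    let kept_groups = (row_groups as f64 * row_group_fraction)
        .ceil()
        .clamp(1.0, row_groups as f64);
    // Row layer: a fraction of the rows inside each kept row group.
    kept_groups * rows_per_group as f64 * row_fraction.clamp(0.0, 1.0)
}
```

For a file with 100 row groups of 10,000 rows each, `expected_sampled_rows(100, 10_000, 0.1, 0.1)` comes out at about 10,000 rows, i.e. roughly 1% of the 1,000,000 rows in the file.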
### Why no `with_file_sampling`?
`ParquetSource` doesn't own the file list; that lives on
`FileScanConfig.file_groups`. A file-fraction setter here would have
been a no-op. Callers that want to drop files should rebuild the
`FileScanConfig` with a reduced `file_groups` directly.
### Internals
* New `ParquetSampling` struct (re-exported at the crate root).
* Plumbed through `ParquetMorselizer` → `PreparedParquetOpen`.
* Two free functions in `opener.rs`, `apply_row_group_sampling` and
  `apply_row_fraction_sampling`, invoked from `prune_row_groups`
  right after `create_initial_plan`.
* New dep: `rand` with the `small_rng` feature (already in workspace
  Cargo.toml).
## Are these changes tested?
7 tests in `datafusion-datasource-parquet::opener::test`:
* `row_group_sampling_keeps_target_count`: `ceil(N * fraction)` math.
* `row_group_sampling_is_deterministic`: same inputs, same selection.
* `row_group_sampling_differs_per_file`: different file_name, different
  sample.
* `row_group_sampling_no_op_when_fraction_is_one`: fraction >= 1.0 keeps
  everything.
* `row_group_sampling_target_at_least_one`: `fraction = 0.001` over 100
  row groups still keeps 1.
* `row_group_sampling_end_to_end`: writes a 4-row-group parquet to
  `InMemory`, scans with `fraction = 0.5`, asserts exactly 6 rows out
  (2 row groups × 3 rows).
* `row_fraction_end_to_end`: writes a 100-row single-row-group parquet,
  scans with `row_fraction = 0.1` and `cluster_size = 4`, asserts the
  result is in the `(1, 16]` range.

`cargo build --workspace`, `cargo fmt --all`, and
`cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
are clean.
## Are there any user-facing changes?
API additions only, all opt-in. No existing callers see behavior
changes. Public items added: `ParquetSampling` (struct),
`ParquetSource::with_row_group_sampling`,
`ParquetSource::with_row_fraction`,
`ParquetSource::with_row_cluster_size`,
`ParquetSource::sampling()`.
## Companion work
A consumer-side sketch (a sampled-stats helper that uses these
primitives to fill `Absent` slots in the [stats-request /
stats-response API #21996](https://github.com/apache/datafusion/pull/21996))
lives on
[`worktree-stats-mini-query-poc`](https://github.com/pydantic/datafusion/tree/worktree-stats-mini-query-poc).
That work is **not** in scope here; this PR is just the parquet
sampling primitives. They stand on their own (e.g. for `TABLESAMPLE`)
regardless of whether #21996 lands.