adriangb opened a new pull request, #22000:
URL: https://github.com/apache/datafusion/pull/22000

   ## Which issue does this PR close?
   
   - Relates to #21624 (`datafusion.execution.collect_statistics` on wide tables).
     This PR is independent of (and lands cleanly without) the API
     proposal in #21996; the two are orthogonal building blocks.
   
   ## Rationale for this change
   
   DataFusion has the *machinery* for fine-grained parquet sampling
   (`ParquetAccessPlan` with `Skip` / `Scan` / `Selection(RowSelection)`)
   but no public way to ask for a sample without constructing the access
   plan by hand and stuffing it into `PartitionedFile.extensions`. That
   works for one-off code but is awkward for:
   
   * `TABLESAMPLE` SQL (any future implementation can lower to these
     primitives instead of duplicating the row-group / page selection
     logic).
   * Ad-hoc data exploration: "give me 1% of this dataset" via the
     CLI / DataFrame API.
   * Layered helpers that want to compute approximate stats over a
     bounded slice of data without scanning everything (e.g. an
     optimizer-fed sampled-stats helper; see the linked POC).
   * `EXPLAIN ANALYZE`-driven debug runs against a representative
     slice instead of the full table.
   
   This PR adds two opt-in sampling builders on `ParquetSource`
   (row-group sampling and row-fraction sampling), plus a setter for the
   window size. Existing scans are unchanged when none of them is called.
   
   ## What changes are included in this PR?
   
   ### Public API on `ParquetSource`

   ```rust
   ParquetSource::new(schema)
       .with_row_group_sampling(0.1)   // keep ~10% of row groups per file
       .with_row_fraction(0.05)        // within each kept row group, keep ~5% of rows
       .with_row_cluster_size(8192);   // controls window granularity (default 32768)
   ```
   
   `with_row_group_sampling(fraction)`:

   * Selection is deferred until the opener has loaded the parquet
     footer, so we sample by real row-group index.
   * Deterministic per `(file_name, row_group_count, fraction)`, so
     re-runs match.
   * Always keeps at least one row group (target = `max(1, ceil(N * fraction))`,
     where `N` is the file's row-group count).
   * No-op when `fraction >= 1.0`.
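
The selection rule above can be sketched with the standard library alone. This is an illustration, not the code in this PR: the function names are hypothetical, the PR itself uses `rand` with `small_rng`, and the sketch swaps in `DefaultHasher` plus a SplitMix64 generator so that it has no dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of row groups to keep: max(1, ceil(n * fraction)), capped at n.
fn target_row_groups(n: usize, fraction: f64) -> usize {
    if n == 0 {
        return 0;
    }
    if fraction >= 1.0 {
        return n; // no-op: keep everything
    }
    ((n as f64 * fraction).ceil() as usize).clamp(1, n)
}

/// Seed derived from the inputs that define the sample, so re-runs match.
/// Assumption: DefaultHasher is only stable within one Rust release, which
/// is enough for a sketch but not for samples persisted across toolchains.
fn seed_for(file_name: &str, row_group_count: usize, fraction: f64) -> u64 {
    let mut h = DefaultHasher::new();
    file_name.hash(&mut h);
    row_group_count.hash(&mut h);
    fraction.to_bits().hash(&mut h);
    h.finish()
}

/// SplitMix64: a tiny deterministic generator (stand-in for `SmallRng`).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Sorted indices of the row groups to scan for this file.
fn sample_row_groups(file_name: &str, n: usize, fraction: f64) -> Vec<usize> {
    let k = target_row_groups(n, fraction);
    let mut indices: Vec<usize> = (0..n).collect();
    let mut state = seed_for(file_name, n, fraction);
    // Partial Fisher-Yates shuffle: after k steps the first k slots hold
    // an (approximately uniform) sample without replacement.
    for i in 0..k {
        let j = i + (splitmix64(&mut state) % (n - i) as u64) as usize;
        indices.swap(i, j);
    }
    indices.truncate(k);
    indices.sort_unstable();
    indices
}
```

Calling `sample_row_groups("a.parquet", 100, 0.001)` returns a single index because of the `max(1, ...)` floor, and repeating the call with the same arguments returns the same index.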
   
   `with_row_fraction(fraction)`:

   * Translates the per-row-group target into K contiguous windows
     spread evenly across the row group, each placed at a random
     offset within its stride. Window count = `ceil(target / cluster_size)`.
   * Materializes a `RowSelection` per kept row group; the parquet
     reader uses the page index to read only the data pages covering
     the selected rows. This gives "page-level" IO savings without
     requiring page boundaries to line up across columns (which
     parquet does not guarantee).
   * Falls back gracefully when the page index is missing: the reader
     still returns the right rows; only the IO win disappears.
   * Deterministic per `(file_name, row_group_index, fraction, cluster_size)`.
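
A rough sketch of the window placement described above, using only the standard library. Everything here is an assumption made for illustration: the `Run` enum is a local stand-in for the parquet crate's skip/select selectors, the function names are hypothetical, the target is taken to be `ceil(rows * fraction)`, and each window is clamped to its stride. The PR's actual rounding and window lengths may differ.

```rust
/// Local stand-in for a skip/select run (not the parquet crate's type).
#[derive(Debug, Clone, PartialEq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// SplitMix64: a tiny deterministic generator used in place of `rand`.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Returns `(start, len)` windows in ascending, non-overlapping order.
fn sample_windows(rows: usize, fraction: f64, cluster_size: usize, seed: u64) -> Vec<(usize, usize)> {
    if rows == 0 || cluster_size == 0 || fraction.is_nan() || fraction <= 0.0 {
        return Vec::new();
    }
    if fraction >= 1.0 {
        return vec![(0, rows)];
    }
    // Assumed rounding: target rows = ceil(rows * fraction), at least 1.
    let target = ((rows as f64 * fraction).ceil() as usize).clamp(1, rows);
    // Window count = ceil(target / cluster_size); target >= 1 here.
    let k = (target - 1) / cluster_size + 1;
    let mut state = seed;
    let mut remaining = target;
    let mut windows = Vec::with_capacity(k);
    for i in 0..k {
        // Stride i covers rows [i * rows / k, (i + 1) * rows / k).
        let stride_start = i * rows / k;
        let stride_len = (i + 1) * rows / k - stride_start;
        let len = cluster_size.min(stride_len).min(remaining);
        if len == 0 {
            continue;
        }
        // Random offset that keeps the window inside its stride.
        let slack = stride_len - len;
        let offset = (splitmix64(&mut state) % (slack as u64 + 1)) as usize;
        windows.push((stride_start + offset, len));
        remaining -= len;
    }
    windows
}

/// Converts windows into alternating skip/select runs covering `rows`.
fn to_runs(rows: usize, windows: &[(usize, usize)]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut cursor = 0;
    for &(start, len) in windows {
        if start > cursor {
            runs.push(Run::Skip(start - cursor));
        }
        runs.push(Run::Select(len));
        cursor = start + len;
    }
    if cursor < rows {
        runs.push(Run::Skip(rows - cursor));
    }
    runs
}
```

Under these assumptions, 100 rows with `fraction = 0.1` and `cluster_size = 4` yield three windows of 4, 4, and 2 rows, one per third of the row group. Because each window stays inside its own stride, the sample cannot bunch up at one end.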
   
   The two layers compose: `row_group_fraction = 0.1` × `row_fraction = 0.1`
   reads ~1% of the file's rows in total (~10% of the rows in ~10% of the
   row groups), with windows spread out so the sample isn't clustered at
   one end of each row group.
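
The composition arithmetic can be checked with a small helper. This is a back-of-the-envelope sketch: `sampled_rows` is a hypothetical name, it assumes uniformly sized row groups, and it assumes `ceil` rounding for the per-row-group row target.

```rust
/// Approximate rows read from one file when both layers are enabled.
/// Assumes every row group holds `rows_per_group` rows.
fn sampled_rows(
    row_groups: usize,
    rows_per_group: usize,
    rg_fraction: f64,
    row_fraction: f64,
) -> usize {
    if row_groups == 0 {
        return 0;
    }
    // Row-group layer: max(1, ceil(N * fraction)), capped at N.
    let kept_groups =
        ((row_groups as f64 * rg_fraction).ceil() as usize).clamp(1, row_groups);
    // Row layer (assumed rounding): ceil(rows * fraction), capped at rows.
    let rows_each =
        ((rows_per_group as f64 * row_fraction).ceil() as usize).min(rows_per_group);
    kept_groups * rows_each
}
```

For 100 row groups of 10,000 rows, `sampled_rows(100, 10_000, 0.1, 0.1)` gives 10,000 rows, which is 1% of the file. For a file with a single row group the at-least-one floor applies, so the same fractions give 1,000 of 10,000 rows (10%). Small files are therefore sampled more heavily than the nominal product suggests.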
   
   ### Why no `with_file_sampling`?

   `ParquetSource` doesn't own the file list; that lives on
   `FileScanConfig.file_groups`. A file-fraction setter here would have
   been a no-op. Callers that want to drop files should rebuild the
   `FileScanConfig` with a reduced `file_groups` directly.
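
One way a caller might thin the file list before rebuilding the config is a deterministic hash threshold per path. The sketch below is illustrative only: it models `file_groups` as plain `Vec<Vec<String>>` rather than DataFusion's real types, and `keep_file` / `sample_file_groups` are hypothetical helpers that are not part of this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Keeps a file when its path hash, mapped to [0, 1], falls below `fraction`.
/// The decision depends only on the path, so it is stable across runs
/// (within one Rust release, since DefaultHasher's algorithm may change).
fn keep_file(path: &str, fraction: f64) -> bool {
    if fraction >= 1.0 {
        return true;
    }
    let mut h = DefaultHasher::new();
    path.hash(&mut h);
    let unit = h.finish() as f64 / u64::MAX as f64;
    unit < fraction
}

/// Returns a reduced copy of the file groups, dropping groups left empty.
fn sample_file_groups(file_groups: &[Vec<String>], fraction: f64) -> Vec<Vec<String>> {
    file_groups
        .iter()
        .map(|group| {
            group
                .iter()
                .filter(|path| keep_file(path.as_str(), fraction))
                .cloned()
                .collect::<Vec<_>>()
        })
        .filter(|group| !group.is_empty())
        .collect()
}
```

Unlike the row-group layer, this threshold approach keeps each file independently, so the kept count only approximates `fraction` and can be zero for small file lists.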
   
   ### Internals
   
   * New `ParquetSampling` struct (re-exported at the crate root).
   * Plumbed through `ParquetMorselizer` → `PreparedParquetOpen`.
   * Two free functions in `opener.rs`, `apply_row_group_sampling` and
     `apply_row_fraction_sampling`, invoked from `prune_row_groups`
     right after `create_initial_plan`.
   * New dep: `rand` with the `small_rng` feature (already in workspace
     Cargo.toml).
   
   ## Are these changes tested?
   
   7 tests in `datafusion-datasource-parquet::opener::test`:

   * `row_group_sampling_keeps_target_count`: `ceil(N * fraction)` math.
   * `row_group_sampling_is_deterministic`: same inputs, same selection.
   * `row_group_sampling_differs_per_file`: different `file_name`, different sample.
   * `row_group_sampling_no_op_when_fraction_is_one`: `fraction >= 1.0` keeps everything.
   * `row_group_sampling_target_at_least_one`: `fraction = 0.001` over 100 row groups still keeps 1.
   * `row_group_sampling_end_to_end`: writes a 4-row-group parquet to `InMemory`, scans with `fraction = 0.5`, asserts exactly 6 rows out (2 row groups × 3 rows).
   * `row_fraction_end_to_end`: writes a 100-row single-row-group parquet, scans with `row_fraction = 0.1` and `cluster_size = 4`, asserts the returned row count is in the `(1, 16]` range.
   
   `cargo build --workspace`, `cargo fmt --all`, and
   `cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
   are clean.
   
   ## Are there any user-facing changes?
   
   API additions only, all opt-in. No existing callers see behavior
   changes. Public items added: `ParquetSampling` (struct),
   `ParquetSource::with_row_group_sampling`,
   `ParquetSource::with_row_fraction`,
   `ParquetSource::with_row_cluster_size`,
   `ParquetSource::sampling()`.
   
   ## Companion work
   
   A consumer-side sketch (a sampled-stats helper that uses these
   primitives to fill `Absent` slots in the [stats-request /
   stats-response API #21996](https://github.com/apache/datafusion/pull/21996))
   lives on
   [`worktree-stats-mini-query-poc`](https://github.com/pydantic/datafusion/tree/worktree-stats-mini-query-poc).
   That work is **not** in scope here; this PR is just the parquet
   sampling primitives. They stand on their own (e.g. for `TABLESAMPLE`)
   regardless of whether #21996 lands.

