Dandandan opened a new pull request, #21766:
URL: https://github.com/apache/datafusion/pull/21766

   ## Which issue does this PR close?
   
   Follow-up to #21351 (Dynamic work scheduling in FileStream), which closed 
#20529 and explicitly deferred *"splitting files into smaller units (e.g. 
across row groups)"* as future work. This PR implements that.
   
   - Closes #.
   
   ## Rationale for this change
   
   With #21351, sibling FileStreams already steal **whole files** from a 
`SharedWorkSource` queue. But a single large parquet file still bottlenecks on 
one worker — the other N−1 sibling partitions sit idle even though each row 
group is independently readable. This shows up on single-file queries 
(ClickBench-style) and on the long-tail large-file case in multi-file scans.
   
   This PR adds row-group granularity: the worker that pops a file donates its 
other row groups back to the shared queue so idle siblings steal them.
   
   ## What changes are included in this PR?
   
   **Donation path** (`datafusion/datasource-parquet/src/opener.rs`):
   - New `ParquetOpenState::SplitAndDonate` state between `LoadMetadata` 
and `PrepareFilters`. After metadata load, the donor keeps the first eligible 
row group; each remaining one is pushed to the front of the shared queue as a 
`PartitionedFile` clone whose `range` is a one-byte `FileRange` at that 
row group's starting offset.
   - The existing `prune_by_range` path matches that offset and scopes the 
stealer to exactly that row group: no new extension types, no metadata carried 
through `PartitionedFile.extensions`, no access-plan donation.
   - If the caller pre-narrowed the scan with a `file_range` that still spans 
multiple row groups (byte-range file partitioning), splitting stays **inside** 
that range: donated ranges remain subsets of the caller's range.
   - Guards:
     - Caller-supplied `ParquetAccessPlan` in `extensions` → respected 
as-is, no donation.
     - Single row group in scope (whole file, or caller range isolating one RG) 
→ no donation.
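
The donation rules above can be sketched in isolation. This is a minimal illustration, not the code in `opener.rs`: `ByteRange`, `WorkItem`, `in_scope`, and `split_and_donate` are hypothetical stand-ins for `FileRange`, `PartitionedFile`, the `prune_by_range` check, and the new state. It assumes a row group counts as in scope when its starting offset falls inside the half-open range, and that the donor narrows itself to a one-byte range in the same way as the donated items.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a half-open byte range `[start, end)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Hypothetical stand-in for a unit of scan work: a file plus an optional range.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WorkItem {
    path: String,
    range: Option<ByteRange>,
}

/// Assumption: a row group is in scope when its starting offset lies inside
/// the range, or when no range was supplied at all.
fn in_scope(rg_start: u64, range: Option<ByteRange>) -> bool {
    range.map_or(true, |r| rg_start >= r.start && rg_start < r.end)
}

/// Keeps the first in-scope row group for the donor and pushes one-byte-range
/// items for the rest onto the front of `queue`, preserving file order.
/// Returns the item the donor scans itself.
fn split_and_donate(
    item: &WorkItem,
    rg_starts: &[u64],
    has_caller_access_plan: bool,
    queue: &mut VecDeque<WorkItem>,
) -> WorkItem {
    let eligible: Vec<u64> = rg_starts
        .iter()
        .copied()
        .filter(|&start| in_scope(start, item.range))
        .collect();

    // Guards: a caller-supplied access plan, or fewer than two row groups
    // in scope, means there is nothing to donate.
    if has_caller_access_plan || eligible.len() < 2 {
        return item.clone();
    }

    let one_byte = |start: u64| WorkItem {
        path: item.path.clone(),
        range: Some(ByteRange { start, end: start + 1 }),
    };

    // Push in reverse so the front of the queue ends up in file order.
    for &start in eligible[1..].iter().rev() {
        queue.push_front(one_byte(start));
    }
    one_byte(eligible[0])
}
```

Because every donated range starts at an offset that was already inside the caller's range, a donated item can never select a row group the caller excluded.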
   
   **Shared queue plumbing**:
   - `SharedWorkSource` is now `pub`; gains `push_front(items)`, 
`pop_front()`, and `Default`.
   - `FileSource::create_morselizer` takes an extra 
`Option<SharedWorkSource>` parameter so format-specific morselizers can 
participate in donation. Non-parquet sources ignore it.
   - `row_group_start_offset` helper is extracted into 
`row_group_filter.rs` and reused by both `prune_by_range` and the new 
donation path.
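
For readers unfamiliar with the pattern, a shared queue of this shape might look like the sketch below. `SharedQueue` is a hypothetical stand-in, not the actual `SharedWorkSource`: the `Arc<Mutex<VecDeque<_>>>` layout and the exact signatures are assumptions, and only the method names `push_front` and `pop_front` come from the description above.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical shared work queue; clones share the same underlying deque.
struct SharedQueue<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

// Manual impls avoid the `T: Clone` / `T: Default` bounds that `derive` adds.
impl<T> Clone for SharedQueue<T> {
    fn clone(&self) -> Self {
        Self { inner: Arc::clone(&self.inner) }
    }
}

impl<T> Default for SharedQueue<T> {
    fn default() -> Self {
        Self { inner: Arc::new(Mutex::new(VecDeque::new())) }
    }
}

impl<T> SharedQueue<T> {
    /// Pushes `items` so that they are popped next, in the order given.
    fn push_front(&self, items: impl IntoIterator<Item = T>) {
        let items: Vec<T> = items.into_iter().collect();
        let mut queue = self.inner.lock().unwrap();
        // Insert back to front so the first item ends up at the head.
        for item in items.into_iter().rev() {
            queue.push_front(item);
        }
    }

    /// Takes the next unit of work, or `None` if the queue is empty right now.
    fn pop_front(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```

Pushing donated row groups to the front rather than the back means siblings pick up the rest of the file that is already being read before moving on to unopened files.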
   
   **Trade-offs** (v1):
   - Stealers re-read the parquet footer for their chunk. Object stores 
typically cache the range so this is cheap; carrying loaded metadata across 
siblings is left for a follow-up.
   - If a sibling drains the shared queue *before* the donor has donated, that 
sibling terminates (it observes an empty queue at `ScanAndReturn::Done` in 
`scan_state.rs`). Accepted for v1; fixing it requires splitter handles or a 
queue wakeup, which can be added separately.
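
The early-drain case in the second trade-off can be replayed deterministically. The sketch below is a single-threaded simulation under assumed names (`Worker`, `poll`, `early_drain_scenario` are all hypothetical); it models only the ordering problem, not the real `FileStream` state machine.

```rust
use std::collections::VecDeque;

/// Hypothetical worker that stops for good the first time it sees an empty queue.
struct Worker {
    scanned: Vec<&'static str>,
    done: bool,
}

impl Worker {
    fn new() -> Self {
        Self { scanned: Vec::new(), done: false }
    }

    /// Takes one item if available; otherwise marks this worker as finished.
    fn poll(&mut self, queue: &mut VecDeque<&'static str>) {
        if self.done {
            return;
        }
        match queue.pop_front() {
            Some(item) => self.scanned.push(item),
            None => self.done = true,
        }
    }
}

/// Replays the ordering where the sibling polls before the donor has donated.
/// Returns what the donor and the sibling each ended up scanning.
fn early_drain_scenario() -> (Vec<&'static str>, Vec<&'static str>) {
    let mut queue = VecDeque::from(vec!["big.parquet"]);
    let mut donor = Worker::new();
    let mut sibling = Worker::new();

    donor.poll(&mut queue); // donor pops the only file
    sibling.poll(&mut queue); // sibling sees an empty queue and terminates

    // Donation happens only now, after the sibling has given up.
    queue.push_front("big.parquet#rg2");
    queue.push_front("big.parquet#rg1");

    sibling.poll(&mut queue); // no-op: the sibling is already done
    donor.poll(&mut queue);
    donor.poll(&mut queue);
    (donor.scanned, sibling.scanned)
}
```

In this ordering the donor ends up reading every row group itself and the sibling reads none, which is the cost v1 accepts.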
   
   ## Are these changes tested?
   
   Yes. Five new unit tests in `datafusion/datasource-parquet/src/opener.rs`:
   - `row_group_split_donates_remaining_row_groups`: donor reads RG 0; three 
donated chunks each read exactly their row group, in file order.
   - `row_group_split_skips_single_row_group_file`: no donation when the 
file has one row group.
   - `row_group_split_respects_caller_access_plan`: `ParquetAccessPlan` in 
extensions suppresses donation; caller plan executes as specified.
   - `row_group_split_within_caller_file_range`: caller byte range covering 
all RGs is split; donated ranges stay inside the caller range.
   - `row_group_split_skips_when_caller_range_covers_single_row_group`: 
narrow caller range isolating one RG suppresses donation.
   
   All existing `datafusion-datasource` and `datafusion-datasource-parquet` 
tests continue to pass. 
`cargo clippy --all-targets --all-features -- -D warnings` is clean on both 
crates.
   
   ## Are there any user-facing changes?
   
   Performance only: faster single-file and tail-file scans under sibling work 
stealing. No semantic or API changes visible to SQL users. `SharedWorkSource` 
becomes `pub` (it was `pub(crate)`); `FileSource::create_morselizer` 
gains one parameter, which default implementations ignore.
   
   ---
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

