zhuqi-lucas opened a new pull request, #10158:
URL: https://github.com/apache/arrow-rs/pull/10158

   # Which issue does this PR close?
   
   Closes #10148.
   
   # Rationale for this change
   
   Adaptive callers that maintain per-row-group state in lock-step with the
decoder (e.g. dynamic row-group pruners that re-evaluate row-group statistics
mid-scan, or per-RG `RowFilter` toggles that skip per-row evaluation when
stats prove every row matches) currently have no way to know which row group
the next reader will correspond to. `try_next_reader` can silently advance
past row groups whose row selection is empty under the current
`with_row_selection`, breaking the assumption that the queue of indices
passed to `with_row_groups` maps 1:1 to the readers handed back.
   
   This is the API that DataFusion's
[#22450](https://github.com/apache/datafusion/pull/22450) (TopK runtime
row-group pruning) needs in order to enable a per-RG fully-matched `RowFilter`
skip optimization, which the old `split_runs` design used to provide.
   
   # What changes are included in this PR?
   
   A new public method on `ParquetPushDecoder`:
   
   ```rust
   pub fn peek_next_row_group(&self) -> Option<usize>
   ```
   
   Returns the file-level row-group index that the next call to
`try_next_reader` will yield a reader for, after applying any internal
skipping (row selection emptiness, exhausted offset/limit budget). Returns
`None` when no row groups remain, when the decoder sits inside a row group,
or when every remaining row group would be skipped.
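
As a rough illustration of the intended calling pattern, the sketch below models a caller that peeks before requesting each reader so that its per-row-group state stays aligned with what the decoder hands back. `ToyDecoder`, `scan`, and the `fully_matched` slice are hypothetical stand-ins written for this example; the toy `try_next_reader` returns a plain row count rather than the real decoder's result type, and only the name `peek_next_row_group` and its `Option<usize>` return type come from this PR.

```rust
/// Toy stand-in for the decoder (hypothetical, not the parquet crate type).
/// Each queued entry is (file-level row-group index, number of selected rows).
struct ToyDecoder {
    queue: Vec<(usize, usize)>,
    pos: usize,
}

impl ToyDecoder {
    /// Index of the row group the next reader would belong to,
    /// looking past row groups in which nothing is selected.
    fn peek_next_row_group(&self) -> Option<usize> {
        self.queue[self.pos..]
            .iter()
            .find(|&&(_, selected)| selected > 0)
            .map(|&(idx, _)| idx)
    }

    /// Hand back the next "reader" (here just its selected row count),
    /// silently advancing past row groups with nothing selected.
    fn try_next_reader(&mut self) -> Option<usize> {
        while self.pos < self.queue.len() {
            let (_, selected) = self.queue[self.pos];
            self.pos += 1;
            if selected > 0 {
                return Some(selected);
            }
        }
        None
    }
}

/// Pair every reader with the row group it came from by peeking first.
/// `fully_matched[rg]` is assumed per-RG caller state (e.g. "stats prove
/// every row matches, so the row filter can be skipped").
fn scan(decoder: &mut ToyDecoder, fully_matched: &[bool]) -> Vec<(usize, bool, usize)> {
    let mut out = Vec::new();
    while let Some(rg) = decoder.peek_next_row_group() {
        let skip_filter = fully_matched[rg];
        let rows = decoder
            .try_next_reader()
            .expect("peek reported a row group, so a reader must follow");
        out.push((rg, skip_filter, rows));
    }
    out
}
```

Without the peek, a caller that simply popped its own queue of indices would attribute the first reader to row group 0 even when the decoder had skipped it.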
   
   # Implementation
   
   `RowGroupFrontier::peek_next_row_group` clones the offset/limit budget and
the row selection, then runs the same `split_off` walk that
`next_readable_row_group` performs internally, returning the first row-group
index whose simulated selection is non-empty after the budget is applied (or,
with predicates, the first index whose selection is non-empty regardless of
budget). The clone keeps the call read-only; the cost is a single extra
`RowSelection::clone` per peek.
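
The clone-then-walk idea can be sketched with a simplified, standalone model. This is not the code in this PR: `ToySelection` and `ToyFrontier` are hypothetical types, the selection is reduced to a queue of `(skip, row_count)` runs, and the offset/limit budget and predicate handling are left out entirely.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a row selection: runs of (skip, row_count).
/// Not the parquet crate's `RowSelection`.
#[derive(Clone, Debug)]
struct ToySelection {
    runs: VecDeque<(bool, usize)>,
}

impl ToySelection {
    /// Consume the first `n` rows and return how many of them are selected.
    /// Rows past the end of the selection count as unselected in this model.
    fn split_off_selected(&mut self, mut n: usize) -> usize {
        let mut selected = 0;
        while n > 0 {
            let Some((skip, len)) = self.runs.pop_front() else {
                break;
            };
            let take = len.min(n);
            if !skip {
                selected += take;
            }
            if len > take {
                // Put the unconsumed tail of this run back for the next row group.
                self.runs.push_front((skip, len - take));
            }
            n -= take;
        }
        selected
    }
}

/// Hypothetical frontier: queued file-level indices plus per-group row counts.
struct ToyFrontier {
    row_groups: VecDeque<usize>,
    rows_per_group: Vec<usize>,
    selection: Option<ToySelection>,
}

impl ToyFrontier {
    /// Walk a *clone* of the selection, so `&self` stays untouched, and
    /// return the first queued row group with at least one selected row.
    fn peek_next_row_group(&self) -> Option<usize> {
        let mut selection = self.selection.clone();
        for &idx in &self.row_groups {
            let rows = self.rows_per_group[idx];
            let selected = match selection.as_mut() {
                Some(sel) => sel.split_off_selected(rows),
                None => rows, // no selection: every row is selected
            };
            if selected > 0 {
                return Some(idx);
            }
        }
        None
    }
}
```

In this model, two 10-row groups with a selection of "skip 15, select 5" make the peek report `Some(1)`, and repeated peeks give the same answer because only the clone is consumed.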
   
   # Are these changes tested?
   
   Yes, four new lib tests:
   
   - `test_peek_next_row_group_basic`: peek before / between / after readers
on the 2-RG fixture.
   - `test_peek_next_row_group_respects_with_row_groups`: explicit
`with_row_groups([1])` reports `Some(1)`.
   - `test_peek_next_row_group_skips_empty_selection`: a `RowSelection`
that skips all of RG 0 + part of RG 1 makes peek report `Some(1)`, mirroring
`next_readable_row_group`'s skip behavior.
   - `test_peek_next_row_group_finished`: an empty `with_row_groups` (or a
finished decoder) returns `None`.
   
   All 1219 existing parquet lib tests still pass.
   
   # Are there any user-facing changes?
   
   One new public method on `ParquetPushDecoder`. No existing API is changed;
nothing breaks.

