zhuqi-lucas opened a new pull request, #10158: URL: https://github.com/apache/arrow-rs/pull/10158
# Which issue does this PR close?

Closes #10148.

# Rationale for this change

Adaptive callers that maintain per-row-group state in lock-step with the decoder currently have no way to know which row group the next reader will correspond to. Examples are dynamic row-group pruners that re-evaluate row-group statistics mid-scan, and per-RG `RowFilter` toggles that skip per-row evaluation when stats prove every row matches. `try_next_reader` can silently advance past row groups whose row selection is empty under the current `with_row_selection`, breaking the assumption that the queue of indices passed to `with_row_groups` maps 1:1 to the readers handed back.

This is the API that DataFusion's [#22450](https://github.com/apache/datafusion/pull/22450) (TopK runtime row-group pruning) needs in order to enable a per-RG fully-matched `RowFilter` skip optimization that the old `split_runs` design provided.

# What changes are included in this PR?

A new public method on `ParquetPushDecoder`:

```rust
pub fn peek_next_row_group(&self) -> Option<usize>
```

Returns the file-level row-group index that the next call to `try_next_reader` will yield a reader for, after applying any internal skipping (row selection emptiness, exhausted offset/limit budget). Returns `None` when no row groups remain, when the decoder sits inside a row group, or when every remaining row group would be skipped.

# Implementation

`RowGroupFrontier::peek_next_row_group` clones the offset/limit budget and the row selection, then runs the same `split_off` walk that `next_readable_row_group` performs internally. It returns the first row-group index whose simulated selection is non-empty (or, with predicates, the first index whose selection is non-empty regardless of budget). The clone keeps the call read-only; the cost is a single extra `RowSelection::clone` per peek.

# Are these changes tested?

Yes, four new lib tests:

- `test_peek_next_row_group_basic`: peek before / between / after readers on the 2-RG fixture.
- `test_peek_next_row_group_respects_with_row_groups`: explicit `with_row_groups([1])` reports `Some(1)`.
- `test_peek_next_row_group_skips_empty_selection`: a `RowSelection` that skips all of RG 0 + part of RG 1 makes peek report `Some(1)`, mirroring `next_readable_row_group`'s skip behavior.
- `test_peek_next_row_group_finished`: an empty `with_row_groups` (or a finished decoder) returns `None`.

All 1219 existing parquet lib tests still pass.

# Are there any user-facing changes?

One new public method on `ParquetPushDecoder`. No existing API is changed; nothing breaks.
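
# Illustrative sketch of the peek walk

The clone-then-walk idea from the Implementation section can be illustrated with a toy model. This is a minimal sketch, not the arrow-rs code: `ToySelection`, `ToyFrontier`, and `split_off_selected` are hypothetical stand-ins for `RowSelection`, `RowGroupFrontier`, and the `split_off` walk. The offset handling (skipping a row group whose selected rows are all consumed by the offset) is an assumed simplification, and predicates are not modeled.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a run-length row selection:
/// each entry is (selected?, run length in rows).
#[derive(Clone, Debug)]
struct ToySelection {
    runs: VecDeque<(bool, usize)>,
}

impl ToySelection {
    /// Consume the first `n` rows of the selection and return how many
    /// of those rows were selected.
    fn split_off_selected(&mut self, mut n: usize) -> usize {
        let mut selected = 0;
        while n > 0 {
            let Some((keep, len)) = self.runs.pop_front() else {
                break; // selection exhausted: remaining rows count as unselected
            };
            let take = len.min(n);
            if keep {
                selected += take;
            }
            if len > take {
                // Put back the part of the run that belongs to later row groups.
                self.runs.push_front((keep, len - take));
            }
            n -= take;
        }
        selected
    }
}

/// Hypothetical stand-in for the decoder's queue of pending row groups.
#[derive(Clone, Debug)]
struct ToyFrontier {
    /// (file-level row-group index, row count) for each queued row group.
    row_groups: VecDeque<(usize, usize)>,
    selection: Option<ToySelection>,
    /// Selected rows still to be skipped before any row is produced.
    offset: usize,
    /// Rows still allowed to be produced; `None` means unlimited.
    limit: Option<usize>,
}

impl ToyFrontier {
    /// Read-only peek: simulate the skip walk on cloned state and return
    /// the first row-group index that would yield a reader.
    fn peek_next_row_group(&self) -> Option<usize> {
        if self.limit == Some(0) {
            return None; // limit budget exhausted: nothing more is readable
        }
        let mut selection = self.selection.clone();
        let mut offset = self.offset;
        for &(index, rows) in &self.row_groups {
            let selected = match selection.as_mut() {
                Some(sel) => sel.split_off_selected(rows),
                None => rows,
            };
            if selected == 0 {
                continue; // empty selection: this row group is skipped
            }
            if selected <= offset {
                offset -= selected; // fully consumed by the offset (assumption)
                continue;
            }
            return Some(index);
        }
        None
    }
}
```

Because the walk operates on clones, calling `peek_next_row_group` repeatedly on the same `ToyFrontier` returns the same answer and leaves the queued state untouched, which is the property an adaptive caller needs in order to stay in lock-step with the readers it receives.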
