adriangb commented on code in PR #19557:
URL: https://github.com/apache/datafusion/pull/19557#discussion_r2652244418
##########
datafusion/datasource-parquet/src/sort.rs:
##########
@@ -31,22 +31,34 @@ use std::collections::HashMap;
/// 3. Reconstructs the row selection for the new order
///
/// # Arguments
-/// * `row_selection` - Original row selection
+/// * `row_selection` - Original row selection (only covers row groups that
are scanned)
/// * `parquet_metadata` - Metadata containing row group information
+/// * `row_groups_to_scan` - Indexes of row groups that will be scanned (in
original order)
///
/// # Returns
/// A new `RowSelection` adjusted for reversed row group order
+///
+/// # Important Notes
+/// The input `row_selection` only covers the row groups specified in
`row_groups_to_scan`.
+/// Row groups that are skipped (not in `row_groups_to_scan`) are not
represented in the
+/// `row_selection` at all. This function needs `row_groups_to_scan` to
correctly map
+/// the selection back to the original row groups.
pub fn reverse_row_selection(
row_selection: &RowSelection,
parquet_metadata: &ParquetMetaData,
+ row_groups_to_scan: &[usize],
) -> Result<RowSelection> {
let rg_metadata = parquet_metadata.row_groups();
- // Build a mapping of row group index to its row range in the file
+ // Build a mapping of row group index to its row range, but ONLY for
+ // the row groups that are actually being scanned.
+ // The row numbers in this mapping are relative to the scanned row groups,
+ // not the entire file.
let mut rg_row_ranges: Vec<(usize, usize, usize)> =
- Vec::with_capacity(rg_metadata.len());
+ Vec::with_capacity(row_groups_to_scan.len());
let mut current_row = 0;
Review Comment:
Is the tracking for `current_row` still correct? We used to scan every row
so I think it was correct, but now we if we skip a row group won't
`current_row` be incorrect globally?
##########
datafusion/datasource-parquet/src/sort.rs:
##########
@@ -153,7 +166,10 @@ mod tests {
let selection =
RowSelection::from(vec![RowSelector::select(50),
RowSelector::skip(250)]);
- let reversed = reverse_row_selection(&selection, &metadata).unwrap();
+ let row_groups_to_scan = vec![0, 1, 2];
Review Comment:
Isn't `row_groups_to_scan` derived from the `RowSelection`? So in this case
it should be `vec![0]`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]