kosiew commented on code in PR #22940:
URL: https://github.com/apache/datafusion/pull/22940#discussion_r3426502695


##########
datafusion/datasource-parquet/src/access_plan.rs:
##########
@@ -169,6 +204,110 @@ impl ParquetAccessPlan {
         }
     }
 
+    /// Create a new `ParquetAccessPlan` from a file-level [`RowSelection`].
+    ///
+    /// The selection is interpreted across all rows in the file, in row group
+    /// order, and is split into row-group level access using 
`row_group_meta_data`.
+    /// Fully skipped row groups become [`RowGroupAccess::Skip`], fully 
selected
+    /// row groups become [`RowGroupAccess::Scan`], and partially selected row
+    /// groups become [`RowGroupAccess::Selection`].
+    ///
+    /// # Errors
+    ///
+    /// Returns an error if the selection does not specify exactly the same
+    /// number of rows as the file metadata.
+    pub fn try_new_from_overall_row_selection(
+        selection: RowSelection,
+        row_group_meta_data: &[RowGroupMetaData],
+    ) -> Result<Self> {
+        let selectors: Vec<RowSelector> = selection.into();

Review Comment:
   Nice work here. I think this could be simplified a bit by leaning on 
`RowSelection::split_off`, since it already handles the boundary splitting 
invariant.
   
   One possible shape would be to keep `let mut remaining_selection = 
selection;`, then for each row group call `let group_selection = 
remaining_selection.split_off(rg_rows);`. From there, derive `selected = 
group_selection.row_count()` and `skipped = 
group_selection.skipped_row_count()`, then map `(selected, skipped)` to `Skip`, 
`Scan`, or `Selection(group_selection)`.
   
   After the loop, the total row count check could also be more direct by 
validating `remaining_selection.row_count() + 
remaining_selection.skipped_row_count() == 0` along with the accumulated file 
count. That would remove the manual `current` / `leading` / `mixed` cursor 
state machine.



##########
datafusion/datasource-parquet/src/opener/mod.rs:
##########
@@ -1515,25 +1515,44 @@ fn constant_value_from_stats(
 
 /// Return the initial [`ParquetAccessPlan`]
 ///
-/// If the user has supplied one as an extension, use that
-/// otherwise return a plan that scans all row groups
+/// If the user has supplied a parquet access extension, use that; otherwise
+/// return a plan that scans all row groups.
 ///
-/// Returns an error if an invalid `ParquetAccessPlan` is provided
+/// Returns an error if an invalid parquet access extension is provided.
 ///
 /// Note: file_name is only used for error messages
 fn create_initial_plan(
     file_name: &str,
     extensions: &datafusion_datasource::FileExtensions,
-    row_group_count: usize,
+    rg_metadata: &[RowGroupMetaData],
 ) -> Result<ParquetAccessPlan> {
-    if let Some(access_plan) = extensions.get::<ParquetAccessPlan>() {
-        let plan_len = access_plan.len();
-        if plan_len != row_group_count {
+    let row_group_count = rg_metadata.len();
+    match (

Review Comment:
   This match might read a little more cleanly if it returned the `Result` 
directly instead of using early `return`s plus a trailing default branch.
   
   Something like:
   
   ```rust
   match (...) {
       (Some(_), Some(_)) => exec_err!(...),
       (Some(access_plan), None) => {
           ...
           Ok(access_plan.clone())
       }
       (None, Some(row_selection)) => 
ParquetAccessPlan::try_new_from_overall_row_selection(...),
       (None, None) => Ok(ParquetAccessPlan::new_all(row_group_count)),
   }
   ```
   
   That keeps all extension cases in one expression and avoids the extra 
fallthrough branch.



##########
datafusion/datasource-parquet/src/opener/mod.rs:
##########
@@ -1655,6 +1676,87 @@ mod test {
         preserve_order: bool,
     }
 
+    #[test]
+    fn create_initial_plan_from_parquet_row_selection_extension() {

Review Comment:
   Could we add one end-to-end scan test that attaches `ParquetRowSelection` to 
a `PartitionedFile` and asserts the returned rows?
   
   The unit tests cover conversion and mutual exclusion well, but since this is 
a public extension contract, it would be great to have one test proving the 
extension survives the path from `PartitionedFile` into the parquet opener and 
reader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to