friendlymatthew commented on code in PR #20854:
URL: https://github.com/apache/datafusion/pull/20854#discussion_r2918863264
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -251,15 +251,26 @@ impl FilterCandidateBuilder {
             return Ok(None);
         };
+        let schema_descr = metadata.file_metadata().schema_descr();
         let root_indices: Vec<_> =
             required_columns.required_columns.into_iter().collect();
-        let leaf_indices = leaf_indices_for_roots(
-            &root_indices,
-            metadata.file_metadata().schema_descr(),
+        let mut leaf_indices = leaf_indices_for_roots(&root_indices, schema_descr);
+
+        let struct_leaf_indices = resolve_struct_field_leaves(
+            &required_columns.struct_field_accesses,
+            &self.file_schema,
+            schema_descr,
         );
+        leaf_indices.extend_from_slice(&struct_leaf_indices);
+        leaf_indices.sort_unstable();
Review Comment:
The sort and dedup are necessary because we concatenate the outputs of
`leaf_indices_for_roots` and `resolve_struct_field_leaves`. The first walks
parquet leaves 0..N collecting those that belong to regular (non-struct)
columns, and the second does the same for struct field accesses. Each produces
individually sorted output, but when a struct column appears before a regular
column in the schema, the struct's leaf indices are numerically lower, so
appending them after the regular-column indices leaves the combined vector
unsorted.
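
As a toy illustration (the schema, column names, and leaf indices below are
made up for this comment, not taken from the PR):

```rust
fn main() {
    // Hypothetical schema: s: Struct<val, tag> -> parquet leaves 0 and 1,
    //                      a: Int64            -> parquet leaf  2
    // Hypothetical filter: get_field(s, 'val') > 5 AND a = 1
    let regular_leaves = vec![2usize]; // what leaf_indices_for_roots would yield for `a`
    let struct_leaves = vec![0usize];  // what resolve_struct_field_leaves would yield for s.val

    let mut leaf_indices = regular_leaves;
    leaf_indices.extend_from_slice(&struct_leaves);
    assert_eq!(leaf_indices, vec![2, 0]); // concatenation is out of order

    leaf_indices.sort_unstable();
    assert_eq!(leaf_indices, vec![0, 2]); // sorted again after sort_unstable
}
```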
Dedup is needed because the same struct field can appear multiple times in a
filter expression like `get_field(s, 'val') > 5 and get_field(s, 'val') < 100`,
producing duplicate entries in `struct_field_accesses`. Without dedup, we would
double-count the compressed size of that column.
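
A minimal sketch of the dedup case, again with a made-up leaf index: both
`get_field(s, 'val')` predicates resolve to the same leaf, and `dedup` after
the sort collapses them so the leaf's size is only counted once:

```rust
fn main() {
    // Both get_field(s, 'val') predicates resolve to the same parquet leaf
    // (hypothetically leaf 0), so the raw list contains it twice.
    let mut leaf_indices = vec![0usize, 0];
    leaf_indices.sort_unstable();
    leaf_indices.dedup(); // without this, leaf 0's compressed size is counted twice
    assert_eq!(leaf_indices, vec![0]);
}
```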