Re: [PR] fix: UnnestExec preserves relevant equivalence properties of input [datafusion]

via GitHub Mon, 13 Oct 2025 09:29:08 -0700


tobixdev commented on code in PR #16985:
URL: https://github.com/apache/datafusion/pull/16985#discussion_r2426768993



##########
datafusion/physical-plan/src/unnest.rs:
##########
@@ -99,11 +107,64 @@ impl UnnestExec {
     /// This function creates the cache object that stores the plan properties 
such as schema, equivalence properties, ordering, partitioning, etc.
     fn compute_properties(
         input: &Arc<dyn ExecutionPlan>,
+        list_column_indices: &[ListUnnest],
+        struct_column_indices: &[usize],
         schema: SchemaRef,
     ) -> PlanProperties {
+        let list_column_indices: Vec<usize> = list_column_indices
+            .iter()
+            .map(|list_unnest| list_unnest.index_in_input_schema)
+            .collect();
+        let non_unnested_indices: Vec<usize> = input
+            .schema()
+            .fields()
+            .iter()
+            .enumerate()
+            .filter(|(idx, _)| {
+                !list_column_indices.contains(idx) && 
!struct_column_indices.contains(idx)

Review Comment:
   I think we have had multiple issues with quadratic planning time for a large 
amount of columns. I think we could get the same problem here as the `contains` 
is another linear scan, thus creating a quadratic runtime depending on the 
number of columns. I could also be wrong and this doesn't cause an issue.
   
   Maybe we could build a buffer and then simply index into the buffer on 
whether this is unnested (not tested):
   
   ```
           let input_schema = input.schema();
           let mut unnested_indices = BooleanBufferBuilder::new(input.len());
           unnested_indices.append_n(input.len(), false);
           for list_unnest in list_column_indices {
               unnested_indices.set_bit(list_unnest.index_in_input_schema, 
true);
           }
           for list_unnest in struct_column_indices {
               unnested_indices.set_bit(*list_unnest, true)
           }
           let unnested_indices = unnested_indices.finish();
   
           let non_unnested_indices: Vec<usize> = 
(0..input_schema.fields().len())
               .filter(|idx| !unnested_indices.value(*idx))
               .collect();
   ```
   
   Otherwise, I think changing the iterator to 
`(0..input_schema.fields().len())` would help with readability as you don't 
seem to be using the actual field.



##########
datafusion/physical-plan/src/unnest.rs:
##########
@@ -99,11 +107,64 @@ impl UnnestExec {
     /// This function creates the cache object that stores the plan properties 
such as schema, equivalence properties, ordering, partitioning, etc.
     fn compute_properties(
         input: &Arc<dyn ExecutionPlan>,
+        list_column_indices: &[ListUnnest],
+        struct_column_indices: &[usize],
         schema: SchemaRef,
     ) -> PlanProperties {
+        let list_column_indices: Vec<usize> = list_column_indices
+            .iter()
+            .map(|list_unnest| list_unnest.index_in_input_schema)
+            .collect();
+        let non_unnested_indices: Vec<usize> = input
+            .schema()
+            .fields()
+            .iter()
+            .enumerate()
+            .filter(|(idx, _)| {
+                !list_column_indices.contains(idx) && 
!struct_column_indices.contains(idx)
+            })
+            .map(|(idx, _)| idx)
+            .collect();
+
+        // Manually build projection mapping from non-unnested input columns 
to their positions in the output
+        let input_schema = input.schema();
+        let projection_mapping: ProjectionMapping = non_unnested_indices
+            .iter()
+            .map(|&input_idx| {
+                // Find what index the input column has in the output schema
+                let input_field = input_schema.field(input_idx);
+                let output_idx = schema
+                    .fields()
+                    .iter()
+                    .position(|output_field| output_field.name() == 
input_field.name())

Review Comment:
   After what I've seen, the unwrap cannot "overwrite" or replace an existing 
column, right? If this is the case, this should be good :+1: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: UnnestExec preserves relevant equivalence properties of input [datafusion]

Reply via email to