Re: [PR] feat(substrait): modular substrait producer [datafusion]

via GitHub Fri, 27 Dec 2024 13:42:51 -0800


vbarua commented on code in PR #13931:
URL: https://github.com/apache/datafusion/pull/13931#discussion_r1898697426



##########
datafusion/substrait/src/logical_plan/producer.rs:
##########
@@ -730,32 +1035,23 @@ fn to_substrait_named_struct(schema: &DFSchemaRef) -> 
Result<NamedStruct> {
 }
 
 fn to_substrait_join_expr(
-    state: &dyn SubstraitPlanningState,
+    producer: &mut impl SubstraitProducer,
     join_conditions: &Vec<(Expr, Expr)>,
     eq_op: Operator,
-    left_schema: &DFSchemaRef,
-    right_schema: &DFSchemaRef,
-    extensions: &mut Extensions,
+    join_schema: &DFSchemaRef,
 ) -> Result<Option<Expression>> {
     // Only support AND conjunction for each binary expression in join 
conditions
     let mut exprs: Vec<Expression> = vec![];
     for (left, right) in join_conditions {
-        // Parse left
-        let l = to_substrait_rex(state, left, left_schema, 0, extensions)?;
-        // Parse right
-        let r = to_substrait_rex(
-            state,
-            right,
-            right_schema,
-            left_schema.fields().len(), // offset to return the correct index
-            extensions,
-        )?;
+        let l = producer.consume_expr(left, join_schema)?;
+        let r = producer.consume_expr(right, join_schema)?;

Review Comment:
   We no longer need to track the column offset explicitly.
   
   The column offset code was added as part of 
https://github.com/apache/datafusion/pull/6135 to handle queries like
   ```sql
   SELECT d1.b, d2.c
   FROM data d1
   JOIN data d2 ON d1.b = d2.e
   ```
   which caused issue because the left and right inputs both had the same name. 
This could potentially cause column name references in DataFusion to converted 
incorrectly into Substrait column indices in some cases. Additionally, there 
were issues with duplicate schema errors.
   
   However, the introduction and usage of 
https://github.com/apache/datafusion/blob/a08dc0af8f798afe73a2fdf170ab36e72ad7e782/datafusion/substrait/src/logical_plan/consumer.rs#L1772-L1777
 which was needed because different tables might have columns with the same 
name, now means that we can concatenate the left and right schemas of a join 
without issues. Then if the DataFusion column references are unambiguous, we 
can simply look up the columns in the join schema to get the correct index.
   
   This was the only place were the column offset was used. Removing this here 
allowed me to remove the `col_ref_offset` argument from a number of functions, 
which IMO simplifies the API substantially.
   
   For further verification, a test has been added for in in 
`roundtrip_logical_plan.rs`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat(substrait): modular substrait producer [datafusion]

Reply via email to