Re: [PR] feat(substrait): modular substrait producer [datafusion]

via GitHub Mon, 30 Dec 2024 12:56:05 -0800


vbarua commented on code in PR #13931:
URL: https://github.com/apache/datafusion/pull/13931#discussion_r1899786865



##########
datafusion/substrait/src/logical_plan/producer.rs:
##########
@@ -730,32 +1035,23 @@ fn to_substrait_named_struct(schema: &DFSchemaRef) -> 
Result<NamedStruct> {
 }
 
 fn to_substrait_join_expr(
-    state: &dyn SubstraitPlanningState,
+    producer: &mut impl SubstraitProducer,
     join_conditions: &Vec<(Expr, Expr)>,
     eq_op: Operator,
-    left_schema: &DFSchemaRef,
-    right_schema: &DFSchemaRef,
-    extensions: &mut Extensions,
+    join_schema: &DFSchemaRef,
 ) -> Result<Option<Expression>> {
     // Only support AND conjunction for each binary expression in join 
conditions
     let mut exprs: Vec<Expression> = vec![];
     for (left, right) in join_conditions {
-        // Parse left
-        let l = to_substrait_rex(state, left, left_schema, 0, extensions)?;
-        // Parse right
-        let r = to_substrait_rex(
-            state,
-            right,
-            right_schema,
-            left_schema.fields().len(), // offset to return the correct index
-            extensions,
-        )?;
+        let l = producer.consume_expr(left, join_schema)?;
+        let r = producer.consume_expr(right, join_schema)?;

Review Comment:
   > Does this change affect the produced substrait plan?
   
   Checking with the test I added, it does not actually.
   
   Taking a step back as well. When I was updating this code I was curious why 
the code in https://github.com/apache/datafusion/pull/6135 was added, and I may 
have misunderstood what it was trying to fix.
   
   What I did notice when refactoring this code was that we were already 
concatenating the schemas and then using that to convert the filter on the join
   
https://github.com/apache/datafusion/blob/9b5995fa024d95c19e1270447e13f3c9dd428c69/datafusion/substrait/src/logical_plan/producer.rs#L465-L475
   
   If it can be used to convert the filter, which itself can contain 
expressions referencing either side of the join, then it should be possible to 
use that schema to convert the expressions in the join condition as well. Based 
on this, I removed the column offset code.
   
   As I understand it, something like
   ```sql
   SELECT *
   FROM foo
     JOIN foo ON id = id
   ```
   where `foo` has an `id` column, is rejected with
   ```
   Schema error: Ambiguous reference to unqualified field table_name
   ```
   
   If we qualify both side
   ```sql
   SELECT *
   FROM foo
     JOIN foo ON l.id = r.id
   ```
   it works fine. If DataFusion rejects queries where the column name 
references is ambiguous, it should be possible to look up the column in the 
combined schema generally.
   
   You're work in https://github.com/apache/datafusion/pull/11049 made it 
possible to read plans where both sides of the join had columns with the same 
name, which would otherwise fail. That probably affected the testing code, but 
not the producer behaviour.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(substrait): modular substrait producer [datafusion]

Reply via email to