vbarua commented on code in PR #13931:
URL: https://github.com/apache/datafusion/pull/13931#discussion_r1899786865
##########
datafusion/substrait/src/logical_plan/producer.rs:
##########
@@ -730,32 +1035,23 @@ fn to_substrait_named_struct(schema: &DFSchemaRef) ->
Result<NamedStruct> {
}
fn to_substrait_join_expr(
- state: &dyn SubstraitPlanningState,
+ producer: &mut impl SubstraitProducer,
join_conditions: &Vec<(Expr, Expr)>,
eq_op: Operator,
- left_schema: &DFSchemaRef,
- right_schema: &DFSchemaRef,
- extensions: &mut Extensions,
+ join_schema: &DFSchemaRef,
) -> Result<Option<Expression>> {
// Only support AND conjunction for each binary expression in join
conditions
let mut exprs: Vec<Expression> = vec![];
for (left, right) in join_conditions {
- // Parse left
- let l = to_substrait_rex(state, left, left_schema, 0, extensions)?;
- // Parse right
- let r = to_substrait_rex(
- state,
- right,
- right_schema,
- left_schema.fields().len(), // offset to return the correct index
- extensions,
- )?;
+ let l = producer.consume_expr(left, join_schema)?;
+ let r = producer.consume_expr(right, join_schema)?;
Review Comment:
> Does this change affect the produced substrait plan?
Checking with the test I added, it does not actually.
Taking a step back as well. When I was updating this code I was curious why
the code in https://github.com/apache/datafusion/pull/6135 was added, and I may
have misunderstood what it was trying to fix.
What I did notice when refactoring this code was that we were already
concatenating the schemas and then using that to convert the filter on the join
https://github.com/apache/datafusion/blob/9b5995fa024d95c19e1270447e13f3c9dd428c69/datafusion/substrait/src/logical_plan/producer.rs#L465-L475
If it can be used to convert the filter, which itself can contain
expressions referencing either side of the join, then it should be possible to
use that schema to convert the expressions in the join condition as well. Based
on this, I removed the column offset code.
As I understand it, something like
```sql
SELECT *
FROM foo
JOIN foo ON id = id
```
where `foo` has an `id` column, is rejected with
```
Schema error: Ambiguous reference to unqualified field table_name
```
If we qualify both side
```sql
SELECT *
FROM foo
JOIN foo ON l.id = r.id
```
it works fine. If DataFusion rejects queries where the column name
references is ambiguous, it should be possible to look up the column in the
combined schema generally.
You're work in https://github.com/apache/datafusion/pull/11049 made it
possible to read plans where both sides of the join had columns with the same
name, which would otherwise fail. That probably affected the testing code, but
not the producer behaviour.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]