vbarua commented on code in PR #13931: URL: https://github.com/apache/datafusion/pull/13931#discussion_r1898697426
########## datafusion/substrait/src/logical_plan/producer.rs: ########## @@ -730,32 +1035,23 @@ fn to_substrait_named_struct(schema: &DFSchemaRef) -> Result<NamedStruct> { } fn to_substrait_join_expr( - state: &dyn SubstraitPlanningState, + producer: &mut impl SubstraitProducer, join_conditions: &Vec<(Expr, Expr)>, eq_op: Operator, - left_schema: &DFSchemaRef, - right_schema: &DFSchemaRef, - extensions: &mut Extensions, + join_schema: &DFSchemaRef, ) -> Result<Option<Expression>> { // Only support AND conjunction for each binary expression in join conditions let mut exprs: Vec<Expression> = vec![]; for (left, right) in join_conditions { - // Parse left - let l = to_substrait_rex(state, left, left_schema, 0, extensions)?; - // Parse right - let r = to_substrait_rex( - state, - right, - right_schema, - left_schema.fields().len(), // offset to return the correct index - extensions, - )?; + let l = producer.consume_expr(left, join_schema)?; + let r = producer.consume_expr(right, join_schema)?; Review Comment: We no longer need to track the column offset explicitly. The column offset code was added as part of https://github.com/apache/datafusion/pull/6135 to handle queries like ```sql SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.e ``` which caused issue because the left and right inputs both had the same name. This could potentially cause column name references in DataFusion to converted incorrectly into Substrait column indices in some cases. Additionally, there were issues with duplicate schema errors. However, the introduction and usage of https://github.com/apache/datafusion/blob/a08dc0af8f798afe73a2fdf170ab36e72ad7e782/datafusion/substrait/src/logical_plan/consumer.rs#L1772-L1777 which was needed because different tables might have columns with the same name, now means that we can concatenate the left and right schemas of a join without issues. Then if the DataFusion column references are unambiguous, we can simply look up the columns in the join schema to get the correct index. This was the only place were the column offset was used. Removing this here allowed me to remove the `col_ref_offset` argument from a number of functions, which IMO simplifies the API substantially. For further verification, a test has been added for in in `roundtrip_logical_plan.rs` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org