NGA-TRAN commented on code in PR #17478: URL: https://github.com/apache/datafusion/pull/17478#discussion_r2356371040
########## datafusion/expr/src/logical_plan/builder.rs: ########## @@ -2787,4 +2765,33 @@ mod tests { Ok(()) } + + #[test] + fn test_unique_field_aliases() { + let t1_field_1 = Field::new("a", DataType::Int32, false); + let t2_field_1 = Field::new("a", DataType::Int32, false); + let t2_field_3 = Field::new("a", DataType::Int32, false); + let t2_field_4 = Field::new("a:1", DataType::Int32, false); + let t1_field_2 = Field::new("b", DataType::Int32, false); + let t2_field_2 = Field::new("b", DataType::Int32, false); + + let fields = vec![ + t1_field_1, t2_field_1, t1_field_2, t2_field_2, t2_field_3, t2_field_4, + ]; + let fields = Fields::from(fields); + + let remove_redundant = unique_field_aliases(&fields); + + assert_eq!( + remove_redundant, + vec![ + None, + Some("a:1".to_string()), + None, + Some("b:1".to_string()), + Some("a:2".to_string()), + Some("a:1:1".to_string()), + ] Review Comment: `a, a, b, b, a, a:1` becomes `_, a:1, _, b:1, a:2, a:1:1`? Does it mean we keep the last of the same contiguous values and add 1, 2, 3, ... into the same but non contiguous? Can you add this comment on top of the test? ########## datafusion/expr/src/logical_plan/builder.rs: ########## @@ -1517,37 +1518,42 @@ impl ValuesFields { } } -// `name_map` tracks a mapping between a field name and the number of appearances of that field. -// -// Some field names might already come to this function with the count (number of times it appeared) -// as a suffix e.g. id:1, so there's still a chance of name collisions, for example, -// if these three fields passed to this function: "col:1", "col" and "col", the function -// would rename them to -> col:1, col, col:1 causing a posteriror error when building the DFSchema. -// that's why we need the `seen` set, so the fields are always unique. -// -pub fn change_redundant_column(fields: &Fields) -> Vec<Field> { - let mut name_map = HashMap::new(); - let mut seen: HashSet<String> = HashSet::new(); +// Return a list of aliases so that if applied to `fields` it would result in a unique name for each +// column, regardless of qualification. The returned vector length is equal to the number of fields +// as input and will optionally contain the alias that needs to be assigned to the column in the +// same position in order to maintain the uniqueness property. Review Comment: Can you add example in this comment? Something like you use in the test below: `a, a, b, b, a, a:1` becomes `_, a:1, _, b:1, a:2, a:1:1` It is also good to explain why you do this. Maybe another more general example for it? ########## datafusion/core/src/physical_planner.rs: ########## @@ -3562,4 +3448,44 @@ digraph { Ok(()) } + + #[tokio::test] + async fn subquery_alias_confusing_the_optimizer() -> Result<()> { Review Comment: ```suggestion // Reproducer for DataFusion issue #17405: // // The following SQL is semantically invalid. Notably, the `SELECT left_table.a, right_table.a` clause is missing from the explicit logical plan: // // SELECT a FROM ( // -- SELECT left_table.a, right_table.a // FROM left_table // FULL JOIN right_table ON left_table.a = right_table.a // ) AS alias // GROUP BY a; // // As a result, the variables within `alias` subquery are not properly distinguished, which leads to a bug when converting the logical plan into a physical plan. // // The proposed fix is to implicitly insert a ProjectionExec node to represent the missing SELECT clause. async fn subquery_alias_confusing_the_optimizer() -> Result<()> { ``` ########## datafusion/core/src/physical_planner.rs: ########## @@ -3562,4 +3448,44 @@ digraph { Ok(()) } + + #[tokio::test] Review Comment: For other reviewers, this is a reproducer of the bug. I have run this test before the fix and it fails. Great job to reproduce this with logical and physical plan @notfilippo. I suggest you add the below comment so we know what the bug is -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org