Jefffrey commented on issue #3353: URL: https://github.com/apache/datafusion/issues/3353#issuecomment-3602057897
Was poking around this a bit, some notes. **Updated context** Test is now located here: https://github.com/apache/datafusion/blob/73562e8ea14efac5ed6827dddb37362d7019266e/datafusion/sqllogictest/test_files/aggregate.slt#L1418-L1423 There is also an expected error test here: https://github.com/apache/datafusion/blob/73562e8ea14efac5ed6827dddb37362d7019266e/datafusion/sqllogictest/test_files/aggregate.slt#L142-L144 **Cause** In the SQL planner we find the aggregate expressions here: https://github.com/apache/datafusion/blob/c89cb70781dd3e8c52f7cbfff51c2b850900a132/datafusion/sql/src/select.rs#L236-L255 See `find_aggregate_exprs()`, which essentially drops the aliases as mentioned: https://github.com/apache/datafusion/blob/c89cb70781dd3e8c52f7cbfff51c2b850900a132/datafusion/expr/src/utils.rs#L608-L615 We could alter this to preserve alises like so: ```diff pub fn find_aggregate_exprs<'a>(exprs: impl IntoIterator<Item = &'a Expr>) -> Vec<Expr> { find_exprs_in_exprs(exprs, &|nested_expr| { - matches!(nested_expr, Expr::AggregateFunction { .. }) + let is_agg = matches!(nested_expr, Expr::AggregateFunction { .. }); + let is_alias_agg = if let Expr::Alias(Alias { expr, .. }) = nested_expr { + matches!(**expr, Expr::AggregateFunction { .. }) + } else { + false + }; + is_agg || is_alias_agg }) } ``` Which technically fixes this query, however has a knock-on effect, for example breaking a few queries in the tpcds_planning suite. - There's also some changes to plans since it allows the alias to stay within an aggregate node instead of being a separate projection node Something interesting is that we can replicate this query via the DataFrame API which allows this query; this makes sense since the bug seems to be in the SQL planning code. See: ```rust let ctx = SessionContext::new(); let testdata = datafusion::test_util::arrow_test_data(); ctx.register_csv( "aggregate_test_100", format!("{testdata}/csv/aggregate_test_100.csv"), CsvReadOptions::default(), ) .await?; println!("DF api"); ctx.table("aggregate_test_100") .await? .aggregate( vec![], vec![ approx_distinct(col("c9")).alias("count_c9"), approx_distinct(cast(col("c9"), DataType::Utf8View)) .alias("count_c9_str"), ], )? .explain(false, false)? .show() .await?; println!("SQL api"); let query = "SELECT approx_distinct(c9) AS count_c9, approx_distinct(cast(c9 as varchar)) count_c9_str FROM aggregate_test_100"; ctx.sql(query).await?.explain(false, false)?.show().await?; ``` Where the output (on current main) is: ``` DF api +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[]], aggr=[[approx_distinct(aggregate_test_100.c9) AS count_c9, approx_distinct(CAST(aggregate_test_100.c9 AS Utf8View)) AS count_c9_str]] | | | TableScan: aggregate_test_100 projection=[c9] | | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count_c9, count_c9_str] | | | CoalescePartitionsExec | | | AggregateExec: mode=Partial, gby=[], aggr=[count_c9, count_c9_str] | | | RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1 | | | DataSourceExec: file_groups={1 group: [[Users/jeffrey/Code/datafusion/testing/data/csv/aggregate_test_100.csv]]}, projection=[c9], file_type=csv, has_header=true | | | | +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ SQL api Error: SchemaError(DuplicateUnqualifiedField { name: "approx_distinct(aggregate_test_100.c9)" }, Some("")) ``` So I wonder if we should try fix SQL API to be permissive like DataFrame API (preserving aliases in aggregations) but that seems to require larger scale fixes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
