Re: [I] Fix `csv_query_approx_count` once binary type coercion PR is merged [datafusion]

via GitHub Tue, 02 Dec 2025 05:27:21 -0800


Jefffrey commented on issue #3353:
URL: https://github.com/apache/datafusion/issues/3353#issuecomment-3602057897


   Was poking around this a bit, some notes.
   
   **Updated context**
   
   Test is now located here:
   
   
https://github.com/apache/datafusion/blob/73562e8ea14efac5ed6827dddb37362d7019266e/datafusion/sqllogictest/test_files/aggregate.slt#L1418-L1423
   
   There is also an expected error test here:
   
   
https://github.com/apache/datafusion/blob/73562e8ea14efac5ed6827dddb37362d7019266e/datafusion/sqllogictest/test_files/aggregate.slt#L142-L144
   
   **Cause**
   
   In the SQL planner we find the aggregate expressions here:
   
   
https://github.com/apache/datafusion/blob/c89cb70781dd3e8c52f7cbfff51c2b850900a132/datafusion/sql/src/select.rs#L236-L255
   
   See `find_aggregate_exprs()`, which essentially drops the aliases as 
mentioned:
   
   
https://github.com/apache/datafusion/blob/c89cb70781dd3e8c52f7cbfff51c2b850900a132/datafusion/expr/src/utils.rs#L608-L615
   
   We could alter this to preserve alises like so:
   
   ```diff
   pub fn find_aggregate_exprs<'a>(exprs: impl IntoIterator<Item = &'a Expr>) 
-> Vec<Expr> {
       find_exprs_in_exprs(exprs, &|nested_expr| {
   -        matches!(nested_expr, Expr::AggregateFunction { .. })
   +        let is_agg = matches!(nested_expr, Expr::AggregateFunction { .. });
   +        let is_alias_agg = if let Expr::Alias(Alias { expr, .. }) = 
nested_expr {
   +            matches!(**expr, Expr::AggregateFunction { .. })
   +        } else {
   +            false
   +        };
   +        is_agg || is_alias_agg
       })
   }
   ```
   
   Which technically fixes this query, however has a knock-on effect, for 
example breaking a few queries in the tpcds_planning suite.
   
   - There's also some changes to plans since it allows the alias to stay 
within an aggregate node instead of being a separate projection node
   
   Something interesting is that we can replicate this query via the DataFrame 
API which allows this query; this makes sense since the bug seems to be in the 
SQL planning code. See:
   
   ```rust
       let ctx = SessionContext::new();
   
       let testdata = datafusion::test_util::arrow_test_data();
       ctx.register_csv(
           "aggregate_test_100",
           format!("{testdata}/csv/aggregate_test_100.csv"),
           CsvReadOptions::default(),
       )
       .await?;
   
       println!("DF api");
       ctx.table("aggregate_test_100")
           .await?
           .aggregate(
               vec![],
               vec![
                   approx_distinct(col("c9")).alias("count_c9"),
                   approx_distinct(cast(col("c9"), DataType::Utf8View))
                       .alias("count_c9_str"),
               ],
           )?
           .explain(false, false)?
           .show()
           .await?;
   
       println!("SQL api");
       let query = "SELECT
         approx_distinct(c9) AS count_c9,
         approx_distinct(cast(c9 as varchar)) count_c9_str
       FROM aggregate_test_100";
       ctx.sql(query).await?.explain(false, false)?.show().await?;
   ```
   
   Where the output (on current main) is:
   
   ```
   DF api
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                               |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[]], 
aggr=[[approx_distinct(aggregate_test_100.c9) AS count_c9, 
approx_distinct(CAST(aggregate_test_100.c9 AS Utf8View)) AS count_c9_str]]      
      |
   |               |   TableScan: aggregate_test_100 projection=[c9]            
                                                                                
                               |
   | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count_c9, 
count_c9_str]                                                                   
                                       |
   |               |   CoalescePartitionsExec                                   
                                                                                
                               |
   |               |     AggregateExec: mode=Partial, gby=[], aggr=[count_c9, 
count_c9_str]                                                                   
                                 |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(12), 
input_partitions=1                                                              
                                 |
   |               |         DataSourceExec: file_groups={1 group: 
[[Users/jeffrey/Code/datafusion/testing/data/csv/aggregate_test_100.csv]]}, 
projection=[c9], file_type=csv, has_header=true |
   |               |                                                            
                                                                                
                               |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   SQL api
   Error: SchemaError(DuplicateUnqualifiedField { name: 
"approx_distinct(aggregate_test_100.c9)" }, Some(""))
   ```
   
   So I wonder if we should try fix SQL API to be permissive like DataFrame API 
(preserving aliases in aggregations) but that seems to require larger scale 
fixes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Fix `csv_query_approx_count` once binary type coercion PR is merged [datafusion]

Reply via email to