[GitHub] [arrow-datafusion] alex-natzka opened a new issue #1854: Aggregation incorrect in case that GROUP BY contains duplicate (unqualified) column names

GitBox Thu, 17 Feb 2022 06:55:44 -0800


alex-natzka opened a new issue #1854:
URL: https://github.com/apache/arrow-datafusion/issues/1854



   **Describe the bug**
   When the GROUP BY statement of a SELECT query contains two different column 
names with the same unqualified name (e.g. `table1.col1` and `table2.col1`), 
the aggregation result is wrong: one of the columns is not used to create 
groups; the groups from the other column get copied into the column that was 
not used.
   
   This happens e.g. when aggregating over the results of a JOIN. A minimal 
example can easily be constructed by cross joining a table with itself:
   `SELECT a.col1, b.col1, SUM(a.col2) FROM mytable AS a CROSS JOIN mytable AS 
b GROUP BY a.col1, b.col1`
   When running this SQL query, the `DefaultPhysicalPlanner` will map this to 
an `ExecutionPlan` that does not group by `b.col1`, and in the output `b.col1` 
will simply be a copy of `a.col1`.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   Add the following test to `datafusion/tests/sql/aggregates.rs`:
   
   ```rust
   #[tokio::test]
   async fn csv_query_sum_crossjoin() {
       let mut ctx = ExecutionContext::new();
       register_aggregate_csv_by_sql(&mut ctx).await;
       let sql = "SELECT a.c1, b.c1, SUM(a.c2) FROM aggregate_test_100 as a 
CROSS JOIN aggregate_test_100 as b GROUP BY a.c1, b.c1";
       let actual = execute_to_batches(&mut ctx, sql).await;
       let expected = vec![
           "+----+----+-----------+",
           "| c1 | c1 | SUM(a.c2) |",
           "+----+----+-----------+",
           "| a  | a  | 1260      |",
           "| a  | b  | 1140      |",
           "| a  | c  | 1260      |",
           "| a  | d  | 1080      |",
           "| a  | e  | 1260      |",
           "| b  | a  | 1302      |",
           "| b  | b  | 1178      |",
           "| b  | c  | 1302      |",
           "| b  | d  | 1116      |",
           "| b  | e  | 1302      |",
           "| c  | a  | 1176      |",
           "| c  | b  | 1064      |",
           "| c  | c  | 1176      |",
           "| c  | d  | 1008      |",
           "| c  | e  | 1176      |",
           "| d  | a  | 924       |",
           "| d  | b  | 836       |",
           "| d  | c  | 924       |",
           "| d  | d  | 792       |",
           "| d  | e  | 924       |",
           "| e  | a  | 1323      |",
           "| e  | b  | 1197      |",
           "| e  | c  | 1323      |",
           "| e  | d  | 1134      |",
           "| e  | e  | 1323      |",
           "+----+----+-----------+",
       ];
       assert_batches_eq!(expected, &actual);
   }
   ```
   Instead of the expected output, we obtain only 5 rows, corresponding to the 
groups `a,b,c,d,e`:
   ```text
       "+----+----+-----------+",
       "| c1 | c1 | SUM(a.c2) |",
       "+----+----+-----------+",
       "| a  | a  | 6000      |",
       "| b  | b  | 6200      |",
       "| c  | c  | 5600      |",
       "| d  | d  | 4400      |",
       "| e  | e  | 6300      |",
       "+----+----+-----------+",
   ```
   
   **Expected behavior**
   Aggregation output should be correct when grouping by different columns that 
share one unqualified name.
   
   **Additional context**
   I tracked the bug down to a problem in `DefaultPhysicalPlanner`:
   when planning a `LogicalPlan::Aggregate`, two `HashAggregateExec`s are 
created, a partial aggregation and then a final aggregation. In [this piece of 
code](https://github.com/apache/arrow-datafusion/blob/67fe623bd5882d0ce567e3aa3145a96fdf563443/datafusion/src/physical_plan/planner.rs#L526-L529),
 the input column indices for the second `HashAggregateExec` are computed by 
using the `col` function to find the columns from `groups` in the output schema 
of the _first_ `HaschAggregateExec`.
   `col` uses the column's _unqualified_ name and apparently returns the first 
occurence of a column with that name. If two column names share their 
unqualified names, `col` will return a `Column` with the same index for both of 
them. This results in the second `HashAggregateExec` using the same column 
twice instead of two distinct columns in its `group_exprs`.
   
   I am preparing a PR with a fix. It is already implemented on my fork and can 
be found here: 
https://github.com/alex-natzka/arrow-datafusion/tree/fix_agg_same_column_names


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alex-natzka opened a new issue #1854: Aggregation incorrect in case that GROUP BY contains duplicate (unqualified) column names

Reply via email to