alex-natzka opened a new issue #1854:
URL: https://github.com/apache/arrow-datafusion/issues/1854
**Describe the bug**
When the GROUP BY statement of a SELECT query contains two different column
names with the same unqualified name (e.g. `table1.col1` and `table2.col1`),
the aggregation result is wrong: one of the columns is not used to create
groups; the groups from the other column get copied into the column that was
not used.
This happens e.g. when aggregating over the results of a JOIN. A minimal
example can easily be constructed by cross joining a table with itself:
`SELECT a.col1, b.col1, SUM(a.col2) FROM mytable AS a CROSS JOIN mytable AS
b GROUP BY a.col1, b.col1`
When running this SQL query, the `DefaultPhysicalPlanner` will map this to
an `ExecutionPlan` that does not group by `b.col1`, and in the output `b.col1`
will simply be a copy of `a.col1`.
**To Reproduce**
Steps to reproduce the behavior:
Add the following test to `datafusion/tests/sql/aggregates.rs`:
```rust
#[tokio::test]
async fn csv_query_sum_crossjoin() {
let mut ctx = ExecutionContext::new();
register_aggregate_csv_by_sql(&mut ctx).await;
let sql = "SELECT a.c1, b.c1, SUM(a.c2) FROM aggregate_test_100 as a
CROSS JOIN aggregate_test_100 as b GROUP BY a.c1, b.c1";
let actual = execute_to_batches(&mut ctx, sql).await;
let expected = vec![
"+----+----+-----------+",
"| c1 | c1 | SUM(a.c2) |",
"+----+----+-----------+",
"| a | a | 1260 |",
"| a | b | 1140 |",
"| a | c | 1260 |",
"| a | d | 1080 |",
"| a | e | 1260 |",
"| b | a | 1302 |",
"| b | b | 1178 |",
"| b | c | 1302 |",
"| b | d | 1116 |",
"| b | e | 1302 |",
"| c | a | 1176 |",
"| c | b | 1064 |",
"| c | c | 1176 |",
"| c | d | 1008 |",
"| c | e | 1176 |",
"| d | a | 924 |",
"| d | b | 836 |",
"| d | c | 924 |",
"| d | d | 792 |",
"| d | e | 924 |",
"| e | a | 1323 |",
"| e | b | 1197 |",
"| e | c | 1323 |",
"| e | d | 1134 |",
"| e | e | 1323 |",
"+----+----+-----------+",
];
assert_batches_eq!(expected, &actual);
}
```
Instead of the expected output, we obtain only 5 rows, corresponding to the
groups `a,b,c,d,e`:
```text
"+----+----+-----------+",
"| c1 | c1 | SUM(a.c2) |",
"+----+----+-----------+",
"| a | a | 6000 |",
"| b | b | 6200 |",
"| c | c | 5600 |",
"| d | d | 4400 |",
"| e | e | 6300 |",
"+----+----+-----------+",
```
**Expected behavior**
Aggregation output should be correct when grouping by different columns that
share one unqualified name.
**Additional context**
I tracked the bug down to a problem in `DefaultPhysicalPlanner`:
when planning a `LogicalPlan::Aggregate`, two `HashAggregateExec`s are
created, a partial aggregation and then a final aggregation. In [this piece of
code](https://github.com/apache/arrow-datafusion/blob/67fe623bd5882d0ce567e3aa3145a96fdf563443/datafusion/src/physical_plan/planner.rs#L526-L529),
the input column indices for the second `HashAggregateExec` are computed by
using the `col` function to find the columns from `groups` in the output schema
of the _first_ `HaschAggregateExec`.
`col` uses the column's _unqualified_ name and apparently returns the first
occurence of a column with that name. If two column names share their
unqualified names, `col` will return a `Column` with the same index for both of
them. This results in the second `HashAggregateExec` using the same column
twice instead of two distinct columns in its `group_exprs`.
I am preparing a PR with a fix. It is already implemented on my fork and can
be found here:
https://github.com/alex-natzka/arrow-datafusion/tree/fix_agg_same_column_names
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]