albel727 opened a new issue, #4667:
URL: https://github.com/apache/arrow-datafusion/issues/4667
The title says it all. But in other words:
if a table constructed with `UNION [DISTINCT]` is used as a subtable for a
`SELECT` statement,
and not all columns of the table are mentioned in the select, the output is
incorrectly
deduplicated while considering only the mentioned columns, instead of all
columns of the subtable.
## To Reproduce
Consider a simple table, constructed with a UNION statement:
```
+----+------+
| id | data |
+----+------+
| 1 | 10 |
| 1 | 20 |
+----+------+
```
Let's select the `id` column from it:
```rust
let ctx = SessionContext::new();
let df = ctx.sql("SELECT id FROM (SELECT 1 as id, 10 as data UNION SELECT 1,
20)").await?;
df.show().await?;
```
## Expected output
```
+----+
| id |
+----+
| 1 |
| 1 |
+----+
```
## Actual output
```
+----+
| id |
+----+
| 1 |
+----+
```
## Additional context
Of course, in this particular example this wrong behavior could be worked
around by using `UNION ALL`, but this isn't always possible.
E.g. the following query, which detects rows that have same ids but
differing data in other columns, can't be (easily) rewritten in terms of `UNION
ALL`:
```rust
let df = ctx.sql("SELECT id FROM (SELECT * FROM tbl1 UNION SELECT * FROM
tbl2) GROUP BY id HAVING count(*) = 2").await?;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]