albel727 opened a new issue, #4667:
URL: https://github.com/apache/arrow-datafusion/issues/4667

   The title says it all. But in other words: 
   if a table constructed with `UNION [DISTINCT]` is used as a subtable for a 
`SELECT` statement,
   and not all columns of the table are mentioned in the select, the output is 
incorrectly 
   deduplicated while considering only the mentioned columns, instead of all 
columns of the subtable.
   
   ## To Reproduce
   Consider a simple table, constructed with a UNION statement:
   ```
   +----+------+
   | id | data |
   +----+------+
   | 1  | 10   |
   | 1  | 20   |
   +----+------+
   ```
   Let's select the `id` column from it:
   ```rust
   let ctx = SessionContext::new();
   let df = ctx.sql("SELECT id FROM (SELECT 1 as id, 10 as data UNION SELECT 1, 
20)").await?;
   df.show().await?;
   ```
   ## Expected output
   ```
   +----+
   | id |
   +----+
   | 1  |
   | 1  |
   +----+
   ```
   ## Actual output
   ```
   +----+
   | id |
   +----+
   | 1  |
   +----+
   ```
   ## Additional context
   Of course, in this particular example this wrong behavior could be worked 
around by using `UNION ALL`, but this isn't always possible.
   E.g. the following query, which detects rows that have same ids but 
differing data in other columns, can't be (easily) rewritten in terms of `UNION 
ALL`:
   ```rust
   let df = ctx.sql("SELECT id FROM (SELECT * FROM tbl1 UNION SELECT * FROM 
tbl2) GROUP BY id HAVING count(*) = 2").await?;
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to