gruuya commented on issue #7827:
URL: 
https://github.com/apache/arrow-datafusion/issues/7827#issuecomment-1775264634

   I'd like to take a shot at this if there're no other takers. For reference, 
here's a formal description of this clause from the [Postgres 
docs](https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT):
   `
   SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each 
set of rows where the given expressions evaluate to equal. The DISTINCT ON 
expressions are interpreted using the same rules as for ORDER BY (see above). 
Note that the “first row” of each set is unpredictable unless ORDER BY is used 
to ensure that the desired row appears first.
   `
   
   From what I see, the most promising options to implement this are:
   
   1. Add a new `GroupsAccumulator` that accumulates only a single value based 
on a given (`ORDER BY`) expression, i.e. something like argmax/argmin. In other 
words the above functionality could be obtained if one projects the `DISTINCT 
ON` expressions alongside the selection expressions, and then argmaxes them 
based on the `ORDER BY` expressions (if none are present the problem reduces to 
a simple group by). This might enable re-use of some aggregation/grouping 
logic, and could itself be re-used somewhere else.
   2. Implement a custom `DISTINCT ON` logic/physical plan, that would 
basically do the same thing as option 1 above, but without trying to adhere to 
existing aggregation/grouping semantics. I think this variant is likely if 
option 1 turns out to be infeasible for some reason. 
   3. Based on my (potentially flawed) observation that `DISTINCT ON` is a 
subset of window expressions, i.e. that
       ```sql
       select distinct on (username) username, browser, logged_at from logins 
order by username, logged_at desc
       ```
       can be re-written as 
       ```sql
       select distinct 
           first_value(username) over w as username, 
           first_value(browser) over w as browser, 
           first_value(logged_at) over w as logged_at 
       from logins 
       window w as (partition by username order by username, logged_at desc) 
       order by username, logged_at desc
       ```
       this could be used to construct an alternative logical/physical plan 
representation using the already existing primitives (i.e. window plans). 
Besides being hacky, I think this is sub-optimal since in the case of window 
functions the entire table seems to be materialized, and only after that 
`DISTINCT`d (which is represented via a grouping logical plan after 
optimizations). Alternatively, the window physical plan could be revised to 
take into account a limit per window, which solve this problem and perhaps 
address #6899.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to