gruuya commented on issue #7827: URL: https://github.com/apache/arrow-datafusion/issues/7827#issuecomment-1775264634
I'd like to take a shot at this if there're no other takers. For reference, here's a formal description of this clause from the [Postgres docs](https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT): ` SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. ` From what I see, the most promising options to implement this are: 1. Add a new `GroupsAccumulator` that accumulates only a single value based on a given (`ORDER BY`) expression, i.e. something like argmax/argmin. In other words the above functionality could be obtained if one projects the `DISTINCT ON` expressions alongside the selection expressions, and then argmaxes them based on the `ORDER BY` expressions (if none are present the problem reduces to a simple group by). This might enable re-use of some aggregation/grouping logic, and could itself be re-used somewhere else. 2. Implement a custom `DISTINCT ON` logic/physical plan, that would basically do the same thing as option 1 above, but without trying to adhere to existing aggregation/grouping semantics. I think this variant is likely if option 1 turns out to be infeasible for some reason. 3. Based on my (potentially flawed) observation that `DISTINCT ON` is a subset of window expressions, i.e. that ```sql select distinct on (username) username, browser, logged_at from logins order by username, logged_at desc ``` can be re-written as ```sql select distinct first_value(username) over w as username, first_value(browser) over w as browser, first_value(logged_at) over w as logged_at from logins window w as (partition by username order by username, logged_at desc) order by username, logged_at desc ``` this could be used to construct an alternative logical/physical plan representation using the already existing primitives (i.e. window plans). Besides being hacky, I think this is sub-optimal since in the case of window functions the entire table seems to be materialized, and only after that `DISTINCT`d (which is represented via a grouping logical plan after optimizations). Alternatively, the window physical plan could be revised to take into account a limit per window, which solve this problem and perhaps address #6899. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org