[
https://issues.apache.org/jira/browse/CALCITE-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080186#comment-18080186
]
Yu Xu commented on CALCITE-2366:
--------------------------------
There are still some differences between DISTINCT ON and ANY_VALUE in my view:
1.ANY_VALUE returns any row; which row it returns is uncertain and depends on
the underlying executor implementation.
{code:java}
SELECT deptno, ANY_VALUE(empno), ANY_VALUE(ename)
FROM emp
GROUP BY deptno; {code}
You might get empno=7369 or empno=7499; the result is unpredictable. If you
need the highest-paid or earliest-hired employee, ANY_VALUE simply won't meet
your needs.
The `DISTINCT ON` directive, when used in conjunction with `ORDER BY`, allows
for precise control over which rows are retained.
{code:java}
SELECT DISTINCT ON (deptno) empno, ename, sal
FROM emp
ORDER BY deptno, sal DESC; -- Explicitly retain the highest-paid employee in
each department. {code}
2. When replacing DISTINCT ON with ANY_VALUE, you must: Wrap ANY_VALUE() around
each of the non-grouping columns individually Loss sorting control The code
becomes verbose and ambiguous.
Using `DISTINCT ON`, the declarative syntax immediately reveals that it "groups
by `deptno`, retaining the first row of each sorted group," naturally
integrating into the existing `ORDER BY` semantic system.
3. `DISTINCT ON` performs deduplication on an ordered data stream, therefore:
{code:java}
SELECT DISTINCT ON (deptno) empno, ename
FROM emp
ORDER BY deptno, hiredate DESC
LIMIT 10; {code}
The semantics are: first group by department, then sort by hire date in
descending order within each group, take the first person in each group, and
finally take the top 10 results. Using ANY_VALUE with subqueries makes it
difficult to elegantly express this complex semantics of "sorting,
partitioning, deduplicating, and truncating".
4. Better PostgreSQL ecosystem compatibility: DISTINCT ON is a signature syntax
of PostgreSQL, used by a large number of existing queries, ORM frameworks, and
BI tools. Calcite, as an SQL parsing and optimization infrastructure, supports
this syntax and is directly compatible with these ecosystems, without requiring
users to manually rewrite it as a {*}window function + CTE{*}.
5. After `DISTINCT ON` is converted to `ROW_NUMBER() OVER (PARTITION BY ...
ORDER BY ...) + filter` in the RelOpt layer, Calcite's optimizer can: Utilize
existing window function optimization rules Merge operators with Sort, Filter,
and Project Select more efficient execution strategies (such as Top-N per
partition) during the physical planning phase However, `ANY_VALUE`, as an
aggregation function, has a completely different optimization path and cannot
utilize sorting information.
> Add support for ANY_VALUE function
> ----------------------------------
>
> Key: CALCITE-2366
> URL: https://issues.apache.org/jira/browse/CALCITE-2366
> Project: Calcite
> Issue Type: New Feature
> Affects Versions: 1.16.0
> Reporter: Gautam Parai
> Assignee: Gautam Parai
> Priority: Major
> Fix For: 1.17.0
>
>
> We had discussions on the Apache Calcite [1] and Apache Drill [2] mailing
> lists regarding an equivalent for DISTINCT ON. The community seems to prefer
> the ANY_VALUE. This Jira is a placeholder for implementing the ANY_VALUE
> aggregate function in Apache Calcite.
> [1][https://lists.apache.org/thread.html/f2007a489d3a5741875bcc8a1edd8d5c3715e5114ac45058c3b3a42d@%3Cdev.calcite.apache.org%3E]
> [2][https://lists.apache.org/thread.html/2517eef7410aed4e88b9515f7e4256335215c1ad39a2676a08d21cb9@%3Cdev.drill.apache.org%3E]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)