Re: DISTINCT ON clause

Aman Sinha Wed, 04 Apr 2018 15:34:28 -0700

Would the Calcite community be ok with having a new function called
'ANY_VALUE()' which is of type SqlAggFunction ?
The usage of this would be with a GROUP BY,  similar to what Gian described
for Druid but with the difference that the
keyword any_value makes it explicit that there is no ordering requirement.
If someone really needs ordering, then there is always the
fallback of the ROW_NUMBER window function syntax.


-Aman


On Wed, Mar 7, 2018 at 3:50 PM, Gian Merlino <g...@imply.io> wrote:

> In Druid I was hoping to eventually deal with this using FIRST and LAST
> aggregators. So the original query would be:
>
>   SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a
>
> However, we have only implemented FIRST and LAST for numeric types, not for
> strings or arrays yet. And we also haven't wired it into Druid SQL yet
> (they are available in Druid's native query language only). Also in the
> Druid the FIRST/LAST aggregators pick by first/last values by timestamp,
> not by the ORDER BY clause of the query in general.
>
> The idea would be you use them either when you really do want the
> first/last value to be seen (by time), or when you just want _any_ value
> and you don't care which one.
>
> Gian
>
> On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <jh...@apache.org> wrote:
>
> > I always wished there was a variant of LIMIT that operated per key. E.g.
> > return the youngest 2 employees in each department:
> >
> > SELECT *
> > FROM Amp
> > ORDER BY deptno LIMIT 2, birthdate DESC
> >
> > (This is not standard syntax.)
> >
> > DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with
> > SQL evaluation order the way DISTINCT ON does.
> >
> > Also, per-key limit is something that a sort-limit physical operator
> could
> > usefully and efficiently do.
> >
> > > On Mar 4, 2018, at 8:09 PM, Aman Sinha <amansi...@apache.org> wrote:
> > >
> > > So far, I haven't had much success.. other databases such as Oracle,
> SQL
> > > Server and DB2 don't seem to support this at least based on their
> > published
> > > SQL reference guides.
> > >
> > > One way people work around this is to use ROW_NUMBER() window function
> to
> > > achieve similar results.
> > > e.g
> > >     SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION
> BY
> > > a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1;
> > >
> > > Regarding the aggregate functions allowing arbitrary values in a
> group, I
> > > am not aware of databases that do that.  I am not sure how to verify
> this
> > > though.
> > >
> > > -Aman
> > >
> > > On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <amansi...@apache.org>
> > wrote:
> > >
> > >> I will try to find out more about this during the next few
> days....what
> > do
> > >> Oracle, SQL Server, DB2 support.
> > >>
> > >> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <jh...@apache.org>
> wrote:
> > >>
> > >>> Can you do some research, and see if any other databases do anything
> > >>> similar? Since Postgres isn’t standard, maybe we can improve upon it
> a
> > bit.
> > >>>
> > >>> Also, do any databases have an aggregate function that takes an
> > arbitrary
> > >>> value in the group, or asserts that all values are the same? We would
> > use
> > >>> that when we translate the syntactic sugar to algebra.
> > >>>
> > >>> Julian
> > >>>
> > >>>
> > >>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <amansi...@apache.org>
> wrote:
> > >>>>
> > >>>> It is a useful functionality, especially since the SELECT list can
> > >>> contain
> > >>>> mix of primitive types and array or map type columns.
> > >>>> I do see your point about the Postgres semantics of ORDER BY vs
> > DISTINCT
> > >>>> ON.
> > >>>> I don't like the hidden semantics of MySQL either.
> > >>>> Perhaps the compromise solution is to introduce functions such as
> > >>>> FIRST_ROW() aggregation function (similar to lead/lag window
> > functions)
> > >>> and
> > >>>> use GROUP BY.
> > >>>>
> > >>>> -Aman
> > >>>>
> > >>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <jh...@apache.org>
> > wrote:
> > >>>>
> > >>>>> I can see that it would be useful. But there are a couple of things
> > >>> about
> > >>>>> it that are messy.
> > >>>>>
> > >>>>> I think the syntax is a bit clunky because it uses parentheses; a
> > query
> > >>>>> would be difficult to read if people would like multiple columns,
> > >>>>> expressions, and aliases.
> > >>>>>
> > >>>>> I also think the semantics are messy. If you read
> > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> > >>> SQL-DISTINCT <
> > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> > >>> SQL-DISTINCT>
> > >>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY
> > clause,
> > >>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause.
> So
> > >>> there
> > >>>>> will be a bizarre interaction if DISTINCT ON is used with UNION and
> > >>> ORDER
> > >>>>> BY.
> > >>>>>
> > >>>>> By the way, good ol’ MySQL doesn’t have this problem; you can just
> > >>> write
> > >>>>>
> > >>>>> SELECT a, b, c FROM t GROUP BY a
> > >>>>>
> > >>>>> and it just picks the first value of b and c. This “feature” is one
> > of
> > >>> my
> > >>>>> least favorite things about MySQL, so let’s not emulate it. :)
> > >>>>>
> > >>>>> Julian
> > >>>>>
> > >>>>>
> > >>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <amansi...@apache.org>
> > >>> wrote:
> > >>>>>>
> > >>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see
> if
> > >>> there
> > >>>>>> are other people who have run into this.  One of the use cases I
> > have
> > >>>>>> intuitively maps to something like this:
> > >>>>>>
> > >>>>>> SELECT DISTINCT *ON (a)*,  b, c  FROM T
> > >>>>>>
> > >>>>>> Here  suppose 'a' is an INT  and b, c are some complex types such
> as
> > >>>>>> array.  In my example, the values of  b and c happen to be the
> same
> > in
> > >>>>> all
> > >>>>>> rows belonging to the same group of 'a', so I just want the first
> > row.
> > >>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a .   Other
> > >>>>>> alternative is to create a new aggregate function that picks the
> > first
> > >>>>>> value but it makes the syntax verbose especially with large number
> > of
> > >>>>> such
> > >>>>>> columns.
> > >>>>>>
> > >>>>>> Incidentally, Postgres supports this [1]
> > >>>>>>
> > >>>>>> [1]
> > >>>>>> https://www.postgresql.org/docs/current/static/sql-
> > >>>>> select.html#SQL-DISTINCT
> > >>>>>>
> > >>>>>>
> > >>>>>> -Aman
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: DISTINCT ON clause

Reply via email to