Re: DISTINCT ON clause

Chris Baynes Mon, 09 Apr 2018 09:48:08 -0700

The concept of a LIMIT per key as Julian suggested would actually be really
useful for a couple of reasons:


1. It's a very common query in olap scenarios, and it would be nice to have
a simpler way than window functions to express it in sql (and for us
through the RelBuilder)
2. Not all databases support window functions but perhaps something
similar, and in those cases it's really difficult to push-down

Examples:
a) Clickhouse (a jdbc sql database): doesn't support window funcs, but has
their own syntax for limit per key [1]:
// top 5 products for each country by sales
SELECT country, product_id, sum(sales) s FROM T GROUP BY country,
product_id ORDER BY s LIMIT 5 BY country
b) Druid: right now the druid topN only supports single dimension, but if
it ever supported more nesting it would be very difficult to access that
functionality through the sql interface

If there were a way to express LIMIT BY this would make things much easier.

[1]
http://clickhouse-docs.readthedocs.io/en/latest/query_language/queries.html#limit-n-by-modifier

- Chris

On Thu, Apr 5, 2018 at 12:33 AM, Aman Sinha <amansi...@apache.org> wrote:

> Would the Calcite community be ok with having a new function called
> 'ANY_VALUE()' which is of type SqlAggFunction ?
> The usage of this would be with a GROUP BY,  similar to what Gian described
> for Druid but with the difference that the
> keyword any_value makes it explicit that there is no ordering requirement.
> If someone really needs ordering, then there is always the
> fallback of the ROW_NUMBER window function syntax.
>
> -Aman
>
>
> On Wed, Mar 7, 2018 at 3:50 PM, Gian Merlino <g...@imply.io> wrote:
>
> > In Druid I was hoping to eventually deal with this using FIRST and LAST
> > aggregators. So the original query would be:
> >
> >   SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a
> >
> > However, we have only implemented FIRST and LAST for numeric types, not
> for
> > strings or arrays yet. And we also haven't wired it into Druid SQL yet
> > (they are available in Druid's native query language only). Also in the
> > Druid the FIRST/LAST aggregators pick by first/last values by timestamp,
> > not by the ORDER BY clause of the query in general.
> >
> > The idea would be you use them either when you really do want the
> > first/last value to be seen (by time), or when you just want _any_ value
> > and you don't care which one.
> >
> > Gian
> >
> > On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> > > I always wished there was a variant of LIMIT that operated per key.
> E.g.
> > > return the youngest 2 employees in each department:
> > >
> > > SELECT *
> > > FROM Amp
> > > ORDER BY deptno LIMIT 2, birthdate DESC
> > >
> > > (This is not standard syntax.)
> > >
> > > DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess
> with
> > > SQL evaluation order the way DISTINCT ON does.
> > >
> > > Also, per-key limit is something that a sort-limit physical operator
> > could
> > > usefully and efficiently do.
> > >
> > > > On Mar 4, 2018, at 8:09 PM, Aman Sinha <amansi...@apache.org> wrote:
> > > >
> > > > So far, I haven't had much success.. other databases such as Oracle,
> > SQL
> > > > Server and DB2 don't seem to support this at least based on their
> > > published
> > > > SQL reference guides.
> > > >
> > > > One way people work around this is to use ROW_NUMBER() window
> function
> > to
> > > > achieve similar results.
> > > > e.g
> > > >     SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER
> (PARTITION
> > BY
> > > > a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1;
> > > >
> > > > Regarding the aggregate functions allowing arbitrary values in a
> > group, I
> > > > am not aware of databases that do that.  I am not sure how to verify
> > this
> > > > though.
> > > >
> > > > -Aman
> > > >
> > > > On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <amansi...@apache.org>
> > > wrote:
> > > >
> > > >> I will try to find out more about this during the next few
> > days....what
> > > do
> > > >> Oracle, SQL Server, DB2 support.
> > > >>
> > > >> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <jh...@apache.org>
> > wrote:
> > > >>
> > > >>> Can you do some research, and see if any other databases do
> anything
> > > >>> similar? Since Postgres isn’t standard, maybe we can improve upon
> it
> > a
> > > bit.
> > > >>>
> > > >>> Also, do any databases have an aggregate function that takes an
> > > arbitrary
> > > >>> value in the group, or asserts that all values are the same? We
> would
> > > use
> > > >>> that when we translate the syntactic sugar to algebra.
> > > >>>
> > > >>> Julian
> > > >>>
> > > >>>
> > > >>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <amansi...@apache.org>
> > wrote:
> > > >>>>
> > > >>>> It is a useful functionality, especially since the SELECT list can
> > > >>> contain
> > > >>>> mix of primitive types and array or map type columns.
> > > >>>> I do see your point about the Postgres semantics of ORDER BY vs
> > > DISTINCT
> > > >>>> ON.
> > > >>>> I don't like the hidden semantics of MySQL either.
> > > >>>> Perhaps the compromise solution is to introduce functions such as
> > > >>>> FIRST_ROW() aggregation function (similar to lead/lag window
> > > functions)
> > > >>> and
> > > >>>> use GROUP BY.
> > > >>>>
> > > >>>> -Aman
> > > >>>>
> > > >>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <jh...@apache.org>
> > > wrote:
> > > >>>>
> > > >>>>> I can see that it would be useful. But there are a couple of
> things
> > > >>> about
> > > >>>>> it that are messy.
> > > >>>>>
> > > >>>>> I think the syntax is a bit clunky because it uses parentheses; a
> > > query
> > > >>>>> would be difficult to read if people would like multiple columns,
> > > >>>>> expressions, and aliases.
> > > >>>>>
> > > >>>>> I also think the semantics are messy. If you read
> > > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> > > >>> SQL-DISTINCT <
> > > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> > > >>> SQL-DISTINCT>
> > > >>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY
> > > clause,
> > > >>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause.
> > So
> > > >>> there
> > > >>>>> will be a bizarre interaction if DISTINCT ON is used with UNION
> and
> > > >>> ORDER
> > > >>>>> BY.
> > > >>>>>
> > > >>>>> By the way, good ol’ MySQL doesn’t have this problem; you can
> just
> > > >>> write
> > > >>>>>
> > > >>>>> SELECT a, b, c FROM t GROUP BY a
> > > >>>>>
> > > >>>>> and it just picks the first value of b and c. This “feature” is
> one
> > > of
> > > >>> my
> > > >>>>> least favorite things about MySQL, so let’s not emulate it. :)
> > > >>>>>
> > > >>>>> Julian
> > > >>>>>
> > > >>>>>
> > > >>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <amansi...@apache.org>
> > > >>> wrote:
> > > >>>>>>
> > > >>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see
> > if
> > > >>> there
> > > >>>>>> are other people who have run into this.  One of the use cases I
> > > have
> > > >>>>>> intuitively maps to something like this:
> > > >>>>>>
> > > >>>>>> SELECT DISTINCT *ON (a)*,  b, c  FROM T
> > > >>>>>>
> > > >>>>>> Here  suppose 'a' is an INT  and b, c are some complex types
> such
> > as
> > > >>>>>> array.  In my example, the values of  b and c happen to be the
> > same
> > > in
> > > >>>>> all
> > > >>>>>> rows belonging to the same group of 'a', so I just want the
> first
> > > row.
> > > >>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a .   Other
> > > >>>>>> alternative is to create a new aggregate function that picks the
> > > first
> > > >>>>>> value but it makes the syntax verbose especially with large
> number
> > > of
> > > >>>>> such
> > > >>>>>> columns.
> > > >>>>>>
> > > >>>>>> Incidentally, Postgres supports this [1]
> > > >>>>>>
> > > >>>>>> [1]
> > > >>>>>> https://www.postgresql.org/docs/current/static/sql-
> > > >>>>> select.html#SQL-DISTINCT
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> -Aman
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>

Re: DISTINCT ON clause

Reply via email to