Would the Calcite community be ok with having a new function called 'ANY_VALUE()' which is of type SqlAggFunction ? The usage of this would be with a GROUP BY, similar to what Gian described for Druid but with the difference that the keyword any_value makes it explicit that there is no ordering requirement. If someone really needs ordering, then there is always the fallback of the ROW_NUMBER window function syntax.
-Aman On Wed, Mar 7, 2018 at 3:50 PM, Gian Merlino <g...@imply.io> wrote: > In Druid I was hoping to eventually deal with this using FIRST and LAST > aggregators. So the original query would be: > > SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a > > However, we have only implemented FIRST and LAST for numeric types, not for > strings or arrays yet. And we also haven't wired it into Druid SQL yet > (they are available in Druid's native query language only). Also in the > Druid the FIRST/LAST aggregators pick by first/last values by timestamp, > not by the ORDER BY clause of the query in general. > > The idea would be you use them either when you really do want the > first/last value to be seen (by time), or when you just want _any_ value > and you don't care which one. > > Gian > > On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <jh...@apache.org> wrote: > > > I always wished there was a variant of LIMIT that operated per key. E.g. > > return the youngest 2 employees in each department: > > > > SELECT * > > FROM Amp > > ORDER BY deptno LIMIT 2, birthdate DESC > > > > (This is not standard syntax.) > > > > DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with > > SQL evaluation order the way DISTINCT ON does. > > > > Also, per-key limit is something that a sort-limit physical operator > could > > usefully and efficiently do. > > > > > On Mar 4, 2018, at 8:09 PM, Aman Sinha <amansi...@apache.org> wrote: > > > > > > So far, I haven't had much success.. other databases such as Oracle, > SQL > > > Server and DB2 don't seem to support this at least based on their > > published > > > SQL reference guides. > > > > > > One way people work around this is to use ROW_NUMBER() window function > to > > > achieve similar results. > > > e.g > > > SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION > BY > > > a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1; > > > > > > Regarding the aggregate functions allowing arbitrary values in a > group, I > > > am not aware of databases that do that. I am not sure how to verify > this > > > though. > > > > > > -Aman > > > > > > On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <amansi...@apache.org> > > wrote: > > > > > >> I will try to find out more about this during the next few > days....what > > do > > >> Oracle, SQL Server, DB2 support. > > >> > > >> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <jh...@apache.org> > wrote: > > >> > > >>> Can you do some research, and see if any other databases do anything > > >>> similar? Since Postgres isn’t standard, maybe we can improve upon it > a > > bit. > > >>> > > >>> Also, do any databases have an aggregate function that takes an > > arbitrary > > >>> value in the group, or asserts that all values are the same? We would > > use > > >>> that when we translate the syntactic sugar to algebra. > > >>> > > >>> Julian > > >>> > > >>> > > >>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <amansi...@apache.org> > wrote: > > >>>> > > >>>> It is a useful functionality, especially since the SELECT list can > > >>> contain > > >>>> mix of primitive types and array or map type columns. > > >>>> I do see your point about the Postgres semantics of ORDER BY vs > > DISTINCT > > >>>> ON. > > >>>> I don't like the hidden semantics of MySQL either. > > >>>> Perhaps the compromise solution is to introduce functions such as > > >>>> FIRST_ROW() aggregation function (similar to lead/lag window > > functions) > > >>> and > > >>>> use GROUP BY. > > >>>> > > >>>> -Aman > > >>>> > > >>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <jh...@apache.org> > > wrote: > > >>>> > > >>>>> I can see that it would be useful. But there are a couple of things > > >>> about > > >>>>> it that are messy. > > >>>>> > > >>>>> I think the syntax is a bit clunky because it uses parentheses; a > > query > > >>>>> would be difficult to read if people would like multiple columns, > > >>>>> expressions, and aliases. > > >>>>> > > >>>>> I also think the semantics are messy. If you read > > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html# > > >>> SQL-DISTINCT < > > >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html# > > >>> SQL-DISTINCT> > > >>>>> you will see that “DISTINCT ON” is evaluated after the ORDER BY > > clause, > > >>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause. > So > > >>> there > > >>>>> will be a bizarre interaction if DISTINCT ON is used with UNION and > > >>> ORDER > > >>>>> BY. > > >>>>> > > >>>>> By the way, good ol’ MySQL doesn’t have this problem; you can just > > >>> write > > >>>>> > > >>>>> SELECT a, b, c FROM t GROUP BY a > > >>>>> > > >>>>> and it just picks the first value of b and c. This “feature” is one > > of > > >>> my > > >>>>> least favorite things about MySQL, so let’s not emulate it. :) > > >>>>> > > >>>>> Julian > > >>>>> > > >>>>> > > >>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <amansi...@apache.org> > > >>> wrote: > > >>>>>> > > >>>>>> The DISTINCT 'ON' clause is not supported but I am trying to see > if > > >>> there > > >>>>>> are other people who have run into this. One of the use cases I > > have > > >>>>>> intuitively maps to something like this: > > >>>>>> > > >>>>>> SELECT DISTINCT *ON (a)*, b, c FROM T > > >>>>>> > > >>>>>> Here suppose 'a' is an INT and b, c are some complex types such > as > > >>>>>> array. In my example, the values of b and c happen to be the > same > > in > > >>>>> all > > >>>>>> rows belonging to the same group of 'a', so I just want the first > > row. > > >>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a . Other > > >>>>>> alternative is to create a new aggregate function that picks the > > first > > >>>>>> value but it makes the syntax verbose especially with large number > > of > > >>>>> such > > >>>>>> columns. > > >>>>>> > > >>>>>> Incidentally, Postgres supports this [1] > > >>>>>> > > >>>>>> [1] > > >>>>>> https://www.postgresql.org/docs/current/static/sql- > > >>>>> select.html#SQL-DISTINCT > > >>>>>> > > >>>>>> > > >>>>>> -Aman > > >>>>> > > >>>>> > > >>> > > >>> > > >> > > > > >