Two-stage aggregation functions

2015-04-16 Thread Marcin Karpinski
Hi, I've recently been looking into writing a custom UDAF, and I noticed that the function interface supports only sequential aggregation of all results into a single final value. While the COUNT operator is internally planned as a composition of two aggregation stages, other aggregati
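
To illustrate what a two-stage aggregation looks like in general, here is a minimal sketch in Python. It is not Drill's UDAF interface; the names `AvgState`, `partial_aggregate` and `final_aggregate` are hypothetical and exist only for this example. Each fragment of the data is first reduced to a small intermediate state, and a second stage merges those states into the final value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AvgState:
    """Intermediate state for an average (hypothetical, not a Drill type)."""
    total: float = 0.0
    count: int = 0


def partial_aggregate(values: Iterable[float]) -> AvgState:
    """Stage 1: runs independently on each fragment of the data."""
    state = AvgState()
    for v in values:
        state.total += v
        state.count += 1
    return state


def final_aggregate(states: Iterable[AvgState]) -> Optional[float]:
    """Stage 2: merges the partial states and produces the final value."""
    merged = AvgState()
    for s in states:
        merged.total += s.total
        merged.count += s.count
    return merged.total / merged.count if merged.count else None


# Three fragments that could be aggregated in parallel.
fragments = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
partials = [partial_aggregate(f) for f in fragments]
result = final_aggregate(partials)
```

A two-stage COUNT has the same shape: the partial state is a per-fragment count, and the final stage sums those counts.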

Re: Counting large numbers of unique values

2015-04-11 Thread Marcin Karpinski
uld easily add something onto
> > > > Drill like this, it'd be hugely beneficial.
> > > >
> > > > On Wed, Apr 8, 2015 at 8:25 AM, Ted Dunning wrote:
> > > >
> > > > > Marcin,

Re: Counting large numbers of unique values

2015-04-08 Thread Marcin Karpinski
t that your data is already known to be sorted and
> > thus the sort step should be omitted?
> >
> > On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski wrote:
> >
> > > @Jacques, thanks for the information - I'm definitely going to check out

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
wrote:
> >
> > > Drill already does most of this type of transformation. If you do an
> > > 'EXPLAIN PLAN FOR '
> > > you will see that it first does a grouping on the column and then applies
> > > the COUNT(column). The first level grouping can be

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
That would be great - I'm all listening :)

On Tue, Apr 7, 2015 at 7:22 PM, Ted Dunning wrote:
> On Tue, Apr 7, 2015 at 9:19 AM, Marcin Karpinski wrote:
> > @ Ted, ideally, I'd like to get exact results, but in case of real
> > problems, we could perhaps set

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
grouping on the column and then applies
> > the COUNT(column). The first level grouping can be done either based on
> > sorting or hashing and this is configurable through a system option.
> >
> > Aman
> >
> > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
> 'EXPLAIN PLAN FOR '
> you will see that it first does a grouping on the column and then applies
> the COUNT(column). The first level grouping can be done either based on
> sorting or hashing and this is configurable through a system option.
>
> Aman
>
> On Tu
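
The plan shape described in the quoted message (group on the column first, then count the groups) can be sketched as follows. This is only an illustration in Python, not Drill's planner or operators; `group_by_hash`, `group_by_sort` and `count_rows` are hypothetical names. Both grouping strategies emit one row per distinct value, so the COUNT applied on top gives the same answer either way.

```python
from typing import Iterable, Iterator


def group_by_hash(values: Iterable[str]) -> Iterator[str]:
    """Hash-based grouping: remembers every distinct value seen so far."""
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            yield v


def group_by_sort(values: Iterable[str]) -> Iterator[str]:
    """Sort-based grouping: sorts, then emits a row whenever the value changes."""
    previous = object()  # sentinel that compares unequal to any real value
    for v in sorted(values):
        if v != previous:
            yield v
            previous = v


def count_rows(rows: Iterable[str]) -> int:
    """Second level: a plain COUNT over the grouped rows."""
    return sum(1 for _ in rows)


data = ["a", "b", "a", "c", "b"]
distinct_via_hash = count_rows(group_by_hash(data))
distinct_via_sort = count_rows(group_by_sort(data))
```

The hash variant keeps all distinct values in memory, while the sort variant only needs the previous value once the input is ordered.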

Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
Hi Guys, I have a specific use case for Drill, in which I'd like to be able to count unique values in columns with tens of millions of distinct values. Unfortunately, the COUNT DISTINCT method does not scale in either time or memory, and the idea is to sort the data beforehand by the values of that
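
As a rough sketch of why pre-sorting helps, assuming the input really is already sorted by the column being counted: distinct values can then be counted in a single pass that remembers only the previous value. The function name `count_distinct_sorted` is hypothetical, and this is plain Python rather than anything Drill executes.

```python
from typing import Hashable, Iterable


def count_distinct_sorted(values: Iterable[Hashable]) -> int:
    """Counts distinct values in an input that is assumed to be sorted.

    Equal values are adjacent in sorted input, so a new distinct value
    is detected whenever the current value differs from the previous one.
    """
    count = 0
    first = True
    previous = None
    for v in values:
        if first or v != previous:
            count += 1
            previous = v
            first = False
    return count


sorted_column = [1, 1, 2, 3, 3, 3, 7]
distinct = count_distinct_sorted(sorted_column)
```

Memory use stays constant regardless of how many distinct values there are; the result is only correct if the sortedness assumption holds.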