Re: Counting large numbers of unique values

Adam Gilmore Wed, 08 Apr 2015 23:39:07 -0700

Ted - I'd be really interested in doing something like that (approximate
aggregation results).  This would be very interesting in terms of standard
deviation, median, etc.


I know there is another project out there that trades off speed vs accuracy
(the name of which escapes me).  If we could easily add something onto
Drill like this, it'd be hugely beneficial.

On Wed, Apr 8, 2015 at 8:25 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Marcin,
>
> They did comment.  The answer is that the default is to use hashed
> aggregation (which will be faster when there is lots of memory) with the
> option to use sort aggregation (which is basically what you were
> suggesting).
>
> Did you mean to suggest that your data is already known to be sorted and
> thus the sort step should be omitted?
>
>
> On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <mkarpin...@opera.com>
> wrote:
>
> > @Jacques, thanks for the information - I'm definitely going to check out
> > that option.
> >
> > I'm also curious that none of you guys commented on my original idea of
> > counting distinct values by a simple aggregation of pre-sorted data - is
> it
> > because it doesn't make sense to you guys, or because you think your
> > suggestions are easier to implement?
> >
> > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org>
> wrote:
> >
> > > Two additional notes here:
> > >
> > > Drill can actually do an aggregation using either a hash table based
> > > aggregation or a sort based aggregation.  By default, generally the
> hash
> > > aggregation will be selected first.  However, you can disable hash
> based
> > > aggregation if you specifically think that a sort based aggregation
> will
> > > perform better for use case.  You can do this by running the command
> > ALTER
> > > SESSION SET `planner.enable_hashagg` = FALSE;
> > >
> > > We have always had it on our roadmap to implement an approximate count
> > > distinct function but haven't gotten to it yet.  As Ted mentions, using
> > > this technique would substantially reduce data shuffling and could be
> > done
> > > with a moderate level of effort since our UDAF interface is pluggable.
> > >
> > >
> > >
> > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> > >
> > > > How precise do your counts need to be?  Can you accept a fraction of
> a
> > > > percent statistical error?
> > > >
> > > >
> > > >
> > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com>
> > wrote:
> > > >
> > > > > Drill already does most of this type of transformation.  If you do
> an
> > > > > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > > > you will see that it first does a grouping on the column and then
> > > applies
> > > > > the COUNT(column).  The first level grouping can be done either
> based
> > > on
> > > > > sorting or hashing and this is configurable through a system
> option.
> > > > >
> > > > > Aman
> > > > >
> > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <
> > mkarpin...@opera.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Guys,
> > > > > >
> > > > > > I have a specific use case for Drill, in which I'd like to be
> able
> > to
> > > > > count
> > > > > > unique values in columns with tens millions of distinct values.
> The
> > > > COUNT
> > > > > > DISTINCT method, unfortunately, does not scale both time- and
> > > > memory-wise
> > > > > > and the idea is to sort the data beforehand by the values of that
> > > > column
> > > > > > (let's call it ID), to have the row groups split at new a new ID
> > > > boundary
> > > > > > and to extend Drill with an alternative version of COUNT that
> would
> > > > > simply
> > > > > > count the number of times the ID changes through out the entire
> > > table.
> > > > > This
> > > > > > way, we could expect that counting unique values of pre-sorted
> > > columns
> > > > > > could have complexity comparable to that of the regular COUNT
> > > operator
> > > > (a
> > > > > > full scan). So, to sum up, I have three questions:
> > > > > >
> > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > 2. Can it be done in a modular way (eg, a dedicated UDAF or an
> > > > operator),
> > > > > > so without heavy hacking throughout entire Drill?
> > > > > > 3. How to do it?
> > > > > >
> > > > > > Our initial experience with Drill was very good - it's an
> excellent
> > > > tool.
> > > > > > But in order to be able to adopt it, we need to sort out this one
> > > > central
> > > > > > issue.
> > > > > >
> > > > > > Cheers,
> > > > > > Marcin
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Counting large numbers of unique values

Reply via email to