Two additional notes here:

Drill can perform an aggregation using either a hash-based aggregation or a
sort-based aggregation.  By default, the planner will generally choose hash
aggregation.  However, you can disable hash-based aggregation if you think
that a sort-based aggregation will perform better for your use case.  You
can do this by running the command:

ALTER SESSION SET `planner.enable_hashagg` = FALSE;
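
To make the difference between the two strategies concrete, here is a minimal
Python sketch of counting distinct values each way.  It only illustrates the
general idea and is not how Drill's operators are implemented; the function
names are made up for this example.  Note that the sort-based variant, given
already sorted input, just counts the places where the value changes, which is
essentially what Marcin describes below.

```python
from itertools import groupby


def count_distinct_hash(values):
    """Hash-based: remember every distinct value seen so far.

    Needs no ordering, but memory grows with the number of distinct values.
    """
    seen = set()
    for v in values:
        seen.add(v)
    return len(seen)


def count_distinct_sorted(sorted_values):
    """Sort-based: input must already be sorted.

    groupby yields one group per run of equal adjacent values, so counting
    the groups counts the value changes in a single streaming pass.
    """
    return sum(1 for _ in groupby(sorted_values))


ids = [7, 3, 7, 1, 3, 3, 9]
print(count_distinct_hash(ids))             # prints 4
print(count_distinct_sorted(sorted(ids)))   # prints 4
```

The hash variant pays in memory, while the sorted variant pays for the sort
unless the data is already stored in that order.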

We have always had it on our roadmap to implement an approximate count
distinct function but haven't gotten to it yet.  As Ted mentions, using
this technique would substantially reduce data shuffling and could be done
with a moderate level of effort since our UDAF interface is pluggable.
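
As a rough illustration of why an approximate approach cuts down on
shuffling, here is a minimal Python sketch of one well-known technique,
HyperLogLog.  This is an assumption on my part: the thread does not say which
algorithm would be used, and this is not Drill code.  The class name, the
default precision `p`, and the choice of SHA-1 as the hash are all
illustrative.  The key property is that each fragment can build a small
fixed-size sketch locally, and only the sketches need to be merged.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not a Drill API)."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Take a 64-bit hash of the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is an element-wise max of their registers.
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


# Two "fragments" each sketch their own slice, then the sketches are merged.
left, right = HyperLogLog(), HyperLogLog()
for i in range(100000):
    left.add(i)
for i in range(100000, 200000):
    right.add(i)
left.merge(right)
print(round(left.estimate()))   # approximately 200000
```

With m registers the standard error is about 1.04 / sqrt(m), so p = 14
(16384 registers) gives roughly 0.8%, regardless of how many distinct values
are counted.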



On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> How precise do your counts need to be?  Can you accept a fraction of a
> percent statistical error?
>
>
>
> On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com> wrote:
>
> > Drill already does most of this type of transformation.  If you do an
> > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > you will see that it first does a grouping on the column and then applies
> > the COUNT(column).  The first level grouping can be done either based on
> > sorting or hashing and this is configurable through a system option.
> >
> > Aman
> >
> > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com>
> > wrote:
> >
> > > Hi Guys,
> > >
> > > > I have a specific use case for Drill, in which I'd like to be able to
> > > > count unique values in columns with tens of millions of distinct
> > > > values. The COUNT DISTINCT method, unfortunately, scales poorly in
> > > > both time and memory. The idea is to sort the data beforehand by the
> > > > values of that column (let's call it ID), to have the row groups
> > > > split at a new ID boundary, and to extend Drill with an alternative
> > > > version of COUNT that would simply count the number of times the ID
> > > > changes throughout the entire table. This way, we could expect that
> > > > counting unique values of pre-sorted columns could have complexity
> > > > comparable to that of the regular COUNT operator (a full scan). So,
> > > > to sum up, I have three questions:
> > >
> > > 1. Can such a scenario be realized in Drill?
> > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or an
> > > > operator), without heavy hacking throughout all of Drill?
> > > 3. How to do it?
> > >
> > > > Our initial experience with Drill was very good - it's an excellent
> > > > tool. But in order to be able to adopt it, we need to sort out this
> > > > one central issue.
> > >
> > > Cheers,
> > > Marcin
> > >
> >
>
