@Jacques, thanks for the information - I'm definitely going to check out
that option.

I'm also curious that none of you guys commented on my original idea of
counting distinct values by a simple aggregation of pre-sorted data - is it
because it doesn't make sense to you guys, or because you think your
suggestions are easier to implement?

On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> Two additional notes here:
>
> Drill can actually do an aggregation using either a hash table based
> aggregation or a sort based aggregation.  By default, generally the hash
> aggregation will be selected first.  However, you can disable hash based
> aggregation if you specifically think that a sort based aggregation will
> perform better for use case.  You can do this by running the command ALTER
> SESSION SET `planner.enable_hashagg` = FALSE;
>
> We have always had it on our roadmap to implement an approximate count
> distinct function but haven't gotten to it yet.  As Ted mentions, using
> this technique would substantially reduce data shuffling and could be done
> with a moderate level of effort since our UDAF interface is pluggable.
>
>
>
> On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > How precise do your counts need to be?  Can you accept a fraction of a
> > percent statistical error?
> >
> >
> >
> > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com> wrote:
> >
> > > Drill already does most of this type of transformation.  If you do an
> > > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > you will see that it first does a grouping on the column and then
> applies
> > > the COUNT(column).  The first level grouping can be done either based
> on
> > > sorting or hashing and this is configurable through a system option.
> > >
> > > Aman
> > >
> > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com
> >
> > > wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > I have a specific use case for Drill, in which I'd like to be able to
> > > count
> > > > unique values in columns with tens millions of distinct values. The
> > COUNT
> > > > DISTINCT method, unfortunately, does not scale both time- and
> > memory-wise
> > > > and the idea is to sort the data beforehand by the values of that
> > column
> > > > (let's call it ID), to have the row groups split at new a new ID
> > boundary
> > > > and to extend Drill with an alternative version of COUNT that would
> > > simply
> > > > count the number of times the ID changes through out the entire
> table.
> > > This
> > > > way, we could expect that counting unique values of pre-sorted
> columns
> > > > could have complexity comparable to that of the regular COUNT
> operator
> > (a
> > > > full scan). So, to sum up, I have three questions:
> > > >
> > > > 1. Can such a scenario be realized in Drill?
> > > > 2. Can it be done in a modular way (eg, a dedicated UDAF or an
> > operator),
> > > > so without heavy hacking throughout entire Drill?
> > > > 3. How to do it?
> > > >
> > > > Our initial experience with Drill was very good - it's an excellent
> > tool.
> > > > But in order to be able to adopt it, we need to sort out this one
> > central
> > > > issue.
> > > >
> > > > Cheers,
> > > > Marcin
> > > >
> > >
> >
>

Reply via email to