The thing is that with 60M unique values (which are hashes anyway) the
grouping may not scale well. I've been doing tests on a single machine so
far (32 cores, 64GB RAM) and the COUNT DISTINCT query wouldn't complete
(while other queries gave very encouraging results). So my idea is to
ease the run-time computation by doing some offline computation in the
ETL job. Initially, I thought that a custom UDAF could implement this
functionality, but after going through the source code I got the
impression that it won't do. I may be wrong in my assumptions here; we're
going to do testing on a 10-node cluster soon, but my understanding is
that grouping by so many distinct values is not really scalable. I'd be
happy to be proven wrong :)
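
To make the pre-sorted counting idea from the original message (quoted
below) concrete, here is a minimal sketch in plain Python. It is not Drill
code and uses no Drill API; the function names are hypothetical, and it
assumes the ETL job has already sorted the data by ID and that no ID spans
two row groups.

```python
from typing import Hashable, Iterable


def count_distinct_sorted(ids: Iterable[Hashable]) -> int:
    """Count distinct values in a stream that is already sorted by value.

    Only the previous value is remembered, so memory use is constant and
    the cost is a single pass, like a plain COUNT over a full scan.
    """
    count = 0
    previous = object()  # sentinel that compares unequal to every real ID
    for current in ids:
        if current != previous:  # the ID changed: a new distinct value
            count += 1
            previous = current
    return count


def count_distinct_row_groups(row_groups: Iterable[Iterable[Hashable]]) -> int:
    """Sum per-row-group counts.

    Assumption (hypothetical layout): row groups are split at ID
    boundaries, so no ID appears in more than one group and the partial
    counts can simply be added.
    """
    return sum(count_distinct_sorted(group) for group in row_groups)


if __name__ == "__main__":
    groups = [["a", "a", "b"], ["c", "c"], ["d"]]
    print(count_distinct_row_groups(groups))
```

If an ID could straddle two row groups, the partial counts would
over-count, which is why the split at an ID boundary matters.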

On Tue, Apr 7, 2015 at 5:11 PM, Aman Sinha <asi...@maprtech.com> wrote:

> Drill already does most of this type of transformation.  If you do an
> 'EXPLAIN PLAN FOR <your count(distinct) query>'
> you will see that it first does a grouping on the column and then applies
> the COUNT(column).  The first level grouping can be done either based on
> sorting or hashing and this is configurable through a system option.
>
> Aman
>
> On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com>
> wrote:
>
> > Hi Guys,
> >
> > I have a specific use case for Drill, in which I'd like to be able to
> > count unique values in columns with tens of millions of distinct values.
> > The COUNT DISTINCT method, unfortunately, scales neither time- nor
> > memory-wise, so the idea is to sort the data beforehand by the values of
> > that column (let's call it ID), to have the row groups split at a new ID
> > boundary, and to extend Drill with an alternative version of COUNT that
> > would simply count the number of times the ID changes throughout the
> > entire table. This way, we could expect that counting unique values of
> > pre-sorted columns would have complexity comparable to that of the
> > regular COUNT operator (a full scan). So, to sum up, I have three
> > questions:
> >
> > 1. Can such a scenario be realized in Drill?
> > 2. Can it be done in a modular way (e.g., a dedicated UDAF or an
> > operator), i.e., without heavy hacking throughout all of Drill?
> > 3. How to do it?
> >
> > Our initial experience with Drill was very good - it's an excellent tool.
> > But in order to be able to adopt it, we need to sort out this one central
> > issue.
> >
> > Cheers,
> > Marcin
> >
>
