@ Ted, ideally, I'd like to get exact results, but in case of real problems, we could perhaps settle on approximate counting. Is there already such a functionality in Drill?
Cheers, Marcin On Tue, Apr 7, 2015 at 5:20 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > How precise do your counts need to be? Can you accept a fraction of a > percent statistical error? > > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com> wrote: > > > Drill already does most of this type of transformation. If you do an > > 'EXPLAIN PLAN FOR <your count(distinct) query>' > > you will see that it first does a grouping on the column and then applies > > the COUNT(column). The first level grouping can be done either based on > > sorting or hashing and this is configurable through a system option. > > > > Aman > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com> > > wrote: > > > > > Hi Guys, > > > > > > I have a specific use case for Drill, in which I'd like to be able to > > count > > > unique values in columns with tens millions of distinct values. The > COUNT > > > DISTINCT method, unfortunately, does not scale both time- and > memory-wise > > > and the idea is to sort the data beforehand by the values of that > column > > > (let's call it ID), to have the row groups split at new a new ID > boundary > > > and to extend Drill with an alternative version of COUNT that would > > simply > > > count the number of times the ID changes through out the entire table. > > This > > > way, we could expect that counting unique values of pre-sorted columns > > > could have complexity comparable to that of the regular COUNT operator > (a > > > full scan). So, to sum up, I have three questions: > > > > > > 1. Can such a scenario be realized in Drill? > > > 2. Can it be done in a modular way (eg, a dedicated UDAF or an > operator), > > > so without heavy hacking throughout entire Drill? > > > 3. How to do it? > > > > > > Our initial experience with Drill was very good - it's an excellent > tool. > > > But in order to be able to adopt it, we need to sort out this one > central > > > issue. > > > > > > Cheers, > > > Marcin > > > > > >