@Jacques, thanks for the information - I'm definitely going to check out that option.
I'm also curious why none of you commented on my original idea of counting distinct values by a simple aggregation over pre-sorted data. Is it because the idea doesn't make sense to you, or because you think your suggestions are easier to implement?

On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> Two additional notes here:
>
> Drill can actually do an aggregation using either a hash table based
> aggregation or a sort based aggregation. By default, generally the hash
> aggregation will be selected first. However, you can disable hash based
> aggregation if you specifically think that a sort based aggregation will
> perform better for your use case. You can do this by running the command
> ALTER SESSION SET `planner.enable_hashagg` = FALSE;
>
> We have always had it on our roadmap to implement an approximate count
> distinct function but haven't gotten to it yet. As Ted mentions, using
> this technique would substantially reduce data shuffling and could be done
> with a moderate level of effort since our UDAF interface is pluggable.
>
> On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > How precise do your counts need to be? Can you accept a fraction of a
> > percent statistical error?
> >
> > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com> wrote:
> >
> > > Drill already does most of this type of transformation. If you do an
> > > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > you will see that it first does a grouping on the column and then applies
> > > the COUNT(column). The first level grouping can be done either based on
> > > sorting or hashing and this is configurable through a system option.
> > >
> > > Aman
> > >
> > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com> wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > I have a specific use case for Drill, in which I'd like to be able to count
> > > > unique values in columns with tens of millions of distinct values. The COUNT
> > > > DISTINCT method, unfortunately, does not scale either time- or memory-wise,
> > > > and the idea is to sort the data beforehand by the values of that column
> > > > (let's call it ID), to have the row groups split at a new ID boundary,
> > > > and to extend Drill with an alternative version of COUNT that would simply
> > > > count the number of times the ID changes throughout the entire table. This
> > > > way, we could expect that counting unique values of pre-sorted columns
> > > > could have complexity comparable to that of the regular COUNT operator (a
> > > > full scan). So, to sum up, I have three questions:
> > > >
> > > > 1. Can such a scenario be realized in Drill?
> > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or an operator),
> > > > so without heavy hacking throughout entire Drill?
> > > > 3. How to do it?
> > > >
> > > > Our initial experience with Drill was very good - it's an excellent tool.
> > > > But in order to be able to adopt it, we need to sort out this one central
> > > > issue.
> > > >
> > > > Cheers,
> > > > Marcin
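
For concreteness, the pre-sorted counting idea from the original message could be sketched roughly as follows. This is plain Python, not Drill code. The function names are hypothetical, and the sketch assumes the input is already sorted by ID and that every row group starts at a new ID boundary, as proposed in the thread.

```python
from typing import Hashable, Iterable


def count_distinct_sorted(values: Iterable[Hashable]) -> int:
    """Count distinct values in a stream that is already sorted by value.

    Only the previous value is remembered, so memory use stays constant
    no matter how many distinct IDs the column holds.
    """
    count = 0
    first = True
    prev = None
    for value in values:
        # A new distinct value starts whenever the ID changes.
        if first or value != prev:
            count += 1
            prev = value
            first = False
    return count


def count_distinct_row_groups(row_groups: Iterable[Iterable[Hashable]]) -> int:
    """Sum per-group counts.

    Assumption (hypothetical layout): no ID spans two row groups, so the
    per-group counts can simply be added without any de-duplication.
    """
    return sum(count_distinct_sorted(group) for group in row_groups)
```

If an ID could span two row groups, the simple sum would over-count, which is why the split at ID boundaries matters for this approach.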