Marcin,

They did comment. The answer is that the default is to use hashed aggregation (which will be faster when there is lots of memory), with the option to use sort aggregation (which is basically what you were suggesting).
Did you mean to suggest that your data is already known to be sorted and thus the sort step should be omitted?

On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <mkarpin...@opera.com> wrote:

> @Jacques, thanks for the information - I'm definitely going to check out that option.
>
> I'm also curious that none of you guys commented on my original idea of counting distinct values by a simple aggregation of pre-sorted data - is it because it doesn't make sense to you guys, or because you think your suggestions are easier to implement?
>
> On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Two additional notes here:
> >
> > Drill can actually do an aggregation using either a hash table based aggregation or a sort based aggregation. By default, generally the hash aggregation will be selected first. However, you can disable hash based aggregation if you specifically think that a sort based aggregation will perform better for your use case. You can do this by running the command ALTER SESSION SET `planner.enable_hashagg` = FALSE;
> >
> > We have always had it on our roadmap to implement an approximate count distinct function but haven't gotten to it yet. As Ted mentions, using this technique would substantially reduce data shuffling and could be done with a moderate level of effort since our UDAF interface is pluggable.
> >
> > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > How precise do your counts need to be? Can you accept a fraction of a percent statistical error?
> > >
> > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com> wrote:
> > >
> > > > Drill already does most of this type of transformation. If you do an 'EXPLAIN PLAN FOR <your count(distinct) query>' you will see that it first does a grouping on the column and then applies the COUNT(column). The first level grouping can be done either based on sorting or hashing, and this is configurable through a system option.
> > > >
> > > > Aman
> > > >
> > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpin...@opera.com> wrote:
> > > >
> > > > > Hi Guys,
> > > > >
> > > > > I have a specific use case for Drill, in which I'd like to be able to count unique values in columns with tens of millions of distinct values. The COUNT DISTINCT method, unfortunately, does not scale, either time- or memory-wise, and the idea is to sort the data beforehand by the values of that column (let's call it ID), to have the row groups split at a new ID boundary, and to extend Drill with an alternative version of COUNT that would simply count the number of times the ID changes throughout the entire table. This way, we could expect that counting unique values of pre-sorted columns could have complexity comparable to that of the regular COUNT operator (a full scan). So, to sum up, I have three questions:
> > > > >
> > > > > 1. Can such a scenario be realized in Drill?
> > > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or an operator), i.e. without heavy hacking throughout Drill?
> > > > > 3. How to do it?
> > > > >
> > > > > Our initial experience with Drill was very good - it's an excellent tool. But in order to be able to adopt it, we need to sort out this one central issue.
> > > > >
> > > > > Cheers,
> > > > > Marcin
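
For what it's worth, the counting-changes idea maps fairly directly onto the pluggable aggregate UDF ("UDAF") interface Jacques mentioned. Below is a rough sketch only, not an existing Drill function: the package, class, and function names are made up, it assumes the ID column is a non-nullable BIGINT, and it assumes a single fragment scanning ID-sorted input.

// Sketch of a Drill aggregate UDF that counts distinct values in a column
// that is already sorted: every change of value starts a new run, and the
// number of runs equals the number of distinct IDs.
package com.example.drill.udf;                      // hypothetical package

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.BigIntHolder;

@FunctionTemplate(name = "count_distinct_sorted",   // hypothetical name
                  scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public class CountDistinctSorted implements DrillAggFunc {

  @Param     BigIntHolder in;     // current ID value (assumed BIGINT, not null)
  @Workspace BigIntHolder prev;   // last ID value seen
  @Workspace BigIntHolder seen;   // 0 until the first row arrives
  @Workspace BigIntHolder runs;   // number of distinct runs counted so far
  @Output    BigIntHolder out;

  public void setup() {
    prev = new BigIntHolder();
    seen = new BigIntHolder();
    runs = new BigIntHolder();
    seen.value = 0;
    runs.value = 0;
  }

  public void add() {
    // Input is assumed pre-sorted on ID, so any change in value marks a new
    // distinct ID; equal consecutive values belong to the same run.
    if (seen.value == 0 || in.value != prev.value) {
      runs.value++;
    }
    prev.value = in.value;
    seen.value = 1;
  }

  public void output() {
    out.value = runs.value;
  }

  public void reset() {
    seen.value = 0;
    runs.value = 0;
  }
}

Packaged as a UDF jar on Drill's classpath, something like this would be called as an ordinary aggregate (e.g. SELECT count_distinct_sorted(id) FROM ...). The parallel case is the catch: per-fragment counts can only be summed safely if fragment boundaries coincide with ID boundaries, which is exactly the row-group splitting Marcin describes; otherwise a merge step would have to correct for IDs straddling a boundary.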