AFAIK, Drill's planner currently does not expose the sort-ness of the underlying data; if the data is pre-sorted, Drill's planner would not recognize that, and would still require a sort operator for sort-based aggregation. Part of the reason is that Drill does not have a centralized metastore to keep track of the metadata of the data/tables/files. Therefore, if the underlying data does not expose its sort-ness, the planner does not have that knowledge.
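To make the trade-off concrete, here is a small sketch (mine, not Drill's actual operator code) contrasting hash-based aggregation, which needs no input ordering, with sort-based (streaming) aggregation, which is only correct on input sorted by the grouping key. This is exactly why a planner that cannot see sort-ness must insert a sort operator first:

```python
def hash_aggregate(keys):
    """Hash aggregation: no input ordering required, but memory
    grows with the number of distinct keys."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts

def streaming_aggregate(sorted_keys):
    """Sort-based (streaming) aggregation: each group is finished as
    soon as the key changes (a real operator would emit it downstream
    immediately, keeping memory per group constant). Only correct if
    the input is already sorted on the grouping key."""
    result = {}
    prev, count = None, 0
    for key in sorted_keys:
        if prev is not None and key != prev:
            result[prev] = count  # group complete: key changed
            count = 0
        prev = key
        count += 1
    if prev is not None:
        result[prev] = count
    return result

data = ["a", "a", "b", "b", "b", "c"]  # already sorted on the key
assert hash_aggregate(data) == streaming_aggregate(data)
```

If the planner knew `data` was pre-sorted, it could pick the streaming variant without a preceding sort; without that metadata, it cannot safely do so.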
We do plan to enhance the planner / storage plugin interface, such that the planner would be able to utilize the sort-ness / partition column information.

On Tue, Apr 7, 2015 at 3:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Marcin,
>
> They did comment. The answer is that the default is to use hashed
> aggregation (which will be faster when there is lots of memory) with the
> option to use sort aggregation (which is basically what you were
> suggesting).
>
> Did you mean to suggest that your data is already known to be sorted and
> thus the sort step should be omitted?
>
> On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <mkarpin...@opera.com>
> wrote:
>
> > @Jacques, thanks for the information - I'm definitely going to check
> > out that option.
> >
> > I'm also curious that none of you guys commented on my original idea of
> > counting distinct values by a simple aggregation of pre-sorted data -
> > is it because it doesn't make sense to you guys, or because you think
> > your suggestions are easier to implement?
> >
> > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org>
> > wrote:
> >
> > > Two additional notes here:
> > >
> > > Drill can actually do an aggregation using either a hash table based
> > > aggregation or a sort based aggregation. By default, generally the
> > > hash aggregation will be selected first. However, you can disable
> > > hash based aggregation if you specifically think that a sort based
> > > aggregation will perform better for your use case. You can do this
> > > by running the command:
> > > ALTER SESSION SET `planner.enable_hashagg` = FALSE;
> > >
> > > We have always had it on our roadmap to implement an approximate
> > > count distinct function but haven't gotten to it yet. As Ted
> > > mentions, using this technique would substantially reduce data
> > > shuffling and could be done with a moderate level of effort since
> > > our UDAF interface is pluggable.
> > >
> > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > How precise do your counts need to be? Can you accept a fraction
> > > > of a percent statistical error?
> > > >
> > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asi...@maprtech.com>
> > > > wrote:
> > > >
> > > > > Drill already does most of this type of transformation. If you
> > > > > do an 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > > > you will see that it first does a grouping on the column and
> > > > > then applies the COUNT(column). The first level grouping can be
> > > > > done either based on sorting or hashing, and this is
> > > > > configurable through a system option.
> > > > >
> > > > > Aman
> > > > >
> > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski
> > > > > <mkarpin...@opera.com> wrote:
> > > > >
> > > > > > Hi Guys,
> > > > > >
> > > > > > I have a specific use case for Drill, in which I'd like to be
> > > > > > able to count unique values in columns with tens of millions
> > > > > > of distinct values. The COUNT DISTINCT method, unfortunately,
> > > > > > does not scale both time- and memory-wise, and the idea is to
> > > > > > sort the data beforehand by the values of that column (let's
> > > > > > call it ID), to have the row groups split at a new ID
> > > > > > boundary, and to extend Drill with an alternative version of
> > > > > > COUNT that would simply count the number of times the ID
> > > > > > changes throughout the entire table. This way, we could expect
> > > > > > that counting unique values of pre-sorted columns could have
> > > > > > complexity comparable to that of the regular COUNT operator (a
> > > > > > full scan). So, to sum up, I have three questions:
> > > > > >
> > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or
> > > > > > an operator), so without heavy hacking throughout entire
> > > > > > Drill?
> > > > > > 3. How to do it?
> > > > > >
> > > > > > Our initial experience with Drill was very good - it's an
> > > > > > excellent tool. But in order to be able to adopt it, we need
> > > > > > to sort out this one central issue.
> > > > > >
> > > > > > Cheers,
> > > > > > Marcin
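For reference, the change-counting idea Marcin describes in the quoted thread can be sketched in a few lines (a hypothetical illustration, not an actual Drill UDAF): a single O(n) scan over a pre-sorted column with O(1) memory, which is what makes it attractive compared to hashing tens of millions of distinct IDs.

```python
def count_distinct_sorted(values):
    """Counts distinct values in a single pass, ASSUMING `values` is
    sorted on that column. Equivalent to COUNT(DISTINCT ...) only
    under that assumption: each change of value marks a new group."""
    count = 0
    prev = object()  # sentinel that compares unequal to any real value
    for v in values:
        if v != prev:
            count += 1
            prev = v
    return count

ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]  # pre-sorted by ID
assert count_distinct_sorted(ids) == len(set(ids))
```

Note that if the input is not actually sorted, the result silently overcounts, which is why such an operator would only be safe once the planner can verify sort-ness, as discussed at the top of this message.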