Hi guys,

I have a specific use case for Drill: I'd like to count unique values in columns with tens of millions of distinct values. Unfortunately, the COUNT DISTINCT method does not scale in either time or memory. The idea is to sort the data beforehand by the values of that column (let's call it ID), to have the row groups split so that each one starts at a new ID boundary, and to extend Drill with an alternative version of COUNT that simply counts the number of times the ID changes throughout the entire table. This way, we could expect counting unique values of a pre-sorted column to have complexity comparable to that of the regular COUNT operator (a full scan).

So, to sum up, I have three questions:
1. Can such a scenario be realized in Drill?
2. Can it be done in a modular way (e.g., as a dedicated UDAF or operator), without heavy hacking throughout Drill?
3. How would we go about it?

Our initial experience with Drill was very good - it's an excellent tool. But in order to adopt it, we need to sort out this one central issue.

Cheers,
Marcin
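
P.S. To make the idea concrete, here is a minimal sketch of the counting logic. It is plain Python, not Drill code, and the names count_runs and count_distinct_presorted are made up for illustration. It assumes the IDs are already sorted and that no ID spans two row groups, so the per-group counts can simply be added up:

```python
from typing import Hashable, Iterable


def count_runs(sorted_ids: Iterable[Hashable]) -> int:
    """Count distinct values in a pre-sorted sequence by counting value changes."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real ID
    for current in sorted_ids:
        if current != previous:
            # A new run starts here, so this is a value we have not seen yet.
            count += 1
            previous = current
    return count


def count_distinct_presorted(row_groups: Iterable[Iterable[Hashable]]) -> int:
    """Add up per-group counts.

    Assumption: every row group starts at a new ID, so no ID is counted twice.
    """
    return sum(count_runs(group) for group in row_groups)


# Hypothetical data: three row groups, each starting at a new ID.
row_groups = [[1, 1, 2], [3, 3, 3], [4, 5, 5]]
print(count_distinct_presorted(row_groups))
```

Each row group needs only a single pass and constant memory, which is why I would expect the cost to be close to that of a plain COUNT. If an ID could straddle two row groups, the partial counts could no longer be summed blindly; the boundary values would have to be compared as well.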