Hi Guys,

I have a specific use case for Drill: I'd like to count unique values in
columns with tens of millions of distinct values. Unfortunately, COUNT
DISTINCT does not scale well in either time or memory. The idea is to sort
the data beforehand by the values of that column (let's call it ID), to
have the row groups split at a new ID boundary, and to extend Drill with
an alternative version of COUNT that simply counts the number of times the
ID changes throughout the entire table. This way, counting unique values
of a pre-sorted column should have complexity comparable to that of the
regular COUNT operator (a full scan). So, to sum up, I have three
questions:

1. Can such a scenario be realized in Drill?
2. Can it be done in a modular way (e.g., as a dedicated UDAF or operator),
i.e. without heavy hacking throughout Drill?
3. How to do it?
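
To make the idea concrete, here is a minimal sketch of the counting logic in
plain Java. It does not use any Drill API; the class name, the State holder
and the representation of row groups as long[] batches are all made up for
illustration. It assumes the ID column is already sorted and non-null, and
it counts the very first value as a "change" so that the result equals the
number of distinct IDs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Drill code, not a Drill UDAF.
public class SortedDistinctCount {

    /** Running state carried from one batch (row group) to the next. */
    static final class State {
        long distinct = 0;       // number of distinct IDs seen so far
        long lastId = 0;         // most recent ID, valid only if seenAny
        boolean seenAny = false; // false until the first value arrives
    }

    /** Consumes one batch of sorted IDs in a single pass, O(1) memory. */
    static void accumulate(State s, long[] batch) {
        for (long id : batch) {
            // A new distinct value starts whenever the ID differs from
            // the previous one (or when this is the first value at all).
            if (!s.seenAny || id != s.lastId) {
                s.distinct++;
                s.lastId = id;
                s.seenAny = true;
            }
        }
    }

    /** Counts distinct IDs across batches that are globally sorted. */
    static long countDistinctSorted(List<long[]> batches) {
        State s = new State();
        for (long[] batch : batches) {
            accumulate(s, batch);
        }
        return s.distinct;
    }

    public static void main(String[] args) {
        List<long[]> batches = Arrays.asList(
            new long[] {1, 1, 2},
            new long[] {3, 3, 3},
            new long[] {7});
        System.out.println(countDistinctSorted(batches)); // prints 4
    }
}
```

If the row groups really do start at a new ID boundary, no state needs to be
carried between them: each group can be counted independently and the
per-group counts simply summed.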

Our initial experience with Drill was very good - it's an excellent tool.
But in order to be able to adopt it, we need to sort out this one central
issue.

Cheers,
Marcin