Re: Counting large numbers of unique values

Ted Dunning Sat, 11 Apr 2015 05:44:27 -0700

Jason

I attribute the existence of such nice javadoc to your influence.  But should 
that doc say something about how ordinary types are injected? Or the protocol 
used to decode when state should be serialized?


Even just a pointer would be handy. 

Sent from my iPhone

> On Apr 10, 2015, at 11:06, Jason Altekruse <altekruseja...@gmail.com> wrote:
> 
> I'll open a doc ticket for this, but just a heads up, the UDF interfaces
> changed recently to remove the RecordBatch parameter from the setup method.
> This interface exposed too many system details to UDFs, to get the same
> system state to UDFs we are now exclusively using Injectable types.
> 
> The available Injectable types are given and documented here (will also add
> this to the doc ticket):
> https://github.com/apache/drill/blob/af7a52beeaed565e243d2d54ba75337ab95b924d/exec/java-exec/src/main/java/org/apache/drill/exec/ops/UdfUtilities.java
> 
> 
> 
> On Fri, Apr 10, 2015 at 7:51 AM, Abdel Hakim Deneche <adene...@maprtech.com>
> wrote:
> 
>> Drill Wiki is a good starting place:
>> 
>> https://cwiki.apache.org/confluence/display/DRILL/Develop+Custom+Functions
>> 
>> This page has links to 2 complete code examples of UDFs:
>> 
>> 
>> https://cwiki.apache.org/confluence/display/DRILL/Custom+Function+Interfaces
>> 
>>> On Thu, Apr 9, 2015 at 9:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> On Thu, Apr 9, 2015 at 1:38 AM, Adam Gilmore <dragoncu...@gmail.com>
>>> wrote:
>>> 
>>>> Ted - I'd be really interested in doing something like that
>> (approximate
>>>> aggregation results).  This would be very interesting in terms of
>>> standard
>>>> deviation, median, etc.
>>> 
>>> Yes.  This is a very common need.
>>> 
>>> Writing a UDAF is a matter (mostly) of filling in a code generation
>>> template that accounts for the types you want to handle.  As is common,
>>> there is a state update call and a finalize call that gets the final
>>> result.
>>> 
>>> Can anybody point Adam to an example?  Last time I tried this, I started
>> by
>>> finding the code for sum and worked from that.  Some pointers would save
>>> some dead ends relative to the process I went through.
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> I know there is another project out there that trades off speed vs
>>> accuracy
>>>> (the name of which escapes me).  If we could easily add something onto
>>>> Drill like this, it'd be hugely beneficial.
>>>> 
>>>>> On Wed, Apr 8, 2015 at 8:25 AM, Ted Dunning <ted.dunn...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Marcin,
>>>>> 
>>>>> They did comment.  The answer is that the default is to use hashed
>>>>> aggregation (which will be faster when there is lots of memory) with
>>> the
>>>>> option to use sort aggregation (which is basically what you were
>>>>> suggesting).
>>>>> 
>>>>> Did you mean to suggest that your data is already known to be sorted
>>> and
>>>>> thus the sort step should be omitted?
>>>>> 
>>>>> 
>>>>> On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <
>> mkarpin...@opera.com
>>>> 
>>>>> wrote:
>>>>> 
>>>>>> @Jacques, thanks for the information - I'm definitely going to
>> check
>>>> out
>>>>>> that option.
>>>>>> 
>>>>>> I'm also curious that none of you guys commented on my original
>> idea
>>> of
>>>>>> counting distinct values by a simple aggregation of pre-sorted
>> data -
>>>> is
>>>>> it
>>>>>> because it doesn't make sense to you guys, or because you think
>> your
>>>>>> suggestions are easier to implement?
>>>>>> 
>>>>>> On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacq...@apache.org
>>> 
>>>>> wrote:
>>>>>> 
>>>>>>> Two additional notes here:
>>>>>>> 
>>>>>>> Drill can actually do an aggregation using either a hash table
>>> based
>>>>>>> aggregation or a sort based aggregation.  By default, generally
>> the
>>>>> hash
>>>>>>> aggregation will be selected first.  However, you can disable
>> hash
>>>>> based
>>>>>>> aggregation if you specifically think that a sort based
>> aggregation
>>>>> will
>>>>>>> perform better for use case.  You can do this by running the
>>> command
>>>>>> ALTER
>>>>>>> SESSION SET `planner.enable_hashagg` = FALSE;
>>>>>>> 
>>>>>>> We have always had it on our roadmap to implement an approximate
>>>> count
>>>>>>> distinct function but haven't gotten to it yet.  As Ted mentions,
>>>> using
>>>>>>> this technique would substantially reduce data shuffling and
>> could
>>> be
>>>>>> done
>>>>>>> with a moderate level of effort since our UDAF interface is
>>>> pluggable.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <
>> ted.dunn...@gmail.com
>>>> 
>>>>>> wrote:
>>>>>>> 
>>>>>>>> How precise do your counts need to be?  Can you accept a
>> fraction
>>>> of
>>>>> a
>>>>>>>> percent statistical error?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <
>> asi...@maprtech.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Drill already does most of this type of transformation.  If
>> you
>>>> do
>>>>> an
>>>>>>>>> 'EXPLAIN PLAN FOR <your count(distinct) query>'
>>>>>>>>> you will see that it first does a grouping on the column and
>>> then
>>>>>>> applies
>>>>>>>>> the COUNT(column).  The first level grouping can be done
>> either
>>>>> based
>>>>>>> on
>>>>>>>>> sorting or hashing and this is configurable through a system
>>>>> option.
>>>>>>>>> 
>>>>>>>>> Aman
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <
>>>>>> mkarpin...@opera.com
>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Guys,
>>>>>>>>>> 
>>>>>>>>>> I have a specific use case for Drill, in which I'd like to
>> be
>>>>> able
>>>>>> to
>>>>>>>>> count
>>>>>>>>>> unique values in columns with tens millions of distinct
>>> values.
>>>>> The
>>>>>>>> COUNT
>>>>>>>>>> DISTINCT method, unfortunately, does not scale both time-
>> and
>>>>>>>> memory-wise
>>>>>>>>>> and the idea is to sort the data beforehand by the values
>> of
>>>> that
>>>>>>>> column
>>>>>>>>>> (let's call it ID), to have the row groups split at new a
>> new
>>>> ID
>>>>>>>> boundary
>>>>>>>>>> and to extend Drill with an alternative version of COUNT
>> that
>>>>> would
>>>>>>>>> simply
>>>>>>>>>> count the number of times the ID changes through out the
>>> entire
>>>>>>> table.
>>>>>>>>> This
>>>>>>>>>> way, we could expect that counting unique values of
>>> pre-sorted
>>>>>>> columns
>>>>>>>>>> could have complexity comparable to that of the regular
>> COUNT
>>>>>>> operator
>>>>>>>> (a
>>>>>>>>>> full scan). So, to sum up, I have three questions:
>>>>>>>>>> 
>>>>>>>>>> 1. Can such a scenario be realized in Drill?
>>>>>>>>>> 2. Can it be done in a modular way (eg, a dedicated UDAF or
>>> an
>>>>>>>> operator),
>>>>>>>>>> so without heavy hacking throughout entire Drill?
>>>>>>>>>> 3. How to do it?
>>>>>>>>>> 
>>>>>>>>>> Our initial experience with Drill was very good - it's an
>>>>> excellent
>>>>>>>> tool.
>>>>>>>>>> But in order to be able to adopt it, we need to sort out
>> this
>>>> one
>>>>>>>> central
>>>>>>>>>> issue.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Marcin
>> 
>> 
>> 
>> --
>> 
>> Abdelhakim Deneche
>> 
>> Software Engineer
>> 
>>  <http://www.mapr.com/>
>> 
>> 
>> Now Available - Free Hadoop On-Demand Training
>> <
>> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
>>

Re: Counting large numbers of unique values

Reply via email to