Re: complex data structure aggregators?

Charles Givre Mon, 12 Aug 2019 16:34:36 -0700

HI Ted, 
You might want to take a look at this repo: 
https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
 
<https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java>
This was an experiment to see if I could write a function to calculate a 
median.  I found a streaming algorithm to do so, but it required the use of two 
stacks.  This was more of a "can I do this" type challenge than a "will this 
really work well" but I did get it to work.  In any event, the way I did it was 
to use the @Workspace and use an ObjectHolder.  Maybe this will help you out.
-- C



> On Aug 12, 2019, at 6:03 PM, Paul Rogers <[email protected]> wrote:
> 
> Hi Ted,
> 
> You are now at the point that you'll have to experiment. Drill provides an 
> annotation for aggregate state:  @Workspace. The value must be declared as a 
> "holder". You'll have to check if VarBinaryHolder is allowed, and, if so, how 
> you allocate memory and remember the offset into the array. (My guess is that 
> this may not work.)
> @Workspace does allow you to specify a holder for a Java object, but such 
> objects won't be spilled to disk when, say, the hash aggregate spills. This 
> means your aggregate will work fine at small scale, then mysteriously fail 
> once moved into production. Fun.
> 
> Unless aggregate UDFs are special, they can return a VarChar or VarBinary 
> result. The book explains how to do this for VarChar, some poking around in 
> the Drill source should identify how to do so for VarBinary. (There are 
> crufty details about allocating space, copying over data, etc.)
> 
> FWIW: There is a pile of information on UDF internals on my GitHub Wiki. [1] 
> Aggregate UDFS are covered in [2]. Once we learn the answers to your specific 
> questions, we can add the info to the Wiki.
> 
> Thanks,
> - Paul
> 
> [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> 
> 
> [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> 
> 
> 
> 
> 
> 
>    On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning 
> <[email protected]> wrote:  
> 
> I am trying to figure out how to build an approximate percentile estimator.
> 
> I have a fancy data structure that will do this. It can live in bounded
> memory with no allocation. I can add numbers to the digest easily enough.
> And the required results can be extracted from the structure.
> 
> What I would need to know:
> 
> - how to use a fixed array of bytes as the state of an aggregating UDF
> 
> - how to pass in an argument to an aggregator OR (better) how to use the
> binary result of an aggregator in another function.
> 
> On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <[email protected]> wrote:
> 
>> Ted,
>> Can we ask what it is you are trying to build a UDF for?
>> --C
>> 
>>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <[email protected]>
>> wrote:
>>> 
>>> Hi Ted,
>>> 
>>> Thanks for the link; I suspected there was some trick for stddev. The
>> point still stands that, if the algorithm requires multiple passes over the
>> data (ML, say), can't be done in Drill.
>>> 
>>> Each UDF must return exactly one value. It can return a map if you want
>> multiple values (though someone would have to check that projection works
>> to convert these to scalar top-level values). AFAIK, a UDF can produce a
>> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
>> accumulate a VarChar or VarBinary because Drill cannot insert values into
>> an existing variable-length vector.
>>> 
>>> UDFs need your knack for finding a workaround to get your job done; they
>> have pretty strong limitations on the surface.
>>> 
>>> Thanks,
>>> - Paul
>>> 
>>> 
>>> 
>>>     On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
>> [email protected]> wrote:
>>> 
>>> Is it possible for a UDF to produce multiple scalar results? Can it
>> produce
>>> a binary result?
>>> 
>>> Also, as a nit, standard deviation doesn't require buffering all the
>> data.
>>> It just requires that you have three accumulators, one for count, one for
>>> mean and one for mean squared deviation.  There is a slightly tricky
>>> algorithm called Welford's algorithm
>>> <
>> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
>>> 
>>> which
>>> allows good numerical stability while computing this on-line.
>>> 
>>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <[email protected]>
>>> wrote:
>>> 
>>>> Hi Ted,
>>>> 
>>>> Last I checked (when we wrote the book chapter on the subject),
>> aggregate
>>>> state are limited to scalars and Drill-defined types. There is no
>> support
>>>> to spill aggregate state, so that state will be lost if spilling is
>>>> required to handle large aggregate batches. The current solution works
>> for
>>>> simple cases such as totals and averages.
>>>> 
>>>> Aggregate UDFs share no state, so it is not possible for one function to
>>>> use state accumulated by another. If, for example, you want sum, average
>>>> and standard deviation, you'll have to accumulate the total three times,
>>>> average twice, and so on. Note that the std dev function will require
>>>> buffering all data in one's own array (without any spilling or other
>>>> support), to allow computing the (X-bar - X)^2 part of the calculation.
>>>> 
>>>> A UDF can emit a byte array (have to check it this is true of aggregate
>>>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
>>>> VarChar.
>>>> 
>>>> All this is from memory and so is only approximately accurate. YMMV.
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
>>>> [email protected]> wrote:
>>>> 
>>>>   What is the current state of building aggregators that have complex
>> state
>>>> via UDFs?
>>>> 
>>>> Is it possible to define multi-level aggregators in a UDF?
>>>> 
>>>> Can the output of a UDF be a byte array?
>>>> 
>>>> 
>>>> (these are three different questions)
>>>> 
>> 
>>

Re: complex data structure aggregators?

Reply via email to