Ted, 
Can we ask what it is you are trying to build a UDF for?
--C

> On Aug 12, 2019, at 2:23 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> 
> Hi Ted,
> 
> Thanks for the link; I suspected there was some trick for stddev. The point 
> still stands that, if the algorithm requires multiple passes over the data 
> (ML, say), it can't be done in Drill.
> 
> Each UDF must return exactly one value. It can return a map if you want 
> multiple values (though someone would have to check that projection works to 
> convert these to scalar top-level values). AFAIK, a UDF can produce a binary 
> buffer as output (type VarBinary). But, an aggregate UDF cannot accumulate a 
> VarChar or VarBinary because Drill cannot insert values into an existing 
> variable-length vector.
> 
> UDFs need your knack for finding a workaround to get your job done; they have 
> pretty strong limitations on the surface.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning 
> <ted.dunn...@gmail.com> wrote:  
> 
> Is it possible for a UDF to produce multiple scalar results? Can it produce
> a binary result?
> 
> Also, as a nit, standard deviation doesn't require buffering all the data.
> It just requires that you have three accumulators, one for count, one for
> mean and one for mean squared deviation.  There is a slightly tricky
> algorithm called Welford's algorithm
> <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>,
> which allows good numerical stability while computing this on-line.
> 
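For readers following the thread, the three-accumulator approach described above can be sketched as follows. This is a minimal illustration in plain Python, not Drill UDF code: the class name `RunningStats` and its methods are hypothetical and do not use any Drill API. It also assumes the common formulation in which the third accumulator holds the running sum of squared deviations (often called M2) rather than the mean squared deviation. The `merge` step is the standard pairwise combination of partial states, included because it is what a multi-level aggregation would need.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Fixed-size aggregate state: three scalars, no buffered rows.

    Hypothetical illustration only; not part of any Drill API.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        """Fold one value into the state (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the deviation from the old mean times the deviation
        # from the updated mean, which keeps the update stable.
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two partial states into a new state."""
        if other.count == 0:
            return RunningStats(self.count, self.mean, self.m2)
        if self.count == 0:
            return RunningStats(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    def sample_stddev(self) -> float:
        """Sample standard deviation; NaN when fewer than two values."""
        if self.count < 2:
            return float("nan")
        return math.sqrt(self.m2 / (self.count - 1))
```

The state stays at three scalars however many rows are seen, so it fits within scalar-only aggregate state as long as the engine lets one function keep several scalar fields. Whether Drill can combine partial states the way `merge` does is not established in this thread.
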
> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
> 
>> Hi Ted,
>> 
>> Last I checked (when we wrote the book chapter on the subject), aggregate
>> state is limited to scalars and Drill-defined types. There is no support
>> to spill aggregate state, so that state will be lost if spilling is
>> required to handle large aggregate batches. The current solution works for
>> simple cases such as totals and averages.
>> 
>> Aggregate UDFs share no state, so it is not possible for one function to
>> use state accumulated by another. If, for example, you want sum, average
>> and standard deviation, you'll have to accumulate the total three times,
>> average twice, and so on. Note that the std dev function will require
>> buffering all data in one's own array (without any spilling or other
>> support), to allow computing the (X-bar - X)^2 part of the calculation.
>> 
>> A UDF can emit a byte array (have to check if this is true of aggregate
>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
>> VarChar.
>> 
>> All this is from memory and so is only approximately accurate. YMMV.
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
>> ted.dunn...@gmail.com> wrote:
>> 
>>   What is the current state of building aggregators that have complex state
>> via UDFs?
>> 
>> Is it possible to define multi-level aggregators in a UDF?
>> 
>> Can the output of a UDF be a byte array?
>> 
>> 
>> (these are three different questions)
>> 
