Hi Ted,

Thanks for the link; I suspected there was some trick for stddev. The point 
still stands that, if an algorithm requires multiple passes over the data (ML, 
say), it can't be done in Drill.

Each UDF must return exactly one value. It can return a map if you want 
multiple values (though someone would have to check that projection works to 
convert these to scalar top-level values). AFAIK, a UDF can produce a binary 
buffer as output (type VarBinary). But, an aggregate UDF cannot accumulate a 
VarChar or VarBinary because Drill cannot insert values into an existing 
variable-length vector.

Getting real work done with UDFs takes your knack for finding workarounds; on 
the surface, they have pretty strong limitations.

Thanks,
- Paul

 

    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning 
<ted.dunn...@gmail.com> wrote:  
 
 Is it possible for a UDF to produce multiple scalar results? Can it produce
a binary result?

Also, as a nit, standard deviation doesn't require buffering all the data.
It just requires that you have three accumulators, one for count, one for
mean and one for mean squared deviation.  There is a slightly tricky
algorithm called Welford's algorithm
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>,
which allows good numerical stability while computing this on-line.
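
For reference, here is a minimal sketch of the three-accumulator approach in
plain Java. It does not use Drill's UDF API; the class and method names are
made up for illustration. As an assumption of this sketch, the third
accumulator holds the running sum of squared deviations (often called M2)
rather than their mean, and the variance is derived from it on output.

```java
/**
 * Streaming mean and variance via Welford's online algorithm.
 * Hypothetical helper for illustration only; not part of Drill.
 */
final class WelfordAccumulator {
    private long count;   // number of values seen so far
    private double mean;  // running mean of the values
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one value into the state; no earlier values are retained. */
    void add(double x) {
        count++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / count;     // update the mean
        m2 += delta * (x - mean);  // old deviation times new deviation
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance (n - 1 denominator); NaN with fewer than two values. */
    double sampleVariance() {
        return count < 2 ? Double.NaN : m2 / (count - 1);
    }

    double sampleStdDev() {
        return Math.sqrt(sampleVariance());
    }
}
```

The state is three fixed-width numbers regardless of how many rows are
added. Feeding in 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a sample
variance of 32/7.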

On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Ted,
>
> Last I checked (when we wrote the book chapter on the subject), aggregate
> state is limited to scalars and Drill-defined types. There is no support
> to spill aggregate state, so that state will be lost if spilling is
> required to handle large aggregate batches. The current solution works for
> simple cases such as totals and averages.
>
> Aggregate UDFs share no state, so it is not possible for one function to
> use state accumulated by another. If, for example, you want sum, average
> and standard deviation, you'll have to accumulate the total three times,
> average twice, and so on. Note that the std dev function will require
> buffering all data in one's own array (without any spilling or other
> support), to allow computing the (X-bar - X)^2 part of the calculation.
>
> A UDF can emit a byte array (I'd have to check whether this is true of
> aggregate UDFs). A VarChar is simply a special kind of array, and UDFs can
> emit a VarChar.
>
> All this is from memory and so is only approximately accurate. YMMV.
>
> Thanks,
> - Paul
>
>
>
>    On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  What is the current state of building aggregators that have complex state
> via UDFs?
>
> Is it possible to define multi-level aggregators in a UDF?
>
> Can the output of a UDF be a byte array?
>
>
> (these are three different questions)
>