Can UDFs accumulate a fixed length binary value?
On Mon, Aug 12, 2019 at 11:23 AM Paul Rogers <[email protected]> wrote:

> Hi Ted,
>
> Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over
> the data (ML, say), it can't be done in Drill.
>
> Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
>
> UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
>
> Thanks,
> - Paul
>
> On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning
> <[email protected]> wrote:
>
> Is it possible for a UDF to produce multiple scalar results? Can it
> produce a binary result?
>
> Also, as a nit, standard deviation doesn't require buffering all the data.
> It just requires that you have three accumulators, one for count, one for
> mean and one for mean squared deviation. There is a slightly tricky
> algorithm called Welford's algorithm
> <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
> which allows good numerical stability while computing this on-line.
>
> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Last I checked (when we wrote the book chapter on the subject), aggregate
> > state is limited to scalars and Drill-defined types. There is no support
> > to spill aggregate state, so that state will be lost if spilling is
> > required to handle large aggregate batches. The current solution works
> > for simple cases such as totals and averages.
> >
> > Aggregate UDFs share no state, so it is not possible for one function to
> > use state accumulated by another. If, for example, you want sum, average
> > and standard deviation, you'll have to accumulate the total three times,
> > average twice, and so on. Note that the std dev function will require
> > buffering all data in one's own array (without any spilling or other
> > support), to allow computing the (X-bar - X)^2 part of the calculation.
> >
> > A UDF can emit a byte array (have to check if this is true of aggregate
> > UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> > VarChar.
> >
> > All this is from memory and so is only approximately accurate. YMMV.
> >
> > Thanks,
> > - Paul
> >
> > On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning
> > <[email protected]> wrote:
> >
> > What is the current state of building aggregators that have complex
> > state via UDFs?
> >
> > Is it possible to define multi-level aggregators in a UDF?
> >
> > Can the output of a UDF be a byte array?
> >
> > (these are three different questions)
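
For readers following the thread, here is a minimal sketch of the three-accumulator approach to standard deviation discussed above, written as plain Java rather than as a Drill UDF. The class name WelfordState and its methods are hypothetical and not part of any Drill API. The merge step uses the standard pairwise combination formula (Chan et al.), which the thread does not mention; it is included only to suggest how partial states could be combined in a multi-level aggregation. The toBytes/fromBytes pair shows that the whole state fits in a fixed 24 bytes. Whether a Drill aggregate UDF can actually hold or emit such a value is the open question of this thread, and the sketch does not answer it.

```java
import java.nio.ByteBuffer;

/**
 * Running count, mean, and sum of squared deviations (M2), updated with
 * Welford's online algorithm. Plain Java; hypothetical, not a Drill API.
 */
final class WelfordState {
    /** Fixed encoded size: one long plus two doubles (24 bytes). */
    static final int ENCODED_BYTES = Long.BYTES + 2 * Double.BYTES;

    private long count;   // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one value into the state (Welford's update step). */
    void add(double x) {
        count++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean);  // old-mean deviation times new-mean deviation
    }

    /** Combines another partial state into this one (pairwise formula). */
    void merge(WelfordState other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Population variance; NaN when no values have been added. */
    double populationVariance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    double populationStdDev() {
        return Math.sqrt(populationVariance());
    }

    /** Packs the three accumulators into exactly ENCODED_BYTES bytes. */
    byte[] toBytes() {
        return ByteBuffer.allocate(ENCODED_BYTES)
                .putLong(count)
                .putDouble(mean)
                .putDouble(m2)
                .array();
    }

    /** Rebuilds a state from a buffer produced by toBytes(). */
    static WelfordState fromBytes(byte[] bytes) {
        if (bytes.length != ENCODED_BYTES) {
            throw new IllegalArgumentException("expected " + ENCODED_BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        WelfordState state = new WelfordState();
        state.count = buf.getLong();
        state.mean = buf.getDouble();
        state.m2 = buf.getDouble();
        return state;
    }
}
```

Because the state is a fixed-size triple, it could equally be kept as three separate scalar accumulators (one integer, two doubles), which matches the "scalars and Drill-defined types" limitation Paul describes. The byte encoding only matters if the state has to travel as a single binary value.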
