I am trying to figure out how to build an approximate percentile estimator.
I have a fancy data structure that will do this. It can live in bounded memory with no allocation. I can add numbers to the digest easily enough. And the required results can be extracted from the structure. What I would need to know: - how to use a fixed array of bytes as the state of an aggregating UDF - how to pass in an argument to an aggregator OR (better) how to use the binary result of an aggregator in another function. On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <[email protected]> wrote: > Ted, > Can we ask what it is you are trying to build a UDF for? > --C > > > On Aug 12, 2019, at 2:23 PM, Paul Rogers <[email protected]> > wrote: > > > > Hi Ted, > > > > Thanks for the link; I suspected there was some trick for stddev. The > point still stands that, if the algorithm requires multiple passes over the > data (ML, say), can't be done in Drill. > > > > Each UDF must return exactly one value. It can return a map if you want > multiple values (though someone would have to check that projection works > to convert these to scalar top-level values). AFAIK, a UDF can produce a > binary buffer as output (type VarBinary). But, an aggregate UDF cannot > accumulate a VarChar or VarBinary because Drill cannot insert values into > an existing variable-length vector. > > > > UDFs need your knack for finding a workaround to get your job done; they > have pretty strong limitations on the surface. > > > > Thanks, > > - Paul > > > > > > > > On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning < > [email protected]> wrote: > > > > Is it possible for a UDF to produce multiple scalar results? Can it > produce > > a binary result? > > > > Also, as a nit, standard deviation doesn't require buffering all the > data. > > It just requires that you have three accumulators, one for count, one for > > mean and one for mean squared deviation. There is a slightly tricky > > algorithm called Welford's algorithm > > < > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm > > > > which > > allows good numerical stability while computing this on-line. > > > > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <[email protected]> > > wrote: > > > >> Hi Ted, > >> > >> Last I checked (when we wrote the book chapter on the subject), > aggregate > >> state are limited to scalars and Drill-defined types. There is no > support > >> to spill aggregate state, so that state will be lost if spilling is > >> required to handle large aggregate batches. The current solution works > for > >> simple cases such as totals and averages. > >> > >> Aggregate UDFs share no state, so it is not possible for one function to > >> use state accumulated by another. If, for example, you want sum, average > >> and standard deviation, you'll have to accumulate the total three times, > >> average twice, and so on. Note that the std dev function will require > >> buffering all data in one's own array (without any spilling or other > >> support), to allow computing the (X-bar - X)^2 part of the calculation. > >> > >> A UDF can emit a byte array (have to check it this is true of aggregate > >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a > >> VarChar. > >> > >> All this is from memory and so is only approximately accurate. YMMV. > >> > >> Thanks, > >> - Paul > >> > >> > >> > >> On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning < > >> [email protected]> wrote: > >> > >> What is the current state of building aggregators that have complex > state > >> via UDFs? > >> > >> Is it possible to define multi-level aggregators in a UDF? > >> > >> Can the output of a UDF be a byte array? > >> > >> > >> (these are three different questions) > >> > >
