Charles, That might work. The t-digest will give us a median estimate.
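
For comparison with the t-digest estimate, here is a minimal sketch of an exact running median in plain Java. It uses the common two-heap formulation, which is an assumption about the kind of two-structure streaming algorithm Charles mentions in the quoted message; it is not taken from the linked repository, and the class name RunningMedian is hypothetical. It is not a Drill UDF.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical standalone class for illustration only; not Drill code.
final class RunningMedian {
    // Max-heap holding the smaller half of the values seen so far.
    private final PriorityQueue<Double> lower =
            new PriorityQueue<>(Collections.reverseOrder());
    // Min-heap holding the larger half of the values seen so far.
    private final PriorityQueue<Double> upper = new PriorityQueue<>();

    void add(double x) {
        if (lower.isEmpty() || x <= lower.peek()) {
            lower.add(x);
        } else {
            upper.add(x);
        }
        // Rebalance so that lower has either the same size as upper
        // or exactly one more element.
        if (lower.size() > upper.size() + 1) {
            upper.add(lower.poll());
        } else if (upper.size() > lower.size()) {
            lower.add(upper.poll());
        }
    }

    double median() {
        if (lower.isEmpty()) {
            throw new IllegalStateException("no values added");
        }
        if (lower.size() > upper.size()) {
            return lower.peek();
        }
        return (lower.peek() + upper.peek()) / 2.0;
    }

    public static void main(String[] args) {
        RunningMedian m = new RunningMedian();
        for (double x : new double[] {5, 1, 3, 2}) {
            m.add(x);
        }
        System.out.println(m.median());
    }
}
```

Note that this keeps every value it has seen, so its memory grows with the input. That is the main difference from a t-digest, which stays in bounded memory and returns an estimate rather than the exact median.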
On Mon, Aug 12, 2019 at 4:33 PM Charles Givre <cgi...@gmail.com> wrote:

> Hi Ted,
> You might want to take a look at this repo:
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
>
> This was an experiment to see if I could write a function to calculate a
> median. I found a streaming algorithm to do so, but it required the use of
> two stacks. This was more of a "can I do this" type challenge than a "will
> this really work well" but I did get it to work. In any event, the way I
> did it was to use the @Workspace and use an ObjectHolder. Maybe this will
> help you out.
> -- C
>
> > On Aug 12, 2019, at 6:03 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> >
> > Hi Ted,
> >
> > You are now at the point that you'll have to experiment. Drill provides
> > an annotation for aggregate state: @Workspace. The value must be declared
> > as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if
> > so, how you allocate memory and remember the offset into the array. (My
> > guess is that this may not work.)
> >
> > @Workspace does allow you to specify a holder for a Java object, but
> > such objects won't be spilled to disk when, say, the hash aggregate spills.
> > This means your aggregate will work fine at small scale, then mysteriously
> > fail once moved into production. Fun.
> >
> > Unless aggregate UDFs are special, they can return a VarChar or
> > VarBinary result. The book explains how to do this for VarChar; some poking
> > around in the Drill source should identify how to do so for VarBinary.
> > (There are crufty details about allocating space, copying over data, etc.)
> >
> > FWIW: There is a pile of information on UDF internals on my GitHub Wiki.
> > [1] Aggregate UDFs are covered in [2].
> > Once we learn the answers to your
> > specific questions, we can add the info to the Wiki.
> >
> > Thanks,
> > - Paul
> >
> > [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> >
> > [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> >
> > On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > I am trying to figure out how to build an approximate percentile
> > estimator.
> >
> > I have a fancy data structure that will do this. It can live in bounded
> > memory with no allocation. I can add numbers to the digest easily enough.
> > And the required results can be extracted from the structure.
> >
> > What I would need to know:
> >
> > - how to use a fixed array of bytes as the state of an aggregating UDF
> >
> > - how to pass in an argument to an aggregator OR (better) how to use the
> > binary result of an aggregator in another function.
> >
> > On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cgi...@gmail.com> wrote:
> >
> >> Ted,
> >> Can we ask what it is you are trying to build a UDF for?
> >> --C
> >>
> >>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> Thanks for the link; I suspected there was some trick for stddev. The
> >>> point still stands that, if the algorithm requires multiple passes over
> >>> the data (ML, say), it can't be done in Drill.
> >>>
> >>> Each UDF must return exactly one value. It can return a map if you want
> >>> multiple values (though someone would have to check that projection
> >>> works to convert these to scalar top-level values). AFAIK, a UDF can
> >>> produce a binary buffer as output (type VarBinary). But, an aggregate
> >>> UDF cannot accumulate a VarChar or VarBinary because Drill cannot insert
> >>> values into an existing variable-length vector.
> >>>
> >>> UDFs need your knack for finding a workaround to get your job done;
> >>> they have pretty strong limitations on the surface.
> >>>
> >>> Thanks,
> >>> - Paul
> >>>
> >>> On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>
> >>> Is it possible for a UDF to produce multiple scalar results? Can it
> >>> produce a binary result?
> >>>
> >>> Also, as a nit, standard deviation doesn't require buffering all the
> >>> data. It just requires that you have three accumulators, one for count,
> >>> one for mean and one for mean squared deviation. There is a slightly
> >>> tricky algorithm called Welford's algorithm
> >>> <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
> >>> which allows good numerical stability while computing this on-line.
> >>>
> >>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:
> >>>
> >>>> Hi Ted,
> >>>>
> >>>> Last I checked (when we wrote the book chapter on the subject),
> >>>> aggregate state is limited to scalars and Drill-defined types. There is
> >>>> no support to spill aggregate state, so that state will be lost if
> >>>> spilling is required to handle large aggregate batches. The current
> >>>> solution works for simple cases such as totals and averages.
> >>>>
> >>>> Aggregate UDFs share no state, so it is not possible for one function
> >>>> to use state accumulated by another. If, for example, you want sum,
> >>>> average and standard deviation, you'll have to accumulate the total
> >>>> three times, average twice, and so on. Note that the std dev function
> >>>> will require buffering all data in one's own array (without any
> >>>> spilling or other support), to allow computing the (X-bar - X)^2 part
> >>>> of the calculation.
> >>>>
> >>>> A UDF can emit a byte array (have to check if this is true of
> >>>> aggregate UDFs).
> >>>> A VarChar is simply a special kind of array, and UDFs can emit a
> >>>> VarChar.
> >>>>
> >>>> All this is from memory and so is only approximately accurate. YMMV.
> >>>>
> >>>> Thanks,
> >>>> - Paul
> >>>>
> >>>> On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>
> >>>> What is the current state of building aggregators that have complex
> >>>> state via UDFs?
> >>>>
> >>>> Is it possible to define multi-level aggregators in a UDF?
> >>>>
> >>>> Can the output of a UDF be a byte array?
> >>>>
> >>>> (these are three different questions)
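
For reference, the three-accumulator approach to standard deviation that Ted describes in the thread (Welford's online algorithm) can be sketched in plain Java as follows. This is a minimal illustration, not a Drill UDF; the class name WelfordAccumulator and its method names are hypothetical. It keeps the running sum of squared deviations (usually called M2) as the third accumulator; dividing M2 by the count gives the mean squared deviation.

```java
// Hypothetical standalone class for illustration only; not Drill code.
final class WelfordAccumulator {
    private long count;   // number of values seen so far
    private double mean;  // running mean of the values
    private double m2;    // running sum of squared deviations from the mean

    // Fold one value into the three accumulators in a single pass.
    void add(double x) {
        count++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / count;     // update the mean
        m2 += delta * (x - mean);  // uses both the old and the new mean
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    // Population variance: M2 divided by n.
    double populationVariance() {
        return count > 0 ? m2 / count : Double.NaN;
    }

    // Sample variance: M2 divided by (n - 1).
    double sampleVariance() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    double populationStdDev() {
        return Math.sqrt(populationVariance());
    }

    public static void main(String[] args) {
        WelfordAccumulator acc = new WelfordAccumulator();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            acc.add(x);
        }
        System.out.println("mean=" + acc.mean()
                + " stddev=" + acc.populationStdDev());
    }
}
```

The state here is three scalars (one long and two doubles), so no buffering of the input is needed, and it stays within the scalar-only limit on aggregate state that Paul describes.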