HI Ted, You might want to take a look at this repo: https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java <https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java> This was an experiment to see if I could write a function to calculate a median. I found a streaming algorithm to do so, but it required the use of two stacks. This was more of a "can I do this" type challenge than a "will this really work well" but I did get it to work. In any event, the way I did it was to use the @Workspace and use an ObjectHolder. Maybe this will help you out. -- C
> On Aug 12, 2019, at 6:03 PM, Paul Rogers <par0...@yahoo.com.INVALID> wrote: > > Hi Ted, > > You are now at the point that you'll have to experiment. Drill provides an > annotation for aggregate state: @Workspace. The value must be declared as a > "holder". You'll have to check if VarBinaryHolder is allowed, and, if so, how > you allocate memory and remember the offset into the array. (My guess is that > this may not work.) > @Workspace does allow you to specify a holder for a Java object, but such > objects won't be spilled to disk when, say, the hash aggregate spills. This > means your aggregate will work fine at small scale, then mysteriously fail > once moved into production. Fun. > > Unless aggregate UDFs are special, they can return a VarChar or VarBinary > result. The book explains how to do this for VarChar, some poking around in > the Drill source should identify how to do so for VarBinary. (There are > crufty details about allocating space, copying over data, etc.) > > FWIW: There is a pile of information on UDF internals on my GitHub Wiki. [1] > Aggregate UDFS are covered in [2]. Once we learn the answers to your specific > questions, we can add the info to the Wiki. > > Thanks, > - Paul > > [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information > > > [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs > > > > > > > On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning > <ted.dunn...@gmail.com> wrote: > > I am trying to figure out how to build an approximate percentile estimator. > > I have a fancy data structure that will do this. It can live in bounded > memory with no allocation. I can add numbers to the digest easily enough. > And the required results can be extracted from the structure. > > What I would need to know: > > - how to use a fixed array of bytes as the state of an aggregating UDF > > - how to pass in an argument to an aggregator OR (better) how to use the > binary result of an aggregator in another function. > > On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cgi...@gmail.com> wrote: > >> Ted, >> Can we ask what it is you are trying to build a UDF for? >> --C >> >>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <par0...@yahoo.com.INVALID> >> wrote: >>> >>> Hi Ted, >>> >>> Thanks for the link; I suspected there was some trick for stddev. The >> point still stands that, if the algorithm requires multiple passes over the >> data (ML, say), can't be done in Drill. >>> >>> Each UDF must return exactly one value. It can return a map if you want >> multiple values (though someone would have to check that projection works >> to convert these to scalar top-level values). AFAIK, a UDF can produce a >> binary buffer as output (type VarBinary). But, an aggregate UDF cannot >> accumulate a VarChar or VarBinary because Drill cannot insert values into >> an existing variable-length vector. >>> >>> UDFs need your knack for finding a workaround to get your job done; they >> have pretty strong limitations on the surface. >>> >>> Thanks, >>> - Paul >>> >>> >>> >>> On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning < >> ted.dunn...@gmail.com> wrote: >>> >>> Is it possible for a UDF to produce multiple scalar results? Can it >> produce >>> a binary result? >>> >>> Also, as a nit, standard deviation doesn't require buffering all the >> data. >>> It just requires that you have three accumulators, one for count, one for >>> mean and one for mean squared deviation. There is a slightly tricky >>> algorithm called Welford's algorithm >>> < >> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm >>> >>> which >>> allows good numerical stability while computing this on-line. >>> >>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0...@yahoo.com.invalid> >>> wrote: >>> >>>> Hi Ted, >>>> >>>> Last I checked (when we wrote the book chapter on the subject), >> aggregate >>>> state are limited to scalars and Drill-defined types. There is no >> support >>>> to spill aggregate state, so that state will be lost if spilling is >>>> required to handle large aggregate batches. The current solution works >> for >>>> simple cases such as totals and averages. >>>> >>>> Aggregate UDFs share no state, so it is not possible for one function to >>>> use state accumulated by another. If, for example, you want sum, average >>>> and standard deviation, you'll have to accumulate the total three times, >>>> average twice, and so on. Note that the std dev function will require >>>> buffering all data in one's own array (without any spilling or other >>>> support), to allow computing the (X-bar - X)^2 part of the calculation. >>>> >>>> A UDF can emit a byte array (have to check it this is true of aggregate >>>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a >>>> VarChar. >>>> >>>> All this is from memory and so is only approximately accurate. YMMV. >>>> >>>> Thanks, >>>> - Paul >>>> >>>> >>>> >>>> On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning < >>>> ted.dunn...@gmail.com> wrote: >>>> >>>> What is the current state of building aggregators that have complex >> state >>>> via UDFs? >>>> >>>> Is it possible to define multi-level aggregators in a UDF? >>>> >>>> Can the output of a UDF be a byte array? >>>> >>>> >>>> (these are three different questions) >>>> >> >>