Charles,

That might work. The t-digest will give us a median estimate.



On Mon, Aug 12, 2019 at 4:33 PM Charles Givre <cgi...@gmail.com> wrote:

> HI Ted,
> You might want to take a look at this repo:
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
> <
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
> >
> This was an experiment to see if I could write a function to calculate a
> median.  I found a streaming algorithm to do so, but it required the use of
> two stacks.  This was more of a "can I do this" type challenge than a "will
> this really work well" but I did get it to work.  In any event, the way I
> did it was to use the @Workspace and use an ObjectHolder.  Maybe this will
> help you out.
> -- C
>
>
> > On Aug 12, 2019, at 6:03 PM, Paul Rogers <par0...@yahoo.com.INVALID>
> wrote:
> >
> > Hi Ted,
> >
> > You are now at the point that you'll have to experiment. Drill provides
> an annotation for aggregate state:  @Workspace. The value must be declared
> as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if
> so, how you allocate memory and remember the offset into the array. (My
> guess is that this may not work.)
> > @Workspace does allow you to specify a holder for a Java object, but
> such objects won't be spilled to disk when, say, the hash aggregate spills.
> This means your aggregate will work fine at small scale, then mysteriously
> fail once moved into production. Fun.
> >
> > Unless aggregate UDFs are special, they can return a VarChar or
> VarBinary result. The book explains how to do this for VarChar, some poking
> around in the Drill source should identify how to do so for VarBinary.
> (There are crufty details about allocating space, copying over data, etc.)
> >
> > FWIW: There is a pile of information on UDF internals on my GitHub Wiki.
> [1] Aggregate UDFS are covered in [2]. Once we learn the answers to your
> specific questions, we can add the info to the Wiki.
> >
> > Thanks,
> > - Paul
> >
> > [1]
> https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> >
> >
> > [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> >
> >
> >
> >
> >
> >
> >    On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
> >
> > I am trying to figure out how to build an approximate percentile
> estimator.
> >
> > I have a fancy data structure that will do this. It can live in bounded
> > memory with no allocation. I can add numbers to the digest easily enough.
> > And the required results can be extracted from the structure.
> >
> > What I would need to know:
> >
> > - how to use a fixed array of bytes as the state of an aggregating UDF
> >
> > - how to pass in an argument to an aggregator OR (better) how to use the
> > binary result of an aggregator in another function.
> >
> > On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <cgi...@gmail.com> wrote:
> >
> >> Ted,
> >> Can we ask what it is you are trying to build a UDF for?
> >> --C
> >>
> >>> On Aug 12, 2019, at 2:23 PM, Paul Rogers <par0...@yahoo.com.INVALID>
> >> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> Thanks for the link; I suspected there was some trick for stddev. The
> >> point still stands that, if the algorithm requires multiple passes over
> the
> >> data (ML, say), can't be done in Drill.
> >>>
> >>> Each UDF must return exactly one value. It can return a map if you want
> >> multiple values (though someone would have to check that projection
> works
> >> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> >> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> >> accumulate a VarChar or VarBinary because Drill cannot insert values
> into
> >> an existing variable-length vector.
> >>>
> >>> UDFs need your knack for finding a workaround to get your job done;
> they
> >> have pretty strong limitations on the surface.
> >>>
> >>> Thanks,
> >>> - Paul
> >>>
> >>>
> >>>
> >>>     On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> >> ted.dunn...@gmail.com> wrote:
> >>>
> >>> Is it possible for a UDF to produce multiple scalar results? Can it
> >> produce
> >>> a binary result?
> >>>
> >>> Also, as a nit, standard deviation doesn't require buffering all the
> >> data.
> >>> It just requires that you have three accumulators, one for count, one
> for
> >>> mean and one for mean squared deviation.  There is a slightly tricky
> >>> algorithm called Welford's algorithm
> >>> <
> >>
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >>>
> >>> which
> >>> allows good numerical stability while computing this on-line.
> >>>
> >>> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <par0...@yahoo.com.invalid
> >
> >>> wrote:
> >>>
> >>>> Hi Ted,
> >>>>
> >>>> Last I checked (when we wrote the book chapter on the subject),
> >> aggregate
> >>>> state are limited to scalars and Drill-defined types. There is no
> >> support
> >>>> to spill aggregate state, so that state will be lost if spilling is
> >>>> required to handle large aggregate batches. The current solution works
> >> for
> >>>> simple cases such as totals and averages.
> >>>>
> >>>> Aggregate UDFs share no state, so it is not possible for one function
> to
> >>>> use state accumulated by another. If, for example, you want sum,
> average
> >>>> and standard deviation, you'll have to accumulate the total three
> times,
> >>>> average twice, and so on. Note that the std dev function will require
> >>>> buffering all data in one's own array (without any spilling or other
> >>>> support), to allow computing the (X-bar - X)^2 part of the
> calculation.
> >>>>
> >>>> A UDF can emit a byte array (have to check it this is true of
> aggregate
> >>>> UDFs). A VarChar is simply a special kind of array, and UDFs can emit
> a
> >>>> VarChar.
> >>>>
> >>>> All this is from memory and so is only approximately accurate. YMMV.
> >>>>
> >>>> Thanks,
> >>>> - Paul
> >>>>
> >>>>
> >>>>
> >>>>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >>>> ted.dunn...@gmail.com> wrote:
> >>>>
> >>>>   What is the current state of building aggregators that have complex
> >> state
> >>>> via UDFs?
> >>>>
> >>>> Is it possible to define multi-level aggregators in a UDF?
> >>>>
> >>>> Can the output of a UDF be a byte array?
> >>>>
> >>>>
> >>>> (these are three different questions)
> >>>>
> >>
> >>
>
>

Reply via email to