Re: complex data structure aggregators?

Ted Dunning Mon, 12 Aug 2019 13:19:59 -0700

I am trying to figure out how to build an approximate percentile estimator.


I have a fancy data structure that will do this. It can live in bounded
memory with no allocation. I can add numbers to the digest easily enough.
And the required results can be extracted from the structure.

What I would need to know:

- how to use a fixed array of bytes as the state of an aggregating UDF

- how to pass in an argument to an aggregator OR (better) how to use the
binary result of an aggregator in another function.
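
To make the first question concrete, here is a minimal sketch of aggregate state that lives entirely in a fixed byte array. It is plain Java and does not use the Drill UDF API. The class name FixedStateWelford and the 24-byte layout are hypothetical, chosen only for illustration. A three-accumulator running variance (Welford's algorithm, discussed in the quoted messages below) stands in for the real digest, which would need a larger but still fixed-size array.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch: count, mean and M2 (sum of squared deviations from the
 * mean) packed into a caller-supplied byte array and updated in place with
 * Welford's online algorithm. Not Drill code; the layout is an assumption.
 */
final class FixedStateWelford {
    static final int STATE_BYTES = 24;   // 8 (count) + 8 (mean) + 8 (M2)
    private static final int COUNT = 0;  // byte offset of the long count
    private static final int MEAN = 8;   // byte offset of the double mean
    private static final int M2 = 16;    // byte offset of the double M2

    private final ByteBuffer buf;        // a view over the array, not a copy

    /** Wraps existing state; an all-zero array is the empty state. */
    FixedStateWelford(byte[] state) {
        if (state.length < STATE_BYTES) {
            throw new IllegalArgumentException("state needs " + STATE_BYTES + " bytes");
        }
        this.buf = ByteBuffer.wrap(state);
    }

    /** Folds one value into the state without allocating. */
    void add(double x) {
        long n = buf.getLong(COUNT) + 1;
        double mean = buf.getDouble(MEAN);
        double delta = x - mean;
        mean += delta / n;
        double m2 = buf.getDouble(M2) + delta * (x - mean);
        buf.putLong(COUNT, n);
        buf.putDouble(MEAN, mean);
        buf.putDouble(M2, m2);
    }

    long count() {
        return buf.getLong(COUNT);
    }

    double mean() {
        return buf.getDouble(MEAN);
    }

    /** Sample variance (divides by n - 1); NaN with fewer than two values. */
    double variance() {
        long n = count();
        return n < 2 ? Double.NaN : buf.getDouble(M2) / (n - 1);
    }
}
```

Because all of the state sits in the byte array, a second FixedStateWelford wrapped around the same bytes reads back the same count, mean and variance. That is the property the second question is after: one function produces the binary state and another function interprets it.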

On Mon, Aug 12, 2019 at 11:25 AM Charles Givre <[email protected]> wrote:

> Ted,
> Can we ask what it is you are trying to build a UDF for?
> --C
>
> > On Aug 12, 2019, at 2:23 PM, Paul Rogers <[email protected]>
> wrote:
> >
> > Hi Ted,
> >
> > Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), it can't be done in Drill.
> >
> > Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
> >
> > UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> [email protected]> wrote:
> >
> > Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> > a binary result?
> >
> > Also, as a nit, standard deviation doesn't require buffering all the
> data.
> > It just requires that you have three accumulators, one for count, one for
> > mean and one for mean squared deviation.  There is a slightly tricky
> > algorithm called Welford's algorithm
> > <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> > which
> > allows good numerical stability while computing this on-line.
> >
> > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers <[email protected]>
> > wrote:
> >
> >> Hi Ted,
> >>
> >> Last I checked (when we wrote the book chapter on the subject),
> aggregate
> >> state is limited to scalars and Drill-defined types. There is no
> support
> >> to spill aggregate state, so that state will be lost if spilling is
> >> required to handle large aggregate batches. The current solution works
> for
> >> simple cases such as totals and averages.
> >>
> >> Aggregate UDFs share no state, so it is not possible for one function to
> >> use state accumulated by another. If, for example, you want sum, average
> >> and standard deviation, you'll have to accumulate the total three times,
> >> average twice, and so on. Note that the std dev function will require
> >> buffering all data in one's own array (without any spilling or other
> >> support), to allow computing the (X-bar - X)^2 part of the calculation.
> >>
> >> A UDF can emit a byte array (have to check if this is true of aggregate
> >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> >> VarChar.
> >>
> >> All this is from memory and so is only approximately accurate. YMMV.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>     On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >> [email protected]> wrote:
> >>
> >>   What is the current state of building aggregators that have complex
> state
> >> via UDFs?
> >>
> >> Is it possible to define multi-level aggregators in a UDF?
> >>
> >> Can the output of a UDF be a byte array?
> >>
> >>
> >> (these are three different questions)
> >>
>
>
