>
> stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x)) / (count(x) - 1))


This is not numerically stable; please do not use it. See [1] for some
algorithms that might be better.
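
To make the instability concrete, here is a minimal demonstration (my own
illustration, not from the thread): with a large common offset, sum(x*x) and
sum(x)*sum(x)/count(x) agree in almost all of their significant digits, so
the subtraction cancels nearly everything. The shifted-data variant from [1]
subtracts the offset first and survives:

    #include <cstdio>

    int main() {
      // Four samples with a large common offset; the exact sample
      // variance is 30.
      const double x[4] = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};

      // Textbook formula: sum(x*x) and sum(x)^2/n are both ~4e18, so
      // their difference loses almost all significant digits.
      double sum = 0, sum_sq = 0;
      for (double v : x) { sum += v; sum_sq += v * v; }
      double naive = (sum_sq - sum * sum / 4) / 3;

      // Shifted-data formula from [1]: subtract any in-range value
      // (here x[0]) before accumulating.
      double d_sum = 0, d_sq = 0;
      for (double v : x) { double d = v - x[0]; d_sum += d; d_sq += d * d; }
      double shifted = (d_sq - d_sum * d_sum / 4) / 3;

      printf("naive:   %f\nshifted: %f\n", naive, shifted);
      return 0;
    }

With IEEE doubles the naive result is wildly wrong (it can even come out
negative), while the shifted version recovers 30 exactly.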

> The equation you provided is great in practice to calculate stdev for one
> array. It doesn't address the issue of combining stdev from multiple arrays.


I think what everyone else was implicitly stating is that to combine results
across arrays, std. dev. and average need intermediate state that is
different from the final value being computed. For the average you need to
keep two numbers (the count and the sum); for std. dev. you need one more
(e.g. the sum of squares, or the running M2 from [1]).
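
As a concrete sketch of that state (hypothetical names and signatures, not
Arrow's actual kernel API), using the one-pass update and the pairwise
combination described in [1]:

    #include <cmath>
    #include <cstdint>

    // Mergeable state: count, running mean, and M2 (the sum of squared
    // deviations from the current mean).
    struct VarState {
      int64_t count = 0;
      double mean = 0.0;
      double m2 = 0.0;
    };

    // "consume" one chunk: Welford's one-pass update per element.
    void Consume(VarState* s, const double* values, int64_t n) {
      for (int64_t i = 0; i < n; ++i) {
        s->count += 1;
        double d = values[i] - s->mean;
        s->mean += d / s->count;
        s->m2 += d * (values[i] - s->mean);
      }
    }

    // "merge" two partial states (Chan et al.'s combination, see [1]).
    void Merge(VarState* a, const VarState& b) {
      if (b.count == 0) return;
      int64_t n = a->count + b.count;
      double d = b.mean - a->mean;
      a->m2 += b.m2 + d * d * a->count * b.count / n;
      a->mean += d * b.count / n;
      a->count = n;
    }

    // "finalize": sample standard deviation from the merged state.
    double Finalize(const VarState& s) {
      return std::sqrt(s.m2 / (s.count - 1));
    }

The per-element division in Consume is the cost Yibo mentions below; one
alternative is to accumulate count/sum/sum-of-squares per chunk (which
vectorizes well) and deal with stability only when merging and finalizing,
e.g. by shifting each chunk by its first element as in [1].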

For percentiles, I think calculating exact values will require keeping quite
a lot of state (for integers, a histogram approach could be used to compress
this). There are, however, good approximation algorithms that can be used
when exact values are not necessary (for example t-digest [2]). At some
point Arrow should probably have both.
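
For the integer histogram idea, a mergeable exact-percentile state could
look something like this (again a sketch with made-up names, using the
nearest-rank definition of percentile):

    #include <cmath>
    #include <cstdint>
    #include <map>

    // State: value -> occurrence count. "merge" just adds counts, so
    // chunks can be consumed independently and combined in any order.
    using Histogram = std::map<int64_t, int64_t>;

    void Consume(Histogram* h, const int64_t* values, int64_t n) {
      for (int64_t i = 0; i < n; ++i) (*h)[values[i]] += 1;
    }

    void Merge(Histogram* h, const Histogram& other) {
      for (const auto& [value, count] : other) (*h)[value] += count;
    }

    // "finalize": nearest-rank percentile, p in (0, 100].
    int64_t Finalize(const Histogram& h, double p) {
      int64_t total = 0;
      for (const auto& [value, count] : h) total += count;
      int64_t rank = static_cast<int64_t>(std::ceil(p / 100.0 * total));
      int64_t seen = 0;
      for (const auto& [value, count] : h) {
        seen += count;
        if (seen >= rank) return value;
      }
      return 0;  // only reached for an empty histogram
    }

The state here is bounded by the number of distinct values rather than the
number of elements, which is why it compresses well for integers; for
floating point or high-cardinality data that bound disappears, which is
where an approximation like t-digest [2] comes in.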

[1]
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
[2] https://github.com/tdunning/t-digest

On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:

> Thanks Andrew. The link gives a cool method to calculate variance
> incrementally. I think the problem is that it's computationally too
> expensive (it cannot leverage vectorization and needs three divisions per
> data point).
> The equation you provided is great in practice to calculate stdev for one
> array. It doesn't address the issue of combining stdev from multiple arrays.
>
> On 9/16/20 6:25 PM, Andrew Lamb wrote:
> > Perhaps you can rewrite the functions in terms of other kernels that
> > can be merged -- for example something like the following
> >
> > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x) / count(x)) / (count(x) - 1))
> >
> > (loosely translated from
> > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
> > )
> >
> > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
> >
> >> Hi,
> >>
> >> I have a question about aggregate kernel implementation. Any help is
> >> appreciated.
> >>
> >> Aggregate kernels implement "consume" and "merge" interfaces. For a
> >> chunked array, "consume" is called on each array to get a temporary
> >> aggregated result, which is then "merged" with the previously consumed
> >> result. For associative operations like min/max/sum, this pattern is
> >> convenient. We can easily "merge" the min/max/sum of two arrays, e.g.,
> >> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
> >>
> >> But I wonder what's the best approach for operations like
> >> stdev/percentile. Results of these operations cannot be easily "merged";
> >> we have to walk through all the chunks to get the result. For these
> >> operations, it looks like "consume" must copy the input array and do all
> >> the calculation at "finalize" time. Or maybe we shouldn't expect chunked
> >> arrays to be supported for them.
> >>
> >> Yibo
> >>
> >
>
