Hi,

> I think what everyone else was potentially stating implicitly is that for
> combining details about arrays, for std. dev. and average there needs to
> be more state kept that is different from the elements that one is
> actually dealing with. For std. dev. you need to keep two numbers (same
> with average).

This. Irrespective of which algorithm we use to compute an aggregate, the
core idea is that we split the calculation into batches, and we need to be
able to reduce each batch to a set of states (e.g. sum, count), so that we
can reduce these states to a single state, which can be used to compute the
final result.
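
As a concrete illustration of that pattern for the average (a minimal
sketch with hypothetical names, not Arrow's actual kernel API):

// Each batch reduces to a (sum, count) state; states merge by addition;
// the final result is computed once from the merged state.
#include <cstdint>
#include <vector>

struct MeanState {
  double sum = 0.0;
  int64_t count = 0;
};

// "consume": reduce one batch to a state.
MeanState Consume(const std::vector<double>& batch) {
  MeanState s;
  for (double v : batch) {
    s.sum += v;
    ++s.count;
  }
  return s;
}

// "merge": combine the states of two batches.
MeanState Merge(const MeanState& a, const MeanState& b) {
  return MeanState{a.sum + b.sum, a.count + b.count};
}

// "finalize": compute the result from the single merged state.
double Finalize(const MeanState& s) {
  return s.count == 0 ? 0.0 : s.sum / static_cast<double>(s.count);
}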

Also related: https://issues.apache.org/jira/browse/ARROW-9779

Interestingly, Spark uses sum / count
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.
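
For completeness, the numerically stable way to do the same for
variance/stdev is to keep a (count, mean, M2) state per batch and merge
states with the parallel form of Welford's algorithm (Chan et al., covered
by [1] below). A rough sketch, same caveats as above:

#include <cmath>
#include <cstdint>

struct VarState {
  int64_t n = 0;     // number of values
  double mean = 0.0; // running mean
  double m2 = 0.0;   // sum of squared deviations from the mean
};

// Merge two per-batch states (Chan et al. parallel update).
VarState Merge(const VarState& a, const VarState& b) {
  VarState out;
  out.n = a.n + b.n;
  if (out.n == 0) return out;
  double delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.n / out.n;
  out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
  return out;
}

// Sample standard deviation from the final merged state (requires n > 1).
double Stddev(const VarState& s) {
  return std::sqrt(s.m2 / (s.n - 1));
}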

Best,
Jorge


On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <andrew.r.wiete...@gmail.com>
wrote:

> Dear all
>
> I'm not sure I'm thinking about this right, but if we're looking to
> leverage vectorization for standard deviation/variance would it make sense
> to compute the sum, the sum of squares, and the total number of data points (N)
> over all chunks and compute the actual function,
>
> stdev = sqrt(sum_squares/N - (sum/N)^2)
>
> only once at the end? This is one of the approaches in [1].
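>
> Roughly, the state would be (a hypothetical sketch; note the numerical
> stability caveat Micah raises below):
>
> #include <cmath>
> #include <cstdint>
>
> // Per-chunk state: plain sums, which vectorize well and merge by
> // addition. The closed form below is the numerically unstable one.
> struct SumSqState {
>   int64_t n = 0;
>   double sum = 0.0;
>   double sum_squares = 0.0;
> };
>
> SumSqState Merge(const SumSqState& a, const SumSqState& b) {
>   return SumSqState{a.n + b.n, a.sum + b.sum,
>                     a.sum_squares + b.sum_squares};
> }
>
> // stdev = sqrt(sum_squares/N - (sum/N)^2), evaluated once at the end.
> double PopulationStddev(const SumSqState& s) {
>   double mean = s.sum / static_cast<double>(s.n);
>   return std::sqrt(s.sum_squares / s.n - mean * mean);
> }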
>
> Best wishes
> Andrew
>
> On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
> >
> > This is not numerically stable. Please do not use it. Please see [1] for
> > some algorithms that might be better.
> >
> > > The equation you provided is great in practice to calculate stdev for
> > > one array. It doesn't address the issue of combining stdev from
> > > multiple arrays.
> >
> >
> > I think what everyone else was potentially stating implicitly is that
> > for combining details about arrays, for std. dev. and average there
> > needs to be more state kept that is different from the elements that one
> > is actually dealing with. For std. dev. you need to keep two numbers
> > (same with average).
> >
> > For percentiles, I think calculating exactly will require quite a large
> > state (for integers, a histogram approach could be used to compress
> > this). There are, however, some good approximation algorithms that can
> > be used if exact values are not necessary (for example t-digest [2]). At
> > some point Arrow should probably have both.
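> >
> > As a toy illustration of the histogram idea for exact percentiles over a
> > small integer domain (hypothetical code, not a proposal for Arrow's
> > API):
> >
> > #include <cstdint>
> > #include <vector>
> >
> > // State is a histogram over [0, domain); merging is element-wise add.
> > struct HistState {
> >   std::vector<int64_t> counts;
> >   explicit HistState(size_t domain) : counts(domain, 0) {}
> > };
> >
> > void Consume(HistState* s, const std::vector<uint32_t>& batch) {
> >   for (uint32_t v : batch) ++s->counts[v];
> > }
> >
> > void Merge(HistState* a, const HistState& b) {
> >   for (size_t i = 0; i < a->counts.size(); ++i) a->counts[i] += b.counts[i];
> > }
> >
> > // Smallest value whose cumulative count reaches fraction q of the total.
> > uint32_t Percentile(const HistState& s, double q) {
> >   int64_t total = 0;
> >   for (int64_t c : s.counts) total += c;
> >   int64_t cum = 0;
> >   for (size_t i = 0; i < s.counts.size(); ++i) {
> >     cum += s.counts[i];
> >     if (cum >= q * total) return static_cast<uint32_t>(i);
> >   }
> >   return static_cast<uint32_t>(s.counts.size() - 1);
> > }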
> >
> > [1]
> > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
> > [2] https://github.com/tdunning/t-digest
> >
> > On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai <yibo....@arm.com> wrote:
> >
> > > Thanks Andrew. The link gives a cool method to calculate variance
> > > incrementally. I think the problem is that it's computationally too
> > > expensive (cannot leverage vectorization, three divisions for a single
> > > data point).
> > > The equation you provided is great in practice to calculate stdev for
> > > one array. It doesn't address the issue of combining stdev from
> > > multiple arrays.
> > >
> > > On 9/16/20 6:25 PM, Andrew Lamb wrote:
> > > > Perhaps you can rewrite the functions in terms of other kernels that
> > > > can be merged -- for example something like the following
> > > >
> > > > stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x)) / (count(x)-1))
> > > >
> > > > (loosely translated from
> > > > https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
> > > > )
> > > >
> > > > On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai <yibo....@arm.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I have a question about aggregate kernel implementation. Any help is
> > > >> appreciated.
> > > >>
> > > >> Aggregate kernel implements "consume" and "merge" interfaces. For a
> > > >> chunked array, "consume" is called for each array to get a temporary
> > > >> aggregated result, then "merge" it with the previously consumed
> > > >> result. For associative operations like min/max/sum, this pattern is
> > > >> convenient. We can easily "merge" min/max/sum of two arrays, e.g.,
> > > >> sum([array_a, array_b]) = sum(array_a) + sum(array_b).
> > > >>
> > > >> But I wonder what's the best approach to deal with operations like
> > > >> stdev/percentile. Results of these operations cannot be easily
> > > >> "merged". We have to walk through all the chunks to get the result.
> > > >> For these operations, it looks like "consume" must copy the input
> > > >> array and do all the calculation once at "finalize" time. Or perhaps
> > > >> we shouldn't expect chunked array support for them.
> > > >>
> > > >> Yibo
> > > >>
> > > >
> > >
> >
>
