On 9/20/06, David M. Cooke <[EMAIL PROTECTED]> wrote:
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> Let me offer a third path: the algorithms used for .mean() and .var() are
> substandard. There are much better incremental algorithms that entirely avoid
> the need to accumulate such large (and therefore precision-losing) intermediate
> values. The algorithms look like the following for 1D arrays in Python:
>
> def mean(a):
>      m = a[0]
>      for i in range(1, len(a)):
>          m += (a[i] - m) / (i + 1)
>      return m

This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration). You do avoid
accumulating a large intermediate sum, but dividing each a[i] by
len(a) before adding it up accomplishes the same thing.
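
For concreteness, here is a minimal sketch of that alternative (not from the
original post): pre-scaling each element by 1/len(a) keeps the running total
near the size of the mean, much as the incremental update does.

def mean_scaled(a):
    # Illustrative sketch: scale each element by 1/len(a) before adding,
    # so the accumulator never grows much beyond the magnitude of the mean.
    n = len(a)
    s = a.dtype.type(0)
    for i in range(n):
        s += a[i] / n
    return s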

Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:

def mean(a):
    # Kahan (compensated) summation: e carries the rounding error lost
    # in each addition and feeds it back in with the next element.
    s = e = a.dtype.type(0)
    for i in range(len(a)):
        temp = s
        y = a[i] + e          # next element plus the carried error
        s = temp + y          # accumulate
        e = (temp - s) + y    # recover what the addition just lost
    return s / len(a)
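
As a quick sanity check (my own example, not a measurement from the thread),
accumulating many float32 values shows the difference; the exact numbers will
vary by platform:

import numpy as np

a = np.full(10**6, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for x in a:                 # straightforward left-to-right float32 accumulation
    naive += x

print(naive / len(a))       # drifts visibly away from 0.1
print(mean(a))              # the compensated version stays very close to 0.1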

A variant of this can also be used to generate the sin/cos values for Fourier transforms by repeated complex multiplication. It works amazingly well. But I suspect accumulating in higher precision would be just as good, and faster, if the hardware supports it.
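
One plausible reading of that variant (my sketch, not code from the thread) is
to carry the rounding error of the recurrence z[k+1] = z[k] * w the same way,
writing the update as z + z*(w - 1) with w - 1 formed without cancellation:

import math

def twiddles(n):
    # Generate approximations of exp(2j*pi*k/n), k = 0..n-1, by repeated
    # complex multiplication, carrying the rounding error of each addition
    # exactly as in the Kahan loop above.  Illustrative only.
    theta = 2.0 * math.pi / n
    wm1 = complex(-2.0 * math.sin(theta / 2.0) ** 2, math.sin(theta))  # w - 1
    z = complex(1.0, 0.0)
    e = complex(0.0, 0.0)
    out = []
    for k in range(n):
        out.append(z)
        y = z * wm1 + e      # increment plus the carried error
        t = z + y
        e = (z - t) + y      # error lost in the addition
        z = t
    return out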

Divide and conquer, like an FFT where only the constant coefficient is computed, is also a good bet, but it requires lg(n) passes over the data and extra storage.
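
For reference, one way such a divide-and-conquer (pairwise) sum can look in
Python (my sketch, not from the post); errors grow roughly with the depth of
the recursion rather than with n:

def pairwise_sum(a, lo=0, hi=None):
    # Split the range in half, sum each half, and combine, so partial sums
    # of similar magnitude are added together.  Recursion depth is about lg(n).
    if hi is None:
        hi = len(a)
    if hi - lo <= 8:                    # small base case: plain loop
        s = a.dtype.type(0)
        for i in range(lo, hi):
            s += a[i]
        return s
    mid = (lo + hi) // 2
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid, hi)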

Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better in some cases, but can also be much
worse, than sum(a)/len(a); both are quite poor in comparison to
Kahan summation.

(We could probably use a fast implementation of Kahan summation in
addition to a.sum())
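
As an aside, math.fsum in Python's standard library computes an essentially
correctly rounded sum (Shewchuk's algorithm); it is slow, but handy as a
reference when checking a fast compensated implementation. A short sketch:

import math
import numpy as np

a = np.random.rand(10**5).astype(np.float32)
print(math.fsum(a))                  # accurate reference value
print(a.astype(np.float64).sum())    # accumulate in double for comparison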

> def var(a):
>      m = a[0]
>      t = a.dtype.type(0)
>      for i in range(1, len(a)):
>          q = a[i] - m
>          r = q / (i+1)
>          m += r
>          t += i * q * r
>      t /= len(a)
>      return t
>
> Alternatively, from Knuth:
>
> def var_knuth(a):
>      m = a.dtype.type(0)
>      variance = a.dtype.type(0)
>      for i in range(len(a)):
>          delta = a[i] - m
>          m += delta / (i+1)
>          variance += delta * (a[i] - m)
>      variance /= len(a)
>      return variance

These formulas are good when you can only do one pass over the data
(like on a calculator, where you don't store all the data points), but
they are slightly worse than doing two passes. Kahan summation would
probably help here as well.
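
For comparison, a minimal two-pass version (my sketch, in the same style as
the loops above): compute the mean first, then average the squared deviations
from it.

def var_two_pass(a):
    # First pass: the mean.  Second pass: mean of squared deviations.
    n = len(a)
    m = a.dtype.type(0)
    for i in range(n):
        m += a[i]
    m /= n
    t = a.dtype.type(0)
    for i in range(n):
        d = a[i] - m
        t += d * d
    return t / n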

I think we should leave things as they are. Choosing the right precision for accumulating sums is absolutely fundamental; I always think about it when programming such loops because it is always a potential gotcha. Making the defaults higher precision just kicks the can down the road, and the unwary are still likely to trip over it at some point.
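
For what it's worth, the caller can already make that choice explicit through
the dtype argument that mean() and var() accept, e.g. accumulating a float32
array in double:

import numpy as np

a = np.random.rand(10**6).astype(np.float32)
print(a.mean())                   # accumulates at the array's own precision
print(a.mean(dtype=np.float64))   # accumulate (and return) in double precision
print(a.var(dtype=np.float64))    # same option for the variance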

Perhaps these special tricks for special cases could be included in scipy somewhere.

Chuck
