Tim Hochberg wrote:
Robert Kern wrote:
David M. Cooke wrote:
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
Let me offer a third path: the algorithms used for .mean() and .var() are
substandard. There are much better incremental algorithms that entirely avoid
the need to accumulate such large (and therefore precision-losing) intermediate
values. The algorithms look like the following for 1D arrays in Python:
def mean(a):
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)
    return m
This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration).
With one exception, every test that I've thrown at it shows that it's better for
float32. That exception is uniformly spaced arrays, like linspace().
> You do avoid
> accumulating large sums, but then doing the division a[i]/len(a) and
> adding that will do the same.
Okay, this is true.
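For reference, the divide-first approach described here would look something
like the following (a minimal sketch; mean_divide_first is my name for it,
not a numpy function):

def mean_divide_first(a):
    # Divide each element by n before accumulating, so the running sum
    # stays on the order of the mean instead of growing to n times it.
    n = len(a)
    m = a.dtype.type(0)
    for i in range(n):
        m += a[i] / n
    return m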
Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:
def mean(a):
    s = e = a.dtype.type(0)
    for i in range(len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)
Some numerical experiments in Maple using 5-digit precision show that
your mean can be a bit better in some cases, but can also be much
worse, than sum(a)/len(a); both are quite poor in comparison to Kahan
summation.
(We could probably use a fast implementation of Kahan summation in
addition to a.sum())
+1
def var(a):
    m = a[0]
    t = a.dtype.type(0)
    for i in range(1, len(a)):
        q = a[i] - m
        r = q / (i+1)
        m += r
        t += i * q * r
    t /= len(a)
    return t
Alternatively, from Knuth:
def var_knuth(a):
    m = a.dtype.type(0)
    variance = a.dtype.type(0)
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / (i+1)
        variance += delta * (a[i] - m)
    variance /= len(a)
    return variance
These formulas are good when you can only do one pass over the data
(like on a calculator, where you don't store all the data points), but
they are slightly worse than doing two passes. Kahan summation would
probably be good here too.
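For comparison, the plain two-pass algorithm at issue is, in sketch form
(two_pass_var is an illustrative name, not numpy's implementation):

def two_pass_var(a):
    # First pass computes the mean; second pass averages the squared
    # residuals about that mean.
    n = len(a)
    m = a.sum() / n
    return ((a - m)**2).sum() / n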
Again, my tests show otherwise for float32. I'll condense my ipython log into a
module for everyone's perusal. It's possible that Kahan summation of the
squared residuals will work better than either the current two-pass algorithm
or the implementations I give above.
This is what my tests show as well: var_knuth outperformed any simple two-pass
algorithm I could come up with, even ones using Kahan sums.
Interestingly, for 1D arrays the built-in float32 variance performs
better than it should. After a bit of twiddling around, I discovered that
it actually does most of its calculations in float64. It uses a two-pass
calculation; the result of mean is a scalar, and in the process of
converting that back to an array we end up with float64 values. Or
something like that; I was mostly reverse engineering the sequence of
events from the results.
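One quick way to probe for this kind of inadvertent promotion (a sketch; the
exact types printed depend on the numpy version's promotion rules):

from numpy import float32, linspace

a = linspace(0, 1, 5).astype(float32)
m = a.mean()                 # mean comes back as a scalar, not an array
print type(m)                # what scalar type did we get?
print (a - m).dtype          # was the subtraction promoted to float64?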
Here's a simple example of how var is a little wacky: a shape-[N]
array will give you a different result than a shape-[1,N] array. The
reason is clear -- in the second case the mean is not a scalar, so there
isn't the inadvertent promotion to float64 -- but it's still odd.
>>> data = (1000*(random.random([10000]) - 0.1)).astype(float32)
>>> print data.var() - data.reshape([1, -1]).var(-1)
[ 0.1171875]
I'm going to go ahead and attach a module containing the versions of
mean, var, etc. that I've been playing with, in case someone wants to mess
with them. Some were stolen from traffic on this list; for others I
grabbed the algorithms from Wikipedia or equivalent.
-tim
from numpy import empty_like, add

def raw_kahan_sum(values):
    """raw_kahan_sum(values) -> sum(values), residual

    where sum(values) is computed using Kahan's summation algorithm and the
    residual is the value of the lower-order bits when finished.
    """
    total = c = values.dtype.type(0)
    for x in values:
        y = x + c
        t = total + y
        c = y - (t - total)
        total = t
    return total, c
def sum(values):
    """sum(values) -> sum of values performed using Kahan summation"""
    total, c = raw_kahan_sum(values)
    return total
def mean(values):
    """mean(values) -> mean of values performed using Kahan summation

    The residual of the Kahan sum is used to increase the accuracy.
    """
    total, residual = raw_kahan_sum(values)
    n = float(len(values))
    return total / n + residual / n
def var(values):
    """var(values) -> variance of values computed using Knuth's one-pass
    algorithm"""
    m = variance = values.dtype.type(0)
    for i, x in enumerate(values):
        delta = x - m
        m += delta / (i+1)
        variance += delta * (x - m)
    variance /= len(values)
    return variance
def var_kahan(a):
    """var_kahan(a) -> two-pass variance with Kahan-summed squared residuals"""
    m = mean(a)
    n = len(a)
    total, c = raw_kahan_sum((a - m)**2)
    return total / n + c / n
# This produces the same results as a.var() for 1D float32 arrays.
def var_builtin_eqv(a):
    x = a.mean()
    # Broadcasting the scalar mean through a float64 array reproduces the
    # inadvertent promotion described above.
    m = empty_like(a).astype(float)
    m.fill(x)
    x = a - m
    x = x * x
    total = add.reduce(x)
    return total * (1.0 / len(a))
if __name__ == "__main__":
    from numpy import random, float32, float64

    builtin_better = 0
    kahan_better = 0
    for i in range(10):
        data = (1000*(random.random([10000]) - 0.1)).astype(float32)
        baseline = data.mean(dtype=float64)
        delta_builtin_mean = baseline - data.mean()
        delta_compensated_mean = baseline - mean(data)
        delta_kahan_mean = baseline - (sum(data) / len(data))
        builtin_better += (abs(delta_builtin_mean) <
                           abs(delta_compensated_mean))
        kahan_better += (abs(delta_kahan_mean) < abs(delta_compensated_mean))
        #~ print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean
    print "Builtin was better", builtin_better, "times"
    print "Kahan was better", kahan_better, "times"

    builtin_better = 0
    kahan_better = 0
    for i in range(10):
        data = (1000*(random.random([10000]) - 0.1)).astype(float32)
        baseline = data.var(dtype=float64)
        delta_builtin = baseline - data.var()
        delta_knuth = baseline - var(data)
        delta_kahan = baseline - var_kahan(data)
        builtin_better += (abs(delta_builtin) < abs(delta_knuth))
        kahan_better += (abs(delta_kahan) < abs(delta_knuth))
        print delta_builtin, delta_kahan, delta_knuth
    print "Builtin was better", builtin_better, "times"
    print "Kahan was better", kahan_better, "times"

    data = (1000*(random.random([10000]) - 0.1)).astype(float32)
    print data.var() - data.reshape([1, -1]).var(-1)