[Sorry, this version should have less munged formatting since I clipped the comments. Oh, and the Kahan sum algorithm was grabbed from Wikipedia, not MathWorld.]
Tim Hochberg wrote:

> Robert Kern wrote:
>
>> David M. Cooke wrote:
>>
>>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>>
>>>> Let me offer a third path: the algorithms used for .mean() and .var()
>>>> are substandard. There are much better incremental algorithms that
>>>> entirely avoid the need to accumulate such large (and therefore
>>>> precision-losing) intermediate values. The algorithms look like the
>>>> following for 1D arrays in Python:
>>>>
>>>> def mean(a):
>>>>     m = a[0]
>>>>     for i in range(1, len(a)):
>>>>         m += (a[i] - m) / (i + 1)
>>>>     return m
>>>
>>> This isn't really going to be any better than using a simple sum.
>>> It'll also be slower (a division per iteration).
>>
>> With one exception, every test that I've thrown at it shows that it's
>> better for float32. That exception is uniformly spaced arrays, like
>> linspace().
>
> Here's a version of mean, based on the Kahan sum and including the
> compensation term, that seems to be consistently much better than the
> builtin mean. It seems to be typically 4 orders of magnitude closer to
> the "correct" answer than the builtin mean and 2 orders of magnitude
> better than just naively using the Kahan sum. I'm using the mean
> computed with dtype=float64 as the "correct" value.
>
> from numpy import random, float32, float64
>
> def rawKahanSum(values):
>     # Kahan (compensated) summation; c carries the running rounding error
>     if not len(values):
>         return 0.0, 0.0
>     total = values[0]
>     c = type(total)(0.0)
>     for x in values[1:]:
>         y = x - c
>         t = total + y
>         c = (t - total) - y
>         total = t
>     return total, c
>
> def kahanSum(values):
>     total, c = rawKahanSum(values)
>     return total
>
> def mean(values):
>     total, c = rawKahanSum(values)
>     n = float(len(values))
>     return total / n - c / n
>
> for i in range(100):
>     data = random.random([10000]).astype(float32)
>     baseline = data.mean(dtype=float64)
>     delta_builtin_mean = baseline - data.mean()
>     delta_compensated_mean = baseline - mean(data)
>     delta_kahan_mean = baseline - (kahanSum(data) / len(data))
>     if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
>         print ">>>",
>         print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean
>
> -tim
>
>>> You do avoid accumulating large sums, but then doing the division
>>> a[i]/len(a) and adding that will do the same.
>>
>> Okay, this is true.
>>
>>> Now, if you want to avoid losing precision, you want to use a better
>>> summation technique, like compensated (or Kahan) summation:
>>>
>>> def mean(a):
>>>     s = e = a.dtype.type(0)
>>>     for i in range(0, len(a)):
>>>         temp = s
>>>         y = a[i] + e
>>>         s = temp + y
>>>         e = (temp - s) + y
>>>     return s / len(a)
>>>
>>> Some numerical experiments in Maple using 5-digit precision show that
>>> your mean is maybe a bit better in some cases, but can also be much
>>> worse, than sum(a)/len(a), but both are quite poor in comparison to
>>> the Kahan summation.
>>> (We could probably use a fast implementation of Kahan summation in
>>> addition to a.sum())
>>
>> +1
>>
>>>> def var(a):
>>>>     m = a[0]
>>>>     t = a.dtype.type(0)
>>>>     for i in range(1, len(a)):
>>>>         q = a[i] - m
>>>>         r = q / (i + 1)
>>>>         m += r
>>>>         t += i * q * r
>>>>     t /= len(a)
>>>>     return t
>>>>
>>>> Alternatively, from Knuth:
>>>>
>>>> def var_knuth(a):
>>>>     m = a.dtype.type(0)
>>>>     variance = a.dtype.type(0)
>>>>     for i in range(len(a)):
>>>>         delta = a[i] - m
>>>>         m += delta / (i + 1)
>>>>         variance += delta * (a[i] - m)
>>>>     variance /= len(a)
>>>>     return variance
>>>
>>> These formulas are good when you can only do one pass over the data
>>> (like in a calculator where you don't store all the data points), but
>>> are slightly worse than doing two passes. Kahan summation would
>>> probably also be good here too.
>>
>> Again, my tests show otherwise for float32. I'll condense my ipython
>> log into a module for everyone's perusal. It's possible that the Kahan
>> summation of the squared residuals will work better than the current
>> two-pass algorithm and the implementations I give above.
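
For anyone who wants to try Tim's comparison without reassembling the
thread, here is a minimal self-contained sketch of the compensated mean
(modern Python 3 syntax; the kahan_mean name and the NumPy scaffolding are
mine, the algorithm is the Kahan sum quoted above):

    import numpy as np

    def kahan_mean(a):
        # Compensated (Kahan) summation in the array's own precision;
        # c accumulates the rounding error lost from the running total.
        total = a.dtype.type(0)
        c = a.dtype.type(0)
        for x in a:
            y = x - c
            t = total + y
            c = (t - total) - y
            total = t
        n = len(a)
        # Fold the leftover compensation term back into the mean.
        return total / n - c / n

    data = np.random.random(10000).astype(np.float32)
    baseline = data.mean(dtype=np.float64)  # high-precision reference
    print("builtin mean error:    ", abs(baseline - data.mean()))
    print("compensated mean error:", abs(baseline - kahan_mean(data)))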
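
And a similar sketch for the variance comparison Robert describes (this is
not his promised module; var_knuth is transcribed from the quoted code, the
rest of the harness is my assumption):

    import numpy as np

    def var_knuth(a):
        # One-pass variance (Knuth/Welford), accumulating in the
        # array's own dtype so float32 stays float32 throughout.
        m = a.dtype.type(0)
        variance = a.dtype.type(0)
        for i in range(len(a)):
            delta = a[i] - m
            m += delta / (i + 1)
            variance += delta * (a[i] - m)
        return variance / len(a)

    data = np.random.random(10000).astype(np.float32)
    baseline = data.var(dtype=np.float64)  # high-precision reference
    print("builtin var error: ", abs(baseline - data.var()))
    print("one-pass var error:", abs(baseline - var_knuth(data)))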