Robert Kern wrote:
> David M. Cooke wrote:
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>> Let me offer a third path: the algorithms used for .mean() and .var()
>>> are substandard. There are much better incremental algorithms that
>>> entirely avoid the need to accumulate such large (and therefore
>>> precision-losing) intermediate values. The algorithms look like the
>>> following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>     m = a[0]
>>>     for i in range(1, len(a)):
>>>         m += (a[i] - m) / (i + 1)
>>>     return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>
> With one exception, every test that I've thrown at it shows that it's
> better for float32. That exception is uniformly spaced arrays, like
> linspace().
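The float32 failure mode at issue is easy to see in isolation. A float32
significand carries 24 bits, so once a running total reaches 2**24 the
spacing between adjacent representable values exceeds 1.0 and small
addends are rounded away entirely. (The 2**24 threshold here is just the
standard float32 property, not a number from anyone's tests.)

    from numpy import float32

    big = float32(2**24)           # 16777216.0; adjacent float32 values
                                   # are now 2.0 apart
    print big + float32(1) == big  # True: the added 1.0 is lost to rounding

The compensation term c in the Kahan sum below captures exactly this lost
low-order part and feeds it back in on the next iteration.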
Here's a version of mean, based on the Kahan sum and including the
compensation term, that seems to be consistently much better than the
builtin mean. It seems to be typically four orders of magnitude closer to
the "correct" answer than the builtin mean, and two orders of magnitude
better than naively using the Kahan sum alone. I'm using the mean computed
with dtype=float64 as the "correct" value.

def rawKahanSum(values):
    # Stolen from mathworld
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    # A running compensation for lost low-order bits.
    c = type(total)(0.0)
    for x in values[1:]:
        y = x - c            # So far, so good: c is zero.
        t = total + y        # Alas, total is big, y small, so low-order
                             # digits of y are lost.
        c = (t - total) - y  # (t - total) recovers the high-order part of y;
                             # subtracting y recovers -(low part of y).
        total = t            # Algebraically, c should always be zero.
                             # Beware eagerly optimising compilers!
        # Next time around, the lost low part will be added to y in a
        # fresh attempt.
    return total, c

def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total / n - c / n

from numpy import float32, float64, random

for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    # Flag any trial where the expected ordering (builtin worst,
    # compensated best) fails.
    if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
        print ">>>",
        print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean

-tim

>> You do avoid accumulating large sums, but then doing the division
>> a[i]/len(a) and adding that will do the same.
>
> Okay, this is true.
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparison to the
>> Kahan summation.
>>
>> (We could probably use a fast implementation of Kahan summation in
>> addition to a.sum())
>
> +1
>
>>> def var(a):
>>>     m = a[0]
>>>     t = a.dtype.type(0)
>>>     for i in range(1, len(a)):
>>>         q = a[i] - m
>>>         r = q / (i+1)
>>>         m += r
>>>         t += i * q * r
>>>     t /= len(a)
>>>     return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>>     m = a.dtype.type(0)
>>>     variance = a.dtype.type(0)
>>>     for i in range(len(a)):
>>>         delta = a[i] - m
>>>         m += delta / (i+1)
>>>         variance += delta * (a[i] - m)
>>>     variance /= len(a)
>>>     return variance
>>>
>> These formulas are good when you can only do one pass over the data
>> (like in a calculator where you don't store all the data points), but
>> are slightly worse than doing two passes. Kahan summation would
>> probably also be good here too.
>
> Again, my tests show otherwise for float32. I'll condense my ipython
> log into a module for everyone's perusal. It's possible that the Kahan
> summation of the squared residuals will work better than the current
> two-pass algorithm and the implementations I give above.
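For concreteness, here's an untested sketch of what Kahan summation of
the squared residuals could look like in a two-pass variance. The name
var_kahan is just mine, and it reuses the compensated mean() above; no
accuracy claims until it's been run against the same float32 tests.

def var_kahan(a):
    # Pass 1: compensated mean (the mean() defined above).
    m = mean(a)
    # Pass 2: Kahan-compensated sum of the squared residuals.
    s = c = a.dtype.type(0)
    for i in range(len(a)):
        d = a[i] - m
        y = d * d - c    # fold the previous compensation into the new term
        t = s + y        # s is big, y small: low-order digits of y are lost
        c = (t - s) - y  # recover -(lost low part of y)
        s = t
    return s / len(a)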