On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg <[EMAIL PROTECTED]> wrote:
> Tim Hochberg wrote:
>> Robert Kern wrote:
>>> David M. Cooke wrote:
>>>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>>>> Let me offer a third path: the algorithms used for .mean() and .var()
>>>>> are substandard. There are much better incremental algorithms that
>>>>> entirely avoid the need to accumulate such large (and therefore
>>>>> precision-losing) intermediate values. The algorithms look like the
>>>>> following for 1D arrays in Python:
>>>>>
>>>>> def mean(a):
>>>>>     m = a[0]
>>>>>     for i in range(1, len(a)):
>>>>>         m += (a[i] - m) / (i + 1)
>>>>>     return m
>>>>
>>>> This isn't really going to be any better than using a simple sum.
>>>> It'll also be slower (a division per iteration).
>>>
>>> With one exception, every test that I've thrown at it shows that it's
>>> better for float32. That exception is uniformly spaced arrays, like
>>> linspace().
>>>
>>>> You do avoid accumulating large sums, but then doing the division
>>>> a[i]/len(a) and adding that will do the same.
>>>
>>> Okay, this is true.
>>>
>>>> Now, if you want to avoid losing precision, you want to use a better
>>>> summation technique, like compensated (or Kahan) summation:
>>>>
>>>> def mean(a):
>>>>     s = e = a.dtype.type(0)
>>>>     for i in range(0, len(a)):
>>>>         temp = s
>>>>         y = a[i] + e
>>>>         s = temp + y
>>>>         e = (temp - s) + y
>>>>     return s / len(a)
>>>>
>>>>> def var(a):
>>>>>     m = a[0]
>>>>>     t = a.dtype.type(0)
>>>>>     for i in range(1, len(a)):
>>>>>         q = a[i] - m
>>>>>         r = q / (i + 1)
>>>>>         m += r
>>>>>         t += i * q * r
>>>>>     t /= len(a)
>>>>>     return t
>>>>>
>>>>> Alternatively, from Knuth:
>>>>>
>>>>> def var_knuth(a):
>>>>>     m = a.dtype.type(0)
>>>>>     variance = a.dtype.type(0)
>>>>>     for i in range(len(a)):
>>>>>         delta = a[i] - m
>>>>>         m += delta / (i + 1)
>>>>>         variance += delta * (a[i] - m)
>>>>>     variance /= len(a)
>>>>>     return variance
>
> I'm going to go ahead and attach a module containing the versions of
> mean, var, etc. that I've been playing with in case someone wants to
> mess with them. Some were stolen from traffic on this list; for others
> I grabbed the algorithms from wikipedia or equivalent.

I looked into this a bit more. I checked float32 (single precision) and
float64 (double precision), using long doubles (float96) for the "exact"
results. This is based on your code. Results are compared using
abs(exact_stat - computed_stat) / max(abs(values)), with 10000 values in
the range [-100, 900].

First, the mean. In float32, Kahan summation in single precision is
better by about 2 orders of magnitude than simple summation. However,
accumulating the sum in double precision is better by about 9 orders of
magnitude than simple summation (7 orders more than Kahan). In float64,
Kahan summation is the way to go, by 2 orders of magnitude.

For the variance, in float32, Knuth's method is *no better* than the
two-pass method. Tim's code does an implicit conversion of intermediate
results to float64, which is why he saw a much better result. The
two-pass method using Kahan summation (again, in single precision) is
better by about 2 orders of magnitude. There is practically no
difference amongst the techniques when using a double-precision
accumulator: they're all about 9 orders of magnitude better than
single-precision two-pass. In float64, Kahan summation is again better
than the rest, by about 2 orders of magnitude.
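For concreteness, the comparison works roughly like this (a minimal
sketch only, not the actual script at the URL below; the err() helper
and the variable names are mine for illustration):

import numpy as np

def err(exact_stat, computed_stat, values):
    # Error measure described above: |exact - computed|, scaled by the
    # largest magnitude in the data.
    return abs(exact_stat - computed_stat) / abs(values).max()

# 10000 values in the range [-100, 900], stored in single precision.
values = np.random.uniform(-100, 900, 10000).astype(np.float32)

# Long doubles (np.longdouble; float96 on my machine) computed from the
# same data give the "exact" reference statistics.
exact = values.astype(np.longdouble)

# e.g., compare a plain float32 mean against the long-double mean.
mean_error = err(exact.mean(), values.mean(), exact)

The same measure is applied to each mean and variance implementation in
turn.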
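The two recipes that come out on top look roughly like this (again a
sketch, not the attached module; kahan_sum follows the compensated
summation quoted above, and the function names are just illustrative):

import numpy as np

def kahan_sum(a):
    # Compensated (Kahan) summation, carried out in the array's own
    # precision; e holds the running rounding error.
    s = e = a.dtype.type(0)
    for x in a:
        y = x + e
        t = s + y
        e = (s - t) + y
        s = t
    return s

def var_twopass_kahan(a):
    # Two-pass variance: first the mean, then the averaged squared
    # deviations, with Kahan summation in both passes. Everything stays
    # in the input precision (float32 for the numbers quoted above).
    n = a.dtype.type(len(a))
    m = kahan_sum(a) / n
    return kahan_sum((a - m) ** 2) / n

def var_twopass_double(a):
    # Same two-pass method, but with simple summation into a
    # double-precision accumulator: intermediates are carried in
    # float64 regardless of the input dtype.
    m = a.sum(dtype=np.float64) / len(a)
    return ((a - m) ** 2).sum(dtype=np.float64) / len(a)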
I've put my adaptation of Tim's code, and box-and-whisker plots of the
results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/

Conclusions:

- If you're going to calculate everything in single precision, use Kahan
  summation. Using it in double precision also helps.
- If you can use a double-precision accumulator, it's much better than
  any of the techniques in single precision only.
- For speed + precision in the variance, either use Kahan summation in
  single precision with the two-pass method, or use double precision
  with simple summation and the two-pass method. Knuth buys you nothing,
  except slower code :-)

After 1.0 is out, we should look at doing one of the above.

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|[EMAIL PROTECTED]