Al Chou wrote:
--- Phil Steitz <[EMAIL PROTECTED]> wrote:

Al Chou wrote:

Date: Wed, 04 Jun 2003 21:05:14 -0700
From: Phil Steitz <[EMAIL PROTECTED]>
Subject: [math] more improvement to storage free mean, variance computation

Check out procedure sum.2 and var.2 in

http://www.stanford.edu/~glynn/PDF/0208.pdf

The first looks like Brent's suggestion for a corrected mean computation, with no memory required. The additional computational cost that I complained about is docuemented to be 3x the flops cost of the direct computation, but the computation is claimed to be more stable. So the question is: do we pay the flops cost to get the numerical stability? The example in the paper is compelling; but it uses small words (err, numbers I mean -- sorry, slipped in to my native Fortran for a moment there ;-)). So how do we go about deciding whether the stability in the mean computation is worth the increased computational effort? I would prefer not to answer "let the user decide". To make the decision harder, we should note that it is actually worse than 3x, since in the no storage version, the user may request the mean only rarely (if at all) and the 3x comparison is against computiing the mean for each value added.

The variance formula looks better than what we have now, still requiring no memory. Should we implement this for the no storage case?


After implementing var.2 from the Stanford paper in UnivariateImpl and
scratching my head for some time over why the variance calculation failed

its


JUnit test case, I realized there's a flaw in var.2 that I can't understand

no


one talks about. To update the variance (called S in the paper), the

formula


calculates

z = y / i
S = S + (i-1) * y * z

where i is the number of data values (including the value just being added

to


the collection). It doesn't really matter how y is defined, because you

will


notice that

S = S + (i-1) * y * y / i
 = S + (i-1) * y**2 / i

which means that S can never decrease in magnitude (for real data, which is
what we're talking about).  But for the simple case of three data values

{1, 2,


2} in the JUnit test case, the variance decreases between the addition of

the


second and third data values.

Can anyone point out what I'm missing here?



I think that is OK, since if you look at the definition of S earlier in the paper, S is not the variance, it is the sum of the squared deviations from the mean. This should be always increasing.


Where is that definition?  I'm looking at equations 3 and 4, which define
S_{1,q} (in LaTeX notation), and the return statement in algorithm Procedure
var.2, which says S_{1,q} = S.

Equation 3 defines S to be the sum of squared differences between L_i and L-bar, which are defined on p. 209 to be the observed values and their mean.



Anyway, I think the resolution is contained in messages to follow shortly.



Al


=====
Albert Davidson Chou

Get answers to Mac questions at http://www.Mac-Mgrs.org/ .

__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to