(Posted to Jay Warner and to edstat.)
Jay Warner asked for data that would produce a negative sum of squares.
An example is provided after my .sig, below. Comments are embedded in
part of Jay's post:
On Thu, 16 Oct 2003, Jay Warner wrote in part:
> But when we get to negative sums of squares, I think we have a very
> serious issue on our hands. Mathematically negative sums of squares
> are impossible, so we have an issue of digital execution to deal with.
> This is what Excel is _supposed_ to do well.
Yes, well, digital execution is one thing; the precision of the results
of that execution is something else again.
> Would someone post a sample dataset that produces this error?
Yup. See below.
> Can we have a discussion of what properties of this dataset lead to
> the errors?
Yes. For details, see my 1969 Ph.D. thesis, "Computer-generated
errors in statistical analysis" (Cornell University); or the article
based on that thesis in the first volume of the Journal of Statistical
Computation and Simulation, about 1972 or so. Some remarks appear
below, along with the MINITAB output.
> What should a user, forced or allowed to live with Excel,
> look out for to avoid falling into the same trap?
< snip, paragraph re AoV in Excel >
> Show me some data that gets to negative sums of squares.
>
> And BTW, if the 'computational' form of the variance is so poor, what
> alternatives are available?
Simplest: store the first observation (as a vector, if, as is usual,
there are multiple variables). Subtract it from each subsequent
observation, and use the 'computational' form on these deviations from
the first observation. Because zero is now within the range of the
values to be summed and squared (details in the thesis and paper cited
above), the loss in precision will be at most 1.5 decimal digits (and
that only if the first observation is at the remote extreme of a skewed
distribution).
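The remedy can be sketched in Python (not MINITAB; the function name is
mine, and the data are chosen to match the C5 example further down):

```python
# Sketch of the "subtract the first observation" remedy described above.
# The computational (one-pass) formula is applied to deviations
# d_i = x_i - x_1, which are small even when the x_i themselves are huge.

def shifted_variance(xs):
    """Sample variance: computational formula on deviations from xs[0]."""
    n = len(xs)
    shift = xs[0]
    s = 0.0   # running sum of the deviations
    ss = 0.0  # running sum of the squared deviations
    for x in xs:
        d = x - shift
        s += d
        ss += d * d
    return (ss - s * s / n) / (n - 1)

# Data needing 11 digits of precision, like C5 in the session below:
data = [float(10_000_000_000 + k) for k in range(1, 8)]
print(shifted_variance(data))  # 4.666..., the correct sample variance
```

Since the deviations here are the small integers 0-7, every intermediate
quantity is exact in floating point, and no precision is lost at all.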
> The definition form is subject to truncation error.
Only to truncation error in the current estimate of the mean. For
data reported only to 3-digit precision (which I believe accounts for
most data sets of interest), the effect on the deviation from the mean
will be discernible only four or five digits further on (in single
precision arithmetic), and the effect on the estimate of the mean will
be negligible. It's when the DATA require high precision that one gets
into trouble. See the numerical example in MINITAB, below my .sig.
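To put a rough number on that claim, here is a quick single-precision
check (a Python sketch of my own, with hypothetical 3-digit data; Python
floats are double precision, so single precision is emulated by
round-tripping through a 4-byte float):

```python
import struct

def f32(x):
    """Round a double-precision value to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Data reported to 3-digit precision (hypothetical values):
data = [1.23, 4.56, 7.89, 2.46]

# Mean accumulated entirely in single precision:
s = 0.0
for x in data:
    s = f32(s + f32(x))
mean32 = f32(s / len(data))

mean64 = sum(data) / len(data)  # double-precision reference
# The discrepancy shows up only six or seven significant digits in,
# far beyond the 3 digits actually carried by the data.
print(abs(mean32 - mean64))
```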
> Many of my datasets exhibit more truncation error than anything else,
> even with my Excel's (apparent) 14 digit precision. If we all went to
> 64 bit chips & 64 bit code, would this issue 'go away'?
Only for data whose precision requirements did not exceed whatever the
actual precision is in 64-bit chips. (As I recall, the big CDC
machines of 30+ years ago used 60-bit words, and their floating-point
format carried a mantissa of 48 bits.)
And in any case, I believe the vaunted 14-digit precision describes only
the value as reported in a result. If the arithmetic has been carried
out in the host computer's single-precision arithmetic, the result
reported cannot be precise to more than 6 or 7 digits: the rest is a
figment of the computer's imagination (well, more likely of the process
of converting values represented internally in binary digits to values
reported in decimal digits). If you could see the binary (or octal, or
hexadecimal) equivalents before conversion, you'd be struck either by
the shortness of the numeral (24 bits, or 6 hexadecimal digits) or by
the string of 0's at the end that pad the reported value out to the
equivalent of 14 digits.
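The point about single precision can be made concrete (a Python sketch
of my own; single precision is again emulated via a 4-byte round trip):

```python
import struct

def to_single(x):
    """Round a double-precision value to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# A 24-bit significand is worth about 24 * log10(2) = 7.2 decimal
# digits, so values differing only in the 11th digit collapse to the
# same stored number:
a = to_single(10_000_000_003.0)
b = to_single(10_000_000_004.0)
print(a == b)  # True: single precision cannot tell them apart
```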
-----------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
56 Sebbins Pond Drive, Bedford, NH 03110 (603) 626-0816
-----------------------------------------------------------------------
Here are results of a MINITAB session. Original data (in C1) were the
integers 1-7. C2 = C1 + 1000, C3 = C1 + 100000, C4 = C1 + 10000000,
and C5 = C1 + 10000000000. C1 requires one (1) decimal digit to express
the data; C2 requires 4; C3 requires 6; C4 requires 8; and C5 requires
11, which is beyond MINITAB's capacity (on my HP Pavilion) to represent.
MTB > note Pursuit of precision of calculated variance.
MTB > let c5 = c1 + 10000000000
MTB > print c1-c5
Row C1 C2 C3 C4 C5
1 1 1001 100001 10000001 1.00000E+10
2 2 1002 100002 10000002 1.00000E+10
3 3 1003 100003 10000003 1.00000E+10
4 4 1004 100004 10000004 1.00000E+10
5 5 1005 100005 10000005 1.00000E+10
6 6 1006 100006 10000006 1.00000E+10
7 7 1007 100007 10000007 1.00000E+10
MTB > describe c1-c5
Descriptive Statistics: C1, C2, C3, C4, C5
Variable N Mean Median TrMean StDev SE Mean
C1 7 4.000 4.000 4.000 2.160 0.816
C2 7 1004.0 1004.0 1004.0 2.2 0.8
C3 7 100004 100004 100004 2 1
C4 7 10000004 10000004 10000004 2 1
C5 7 1.0000E+10 1.0000E+10 1.0000E+10 2 1
Variable Minimum Maximum Q1 Q3
C1 1.000 7.000 2.000 6.000
C2 1001.0 1007.0 1002.0 1006.0
C3 100001 100007 100002 100006
C4 10000001 10000007 10000002 10000006
C5 1.0000E+10 1.0000E+10 1.0000E+10 1.0000E+10
Means, medians, and order statistics are reported correctly for C1-C4,
but not for C5. The standard deviation (square root of 4.6666666667, or
2.160247) is reported correctly (to 4 decimal digits) for C1, and is
also correct for C2 (but only to 2 digits). For C3-C5, only 1 digit is
reported (but at least it's the RIGHT one digit!).
Below are the results of using the "computational formula" on each
variable. (I didn't bother taking the square root; these are
variances.) For C1-C4 the variances (reported in K1-K4) are correct.
For C5 the reported variance is negative; this reflects MINITAB's
inability to express the original data precisely enough to distinguish
between 10000000003 and 10000000004 (e.g.), let alone between the
squares of those values. So the two very large numbers that are
subtracted one from the other in the computation are both somewhat
fictitious; I do not know why in these circumstances the second
incorrect value should exceed the first, but I found that routinely to
be the case whenever data exceeded the precision of the computer; and
for some reason the effect was more pronounced the greater the precision
demanded by the data.
MTB > let k1 = (SSQ(C1) - (SUM(C1)**2/7))/6
MTB > let k2 = (SSQ(C2) - (SUM(C2)**2/7))/6
MTB > let k3 = (SSQ(C3) - (SUM(C3)**2/7))/6
MTB > let k4 = (SSQ(C4) - (SUM(C4)**2/7))/6
MTB > let k5 = (SSQ(C5) - (SUM(C5)**2/7))/6
MTB > print k1-k5
K1 4.66667
K2 4.66667
K3 4.66667
K4 4.66667
K5 -21845.3
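For what it's worth, the breakdown can be reproduced outside MINITAB.
In the Python sketch below (function names are mine), even
double-precision arithmetic botches the computational formula on the
C5 data, while the definitional (two-pass) form sails through:

```python
# Sketch (mine, not MINITAB's code): the one-pass "computational"
# formula vs. the two-pass definitional formula on data needing
# 11 digits of precision.

def naive_variance(xs):
    """Computational formula (SSQ - SUM^2/n)/(n-1): here SSQ and
    SUM^2/n agree in roughly their first ten digits, so nearly all
    significant figures cancel in the subtraction."""
    n = len(xs)
    ssq = sum(x * x for x in xs)
    return (ssq - sum(xs) ** 2 / n) / (n - 1)

def two_pass_variance(xs):
    """Definitional formula: squared deviations from the mean."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

data = [float(10_000_000_000 + k) for k in range(1, 8)]  # MINITAB's C5
print(two_pass_variance(data))  # 4.666..., correct
print(naive_variance(data))     # wildly wrong, possibly negative
```

Note, by the way, that 6 x 21845.3 = 131072 = 2^17, which is exactly
the spacing of representable double-precision numbers near 7E+20 (the
magnitude of both SSQ and SUM^2/7 for C5): MINITAB's negative variance
is one unit of that spacing, gone the wrong way.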