Jean Bréfort wrote:
One other, totally unrelated thing: we recently got a bug report about an
incorrect R squared in Gnumeric's regression code
(http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
gives the same result as Gnumeric, as can be seen below:
> mydata <- read.csv(file="data.csv", sep=",")
> mydata
X Y
1 1 2
2 2 4
3 3 5
4 4 8
5 5 0
6 6 7
7 7 8
8 8 9
9 9 10
> summary(lm(mydata$Y ~ mydata$X))
Call:
lm(formula = mydata$Y ~ mydata$X)
Residuals:
Min 1Q Median 3Q Max
-5.8889 0.2444 0.5111 0.7111 2.9778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5556 1.8587 0.837 0.4303
mydata$X 0.8667 0.3303 2.624 0.0342 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.559 on 7 degrees of freedom
Multiple R-squared: 0.4958, Adjusted R-squared: 0.4238
F-statistic: 6.885 on 1 and 7 DF, p-value: 0.03422
> summary(lm(mydata$Y ~ mydata$X - 1))
Call:
lm(formula = mydata$Y ~ mydata$X - 1)
Residuals:
Min 1Q Median 3Q Max
-5.5614 0.1018 0.3263 1.6632 3.5509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
mydata$X 1.1123 0.1487 7.481 7.06e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.51 on 8 degrees of freedom
Multiple R-squared: 0.8749, Adjusted R-squared: 0.8593
F-statistic: 55.96 on 1 and 8 DF, p-value: 7.056e-05
I am unable to figure out what this 0.8749 value might represent. If it
is intended to be the Pearson moment, it should be 0.4958, and if it is
the coefficient of determination, I think the correct value would be
0.4454, as given by Excel. It's of course nice to have the same result in
R and Gnumeric, but it would be better if this result were accurate (and
if it is, we need a documentation fix). Btw, I am not a statistics expert
at all.
This horse has been flogged multiple times on the list.
It is of course mainly a matter of convention, but the convention used
by R has been around at least since Genstat in the mid-1970s. In the
no-intercept case, you get the _uncentered_ version of R-squared; that
is, the proportion of the sum of squares explained by the model (as
opposed to the sum of squares of _deviations_ from the mean, as in the
usual case). The rationale is that R^2 should be based on the reduction
in residual variation between two nested models, and if there's no
intercept, the only well-determined nested submodel is the one where
mydata$Y has mean zero for all x, corresponding to all-zero regression
coefficients. The
resulting R^2 is directly related to the F statistic, which you'll see
is also larger and more significant when the intercept is removed.
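The distinction can be checked by hand with the data above. Both quantities are one minus the residual sum of squares over a baseline sum of squares; only the baseline differs (a minimal sketch in plain R; the variable names are mine):

```r
# Data from the bug report (X = 1:9)
x <- 1:9
y <- c(2, 4, 5, 8, 0, 7, 8, 9, 10)

# With an intercept, the baseline is the mean-only model:
# R^2 = 1 - RSS / sum((y - mean(y))^2)
fit1 <- lm(y ~ x)
r2_centered <- 1 - sum(resid(fit1)^2) / sum((y - mean(y))^2)

# Without an intercept, the baseline is the all-zero model,
# so the denominator is the raw (uncentered) sum of squares:
# R^2 = 1 - RSS / sum(y^2)
fit0 <- lm(y ~ x - 1)
r2_uncentered <- 1 - sum(resid(fit0)^2) / sum(y^2)

round(r2_centered, 4)    # 0.4958, as in summary(fit1)
round(r2_uncentered, 4)  # 0.8749, as in summary(fit0)
```

These are exactly the values summary.lm reports as "Multiple R-squared" in the two cases.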
BTW: lm(mydata$Y~mydata$X) is bad practice; use lm(Y~X, data=mydata).
Use of predict() will demonstrate why.
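The predict() pitfall alluded to here can be sketched as follows (the data frame mirrors the one in the report):

```r
mydata <- data.frame(X = 1:9, Y = c(2, 4, 5, 8, 0, 7, 8, 9, 10))

# Recommended form: the formula names columns, data= supplies them
good <- lm(Y ~ X, data = mydata)
predict(good, newdata = data.frame(X = c(10, 11)))  # two predictions, as expected

# Discouraged form: the vectors are baked into the formula, so
# predict() cannot find 'mydata$X' among the columns of newdata and
# falls back to the original data (newer R versions at least warn)
bad <- lm(mydata$Y ~ mydata$X)
predict(bad, newdata = data.frame(X = c(10, 11)))   # nine fitted values, not two
```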
--
O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel