[Rd] R and Gnumeric

2008-06-09 Thread Jean Bréfort
Hi,

I just read the Embedding R in Gnumeric idea at
http://www.r-project.org/SoC08/ideas.html. On my side, I intend to add
as many statistics-related plot types to the current Gnumeric charting
engine as possible. We already have boxplots and partial support for
histograms. My immediate plans are to finish the histogram code and add
probability plots (http://bugzilla.gnome.org/show_bug.cgi?id=500168)
during the summer if time permits (importing some code from R).
For the future, I see two options: either add all necessary plot types
to the Gnumeric charting engine, or embed R charts directly using
either a new SheetObject class or the goffice component system (which
would allow inserting these charts in AbiWord as well).

One other, totally unrelated thing: we recently got a bug report about an
incorrect R squared in the Gnumeric regression code
(http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
gives the same result as Gnumeric, as can be seen below:

> mydata <- read.csv(file="data.csv",sep=",")
> mydata
  X  Y
1 1  2
2 2  4
3 3  5
4 4  8
5 5  0
6 6  7
7 7  8
8 8  9
9 9 10
> summary(lm(mydata$Y~mydata$X))

Call:
lm(formula = mydata$Y ~ mydata$X)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8889  0.2444  0.5111  0.7111  2.9778 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.5556     1.8587   0.837   0.4303  
mydata$X      0.8667     0.3303   2.624   0.0342 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.559 on 7 degrees of freedom
Multiple R-squared: 0.4958, Adjusted R-squared: 0.4238 
F-statistic: 6.885 on 1 and 7 DF,  p-value: 0.03422 

> summary(lm(mydata$Y~mydata$X-1))

Call:
lm(formula = mydata$Y ~ mydata$X - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5614  0.1018  0.3263  1.6632  3.5509 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
mydata$X   1.1123     0.1487   7.481 7.06e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.51 on 8 degrees of freedom
Multiple R-squared: 0.8749, Adjusted R-squared: 0.8593 
F-statistic: 55.96 on 1 and 8 DF,  p-value: 7.056e-05 

I am unable to figure out what this 0.8749 value might represent. If it
is intended to be the square of the Pearson product-moment correlation,
it should be 0.4958, and if it is the coefficient of determination, I
think the correct value would be 0.4454, as given by Excel. It is of
course nice to have the same result in R and Gnumeric, but it would be
better if this result were accurate (if it is, we need some
documentation fix). Btw, I am not a statistics expert at all.

Best regards,
Jean Brefort

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R and Gnumeric

2008-06-09 Thread Peter Dalgaard
Jean Bréfort wrote:
> One other, totally unrelated thing: we recently got a bug report about an
> incorrect R squared in the Gnumeric regression code
> (http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
> gives the same result as Gnumeric, as can be seen below:
>
> [...]
>
> I am unable to figure out what this 0.8749 value might represent. If it
> is intended to be the square of the Pearson product-moment correlation,
> it should be 0.4958, and if it is the coefficient of determination, I
> think the correct value would be 0.4454, as given by Excel. It is of
> course nice to have the same result in R and Gnumeric, but it would be
> better if this result were accurate (if it is, we need some
> documentation fix). Btw, I am not a statistics expert at all.

This horse has been flogged multiple times on the list.

It is of course mainly a matter of convention, but the convention used
by R has been around at least since Genstat in the mid-1970s. In the
no-intercept case, you get the _uncentered_ version of R-squared; that
is, the proportion of the sum of squares explained by the model (as
opposed to the sum of squares of _deviations_ from the mean in the usual
case). The rationale is that the R^2 should be based on a reduction in
residual variation between two nested models, and if there's no
intercept, the only well-determined nested model is the one where
mydata$Y has mean zero for all x, corresponding to all-zero regression
coefficients. The resulting R^2 is directly related to the F statistic,
which you'll see is also larger and more significant when the intercept
is removed.
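
The three numbers discussed in this thread can be reproduced by hand. What
follows is a minimal sketch, assuming the nine (X, Y) pairs printed in the
original message are typed in directly instead of being read from
data.csv; the object names (fit0, rss, r2_uncentered, r2_centered, f_stat)
are illustrative only.

```r
# Data as printed in the original message (typed in, not read from data.csv).
mydata <- data.frame(X = 1:9, Y = c(2, 4, 5, 8, 0, 7, 8, 9, 10))

fit0 <- lm(Y ~ X - 1, data = mydata)   # no-intercept model
rss  <- sum(residuals(fit0)^2)         # residual sum of squares

# Uncentered R^2: the baseline model is Y = 0, so the total is sum(Y^2).
r2_uncentered <- 1 - rss / sum(mydata$Y^2)

# Centered R^2: the baseline model is Y = mean(Y). Applied to the
# no-intercept fit, this gives the value attributed to Excel in the thread.
r2_centered <- 1 - rss / sum((mydata$Y - mean(mydata$Y))^2)

# F statistic of the no-intercept fit, recovered from the uncentered R^2
# (1 numerator degree of freedom, df.residual(fit0) = 8 in the denominator).
f_stat <- r2_uncentered / ((1 - r2_uncentered) / df.residual(fit0))

print(round(c(uncentered = r2_uncentered, centered = r2_centered), 4))
print(round(f_stat, 2))
print(summary(fit0)$r.squared)         # agrees with r2_uncentered
```

Rounded to four places, the uncentered value is 0.8749 and the centered one
is 0.4454, and the F statistic comes out as 55.96, matching the summary()
output quoted above.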

BTW: lm(mydata$Y~mydata$X) is bad practice; use lm(Y~X, data=mydata).
Use of predict() will demonstrate why.
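
A small sketch of the predict() problem, again assuming the data are typed
in directly; the newdata frame and the names fit_good, fit_bad, pred_good
and pred_bad are hypothetical and chosen only for illustration.

```r
mydata  <- data.frame(X = 1:9, Y = c(2, 4, 5, 8, 0, 7, 8, 9, 10))
newdata <- data.frame(X = c(10, 11))   # hypothetical new X values

# Formula written with column names: predict() looks up X in newdata.
fit_good  <- lm(Y ~ X, data = mydata)
pred_good <- predict(fit_good, newdata = newdata)

# Formula written as mydata$Y ~ mydata$X: the model term is the literal
# expression mydata$X, so the X column of newdata is never consulted and
# predictions for the nine original rows come back instead. R warns about
# the row-count mismatch; the warning is suppressed here to keep output short.
fit_bad  <- lm(mydata$Y ~ mydata$X)
pred_bad <- suppressWarnings(predict(fit_bad, newdata = newdata))

print(length(pred_good))   # 2, one prediction per row of newdata
print(length(pred_bad))    # 9, the original rows, not the new ones
```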

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907
