Re: [R] Interesting behavior of lm() with small, problematic data sets

David Winsemius Tue, 05 Sep 2017 09:29:30 -0700

> On Sep 5, 2017, at 6:24 AM, Glover, Tim <tim.glo...@amecfw.com> wrote:
> 
> I've recently come across the following results reported from the lm() 
> function when applied to a particular type of admittedly difficult data.  
> When working with small data sets (for instance 3 points) that have the 
> same response for different values of the predictor variable, the resulting 
> slope estimate is a reasonable approximation of the expected 0.0, but the 
> p-value of that slope estimate is a surprising value.  A reproducible 
> example is included below, along with the output of the summary of results.
> 
> ######### example code
> x <- c(1,2,3)
> y <- c(1,1,1)
> 
> # the above gives the data set {(1,1), (2,1), (3,1)} to regress
> 
> new.rez <- lm(y ~ x) # regress constant y on changing x
> summary(new.rez) # display results of regression
> 
> ######## end of example code
> 
> Results:
> 
> Call:
> lm(formula = y ~ x)
> 
> Residuals:
>         1          2          3
> 5.906e-17 -1.181e-16  5.906e-17
> 
> Coefficients:
>              Estimate Std. Error    t value Pr(>|t|)
> (Intercept)  1.000e+00  2.210e-16  4.525e+15   <2e-16 ***
> x           -1.772e-16  1.023e-16 -1.732e+00    0.333
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 1.447e-16 on 1 degrees of freedom
> Multiple R-squared:  0.7794,    Adjusted R-squared:  0.5589
> F-statistic: 3.534 on 1 and 1 DF,  p-value: 0.3112
> 
> Warning message:
> In summary.lm(new.rez) : essentially perfect fit: summary may be unreliable
> 
> 
> ##############
> 
> There is a warning that the summary may be unreliable due to the essentially 
> perfect fit, but a p-value of 0.3112 doesn’t seem reasonable.
> As a side note, the various r^2 values seem odd too.


You have an overfitted model with only 3 perfectly fit-able data points and you 
are whinging about a Wald statistic about which you were warned. I think you 
are wasting our time. (But I'm fully retired and I have a lot of time to waste.)

I seem to remember that a t-distribution with 1 degree of freedom is actually 
the Cauchy distribution. I would point out that you can also get:

> 2*pt(-1.732e+00, 1)
[1] 0.3333414
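
A quick way to check that recollection, as a rough sketch only: compare the two-sided tail area from pt() with df = 1 against the same tail area from pcauchy(). The variable names (t_val, p_t, p_cauchy) are made up for illustration, and t_val is just the rounded t value copied from the quoted summary.

```r
# t-statistic for the slope, copied (rounded) from the quoted summary output
t_val <- -1.732

# Two-sided p-value from the t distribution with 1 degree of freedom
p_t <- 2 * pt(t_val, df = 1)

# The same tail area from the standard Cauchy distribution
p_cauchy <- 2 * pcauchy(t_val)

# The two should agree up to numerical tolerance
all.equal(p_t, p_cauchy)
```

If the two agree, the 0.333 in the coefficient table is simply the Cauchy tail area at that t value.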

So maybe from that perspective almost any value might be "reasonable": with 
that number of data points you have only one degree of freedom, and the 
t-statistic you are looking at is essentially the ratio 0/0 from a numerical 
point of view.
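
To see the 0/0 point concretely, here is a minimal sketch that pulls the slope estimate and its standard error out of the fit and sets them next to machine epsilon. The exact numbers are rounding noise and may differ by platform and R version, so none are shown; the names fit, cf, slope_est and slope_se are illustrative only.

```r
# Same data as in the original post
x <- c(1, 2, 3)
y <- c(1, 1, 1)

fit <- lm(y ~ x)

# summary() warns about the essentially perfect fit; silence it for this check
cf <- suppressWarnings(coef(summary(fit)))

slope_est <- cf["x", "Estimate"]
slope_se  <- cf["x", "Std. Error"]

# Both numerator and denominator of the t value are at the scale of
# floating-point rounding error, so their ratio carries no information
c(estimate = slope_est, std.error = slope_se, eps = .Machine$double.eps)
```

Whatever t value comes out of dividing those two is an artifact of rounding, which is what the warning from summary.lm() is about.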

-- 
David.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
