Radford Neal wrote in news:[EMAIL PROTECTED]:

>>Cornilia wrote in
>>news:[EMAIL PROTECTED]: 
>>
>>> I have a training data set, and I want to obtain the LOOCV error
>>> rate for a linear regression model. How can I implement this in R or
>>> S-Plus? I can use for loop and fit linear models n times, with one
>>> row out each time. My main problem is that I don't know how to leave
>>> one row out of my data set in lm function within the for loop.
>>> 
>>> It might look like: 
>>> for (i in 1:n) {
>>>      fitcv<-lm(y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9,data=train,
> 
> Just using data=train[-i,] ought to work.  I don't know how efficient
> it is.
> 
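For completeness, the loop that suggestion implies might look like the
following (my sketch, assuming train is a data frame with the response y
and the predictors V1 through V9):

   n <- nrow(train)
   cv.err <- numeric(n)
   for (i in 1:n) {
        # fit to all rows except row i
        fitcv <- lm(y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9, data = train[-i, ])
        # score the held-out row against that fit
        pred <- predict(fitcv, newdata = train[i, , drop = FALSE])
        cv.err[i] <- (train$y[i] - pred)^2
   }
   mean(cv.err)   # LOOCV estimate of mean squared prediction error
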
> In article <[EMAIL PROTECTED]>,
> David Winsemius  <[EMAIL PROTECTED]> wrote:
> 
>>Not sure what your acronym means, but it sounds as though you are
>>doing a jack-knife analysis. Why not do a real bootstrap analysis? If
>>you are already using R, it should not be difficult to find the boot
>>package. I think it is in the default 1.8.1 distribution. You would
>>bring it into the workspace with library("boot")
> 
> I've encountered suggestions to use bootstrap in circumstances such as
> this before, but I've never understood them.  The bootstrap samples
> will clearly violate the assumption of independent residuals that
> underlies the usual regression model. 

Why should the validation method leave out exactly one instance during each 
validation run? My reading indicates that LOOCV performs relatively poorly 
in comparison with k-fold CV or bootstrap methods. Although LOOCV estimates 
of prediction error are nearly unbiased, they suffer from higher variance 
than competing methods, so users of LOOCV are giving up efficiency.

The method proposed by the original questioner still looks like a 
jackknife rather than what I now understand LOOCV to mean after searching 
the web. The model is fixed, so there is no way for model misspecification 
to be identified. 

At any rate, both LOOCV and k-fold CV can be implemented with the boot 
package for R that I offered the questioner:
http://www.math.mcgill.ca/sysdocs/R/library/boot/html/cv.glm.html
Harrell's Design library also has a validate.lrm function, which defaults 
to the bootstrap but can also be set to cross-validation.
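For example (a minimal sketch, again assuming the questioner's train data 
frame; a Gaussian glm fit is equivalent to lm here, and leaving K at its 
default of n gives LOOCV):

   library(boot)
   fit <- glm(y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9, data = train)
   cv.loo <- cv.glm(train, fit)           # K defaults to n, i.e. LOOCV
   cv.k10 <- cv.glm(train, fit, K = 10)   # 10-fold CV for comparison
   cv.loo$delta   # raw and adjusted estimates of prediction error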

> The bootstrap samples will also
> have less diverse values for the predictor variables.  So it seems to
> me that the bootstrap results will NOT be a good guide to what is
> going on with the actual sample.

As I understand statistics, the goal is to make plausible statements 
about what is likely in the world *outside* the sample. Bootstrap methods 
use the joint distribution of the measured features of the sample to 
create a plausible larger (resampled) world. It seems to be a realization 
of the concept of exchangeability.
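As an illustration of that idea (my sketch, not a prescription, and again 
assuming the questioner's train data frame), case resampling with boot 
treats the rows as exchangeable and refits the regression to each 
resampled "world":

   library(boot)
   # statistic: refit the regression to a bootstrap resample of the rows
   coef.fun <- function(data, indices) {
       coef(lm(y ~ V1+V2+V3+V4+V5+V6+V7+V8+V9, data = data[indices, ]))
   }
   b <- boot(train, coef.fun, R = 999)
   b   # bootstrap bias and standard error for each coefficient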
 
> The poster's use of leave-one-out cross validation seems more sensible
> to me. 

Each person must determine what "makes sense". I have been relying on 
results of the simulation tests in Efron and Gong and in Efron and 
Tibshirani. You can certainly make your choice on the basis of theory. 
Given your far greater authority in this arena, I may learn quite a bit 
from your response.

-- 
David Winsemius