[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

Balise, Raymond R Mon, 27 Dec 2021 06:20:58 -0800

Hello R folks,
Today I noticed that using the subset argument in lm() with a polynomial term 
gives a different result than fitting the same polynomial to data that has 
already been subsetted. This was not at all intuitive to me. You can see an 
example here: 
https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
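
A minimal sketch of the behavior, assuming the built-in mtcars data, a degree-2 poly() term, and an arbitrary am == 1 split (illustrative choices, not the example from the linked question):

```r
# Illustrative data only: mtcars ships with R; the am == 1 split is arbitrary.

# subset argument: poly(wt, 2) is evaluated on all rows of mtcars,
# and only afterwards are the rows with am != 1 dropped.
fit_arg <- lm(mpg ~ poly(wt, 2), data = mtcars, subset = am == 1)

# Subsetting in advance: poly(wt, 2) only ever sees the am == 1 rows.
fit_pre <- lm(mpg ~ poly(wt, 2), data = mtcars[mtcars$am == 1, ])

# The coefficients differ because the two polynomial bases differ.
coef(fit_arg)
coef(fit_pre)

# Both bases span the same space as 1, wt, wt^2 on the retained rows,
# so the fitted values should still agree.
all.equal(unname(fitted(fit_arg)), unname(fitted(fit_pre)))
```

As far as I can tell, the difference is therefore in the parameterization rather than in the fitted curve: the centering and scaling of the basis come from the full data, which is the part that looks like leakage.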


If this is a design feature that you don’t think should be fixed, can you 
please document it and explain why it makes sense to compute the orthogonal 
polynomials on the entire dataset? This feels like a serious leak of 
information when evaluating train and test datasets in a statistical learning 
framework.

Ray

Raymond R. Balise, PhD
Assistant  Professor
Department of Public Health Sciences, Biostatistics

University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136




______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
