I agree that it seems non-intuitive (I can't think of a design reason for it to work this way), but I'd like to stress that it's *not* an information leak; the predictions of the model are independent of the parameterization, which is all this issue affects. In the worst case, there might be some unfortunate effects on numerical stability if the data-dependent bases are computed on a very different set of data than the model fitting actually uses.
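
A minimal sketch of this point, using made-up data (the data frame `dd`, the logical vector `keep`, and the model names are illustrative assumptions, not taken from the Stack Overflow example). The coefficients should differ between the two fits, while the fitted values and predictions should agree up to numerical tolerance:

```r
# Illustrative data (hypothetical; not from the original thread)
set.seed(101)
dd <- data.frame(x = 1:20)
dd$y <- 1 + 0.5 * dd$x - 0.02 * dd$x^2 + rnorm(20, sd = 0.5)
keep <- dd$x <= 10

# (a) subset= : the poly() basis is computed on all 20 rows,
#     and rows are dropped afterwards
m_sub <- lm(y ~ poly(x, 2), data = dd, subset = keep)

# (b) pre-subsetting: the poly() basis is computed on the retained rows only
m_pre <- lm(y ~ poly(x, 2), data = dd[keep, ])

# The coefficients differ because the two bases differ ...
coef_equal <- isTRUE(all.equal(unname(coef(m_sub)), unname(coef(m_pre))))

# ... but both fits describe the same quadratic, so fitted values and
# predictions on new data agree
fit_equal <- isTRUE(all.equal(unname(fitted(m_sub)), unname(fitted(m_pre))))
nd <- data.frame(x = c(2.5, 7.5, 15))
pred_equal <- isTRUE(all.equal(unname(predict(m_sub, nd)),
                               unname(predict(m_pre, nd))))

print(c(coef_equal = coef_equal, fit_equal = fit_equal,
        pred_equal = pred_equal))
```

Both models span the same space (intercept, linear and quadratic terms in `x`), which is why only the parameterization changes.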

I've attached a suggested documentation patch. (I hope it makes it through to the list; if not, I can add it to the body of a message.)



On 12/26/21 8:35 PM, Balise, Raymond R wrote:
Hello R folks,
Today I noticed that using the subset argument in lm() with a polynomial gives 
a different result than using the polynomial when the data has already been 
subsetted. This was not at all intuitive for me. You can see an example 
here: 
https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i

If this is a design feature that you don’t think should be 
fixed, can you please include it in the documentation and explain why it makes 
sense to figure out the orthogonal polynomials on the entire dataset?  This 
feels like a serious leak of information when evaluating train and test datasets 
in a statistical learning framework.

Ray

Raymond R. Balise, PhD
Assistant Professor
Department of Public Health Sciences, Biostatistics

University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136




______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics
Index: lm.Rd
===================================================================
--- lm.Rd       (revision 81416)
+++ lm.Rd       (working copy)
@@ -33,7 +33,9 @@
     typically the environment from which \code{lm} is called.}
 
   \item{subset}{an optional vector specifying a subset of observations
-    to be used in the fitting process.}
+    to be used in the fitting process. (See additional details about how
+    this argument interacts with data-dependent bases in the
+    \sQuote{Details} section of the \code{\link{model.frame}} documentation.)}
 
   \item{weights}{an optional vector of weights to be used in the fitting
     process.  Should be \code{NULL} or a numeric vector.
Index: model.frame.Rd
===================================================================
--- model.frame.Rd      (revision 81416)
+++ model.frame.Rd      (working copy)
@@ -38,7 +38,9 @@
   \item{subset}{a specification of the rows to be used: defaults to all
     rows. This can be any valid indexing vector (see
     \code{\link{[.data.frame}}) for the rows of \code{data} or if that is not
-    supplied, a data frame made up of the variables used in \code{formula}.}
+    supplied, a data frame made up of the variables used in
+    \code{formula}. (See additional details about how this argument
+    interacts with data-dependent bases under \sQuote{Details} below.)}
 
   \item{na.action}{how \code{NA}s are treated.  The default is first,
     any \code{na.action} attribute of \code{data}, second
@@ -103,6 +105,12 @@
   character variable is found, it is converted to a factor (as from \R
   2.10.0).
 
+  Because variables in the formula are evaluated before rows are dropped
+  based on \code{subset}, the characteristics of data-dependent bases
+  such as orthogonal polynomials (i.e. from terms using
+  \code{\link{poly}}) or splines will be computed based on the full data
+  set rather than the subsetted data set.
+
   Unless \code{na.action = NULL}, time-series attributes will be removed
   from the variables found (since they will be wrong if \code{NA}s are
   removed).