On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler <maech...@stat.math.ethz.ch> wrote: > I have been asked (in private)
Martin was very polite to not share my name, but it was me :) > > Hi Martin, > > y <- c(1, 2, 3, NA, 4) > x <- c(1, 2, 2, 1, 1) > > t.test(y ~ x) > lm(y ~ x) > > > Normally, most R functions follow the principle that > > "missings should never silently go missing". Do you have > > any background on why these functions (and most model/test > > function in stats, I presume) silently drop missing values? > > And I think, the issue and an answer are important enough to be > public, hence this posting to R-help : > > First note that in some sense it is not true that lm(y ~ x) silently drops > NAs: Everybody who is taught about lm() is taught to look at > summary( fm ) where fm <- lm(y ~ x) Good point - unfortunately the message was a bit subtle for me, and don't remember being taught to look for it :( > and that (for the above case) "says" > > ' (1 observation deleted due to missingness) ' > > and so is not entirely silent. > > > This goes all back to the idea of having an 'na.action' argument > which may be a function (a very good idea, "functional > programming" in a double sense!)... which Chambers et al > introduced in The White Book (*1) and which I think to remember > was quite a revolutionary idea; at least I had liked that very much > once I understood the beauty of passing functions as arguments > to other functions. > One problem already back then has been that we already had the > ---much more primitive but often sufficient--- standard of an > 'na.rm = FALSE' (i.e. default FALSE) argument. > > This has been tought in all good classes/course about statistical > modeling with S and R ever since ... I had hoped .... > (but it seems I was too optimistic, .. or many students have too > quickly forgotten what they were taught ..) > Notably the white book itself, and the MASS (*2) book do teach > this.. though possibly not loudly enough. > > Two more decisions about this were made back then, as well: > > 1) The default for na.action to be na.omit (==> "silently dropping") > > 2) na.action being governed by options(na.action = ..) > > '1)' may have been mostly "historical": I think it had been the behavior of > other "main stream" statistical packages back then (and now?) *and* > possibly more importantly, notably with the later (than white book = "S3") > advent of na.replace, you did want to keep the missing in your > data frame, for later analysis; e.g. drawing (-> "gaps" in > plots) so the NA *were* carried along and would not be > forgotten, something very important in most case.s. > > and '2)' is something I personally no longer like very > much, as it is "killing" the functional paradigm. > OTOH, it has to be said in favor of that "session wide" / "global" setting > options(na.action = *) > that indeed it depends on the specific data analysis, or even > the specific *phase* of a data analysis, *what* behavior of NA > treatment is desired.... and at the time it was thought smart > that all methods (also functions "deep down" called by > user-called functtions) would automatically use the "currently > desired" NA handling. > > There have been recommendations (I don't know exactly where and > by whom) to always set > > options(na.action = na.fail) > > in your global .First() or nowadays rather your Rprofile, and > I assume that some of the CRAN packages and some of the "teaching > setups" would do that (and if you do that, the lm() and t.test() > calls above give an error). I think that's a bit too strict for me, so I wrote my own: na.warn <- function(object, ...) { missing <- complete.cases(object) if (any(missing)) { warning("Dropping ", sum(missing), " rows with missing values", call. = FALSE) } na.exclude(object, ...) } That gives me the behaviour I want: options(na.action = na.warn) y <- c(1, 2, 3, NA, 4) x <- c(1, 2, 2, 1, 1) mod <- lm(y ~ x) #> Warning: Dropping 4 rows with missing values predict(mod) #> 1 2 3 4 5 #> 2.5 2.5 2.5 NA 2.5 resid(mod) #> 1 2 3 4 5 #> -1.5 -0.5 0.5 NA 1.5 To me, this would be the most sensible default behaviour, but I realise it's too late to change without breaking many existing expectations. On a related note, I've never really understood why it's called na.exclude - from my perspective it causes the _inclusion_ of missing values in the predictions/residuals. Thanks for the (as always!) informative response, Martin. Hadley -- http://hadley.nz ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.