Re: [R] A Tip: lm, glm, and retained cases
In R-devel na.action(GLM) will work as the extractor.

The problem with attr(GLM$model, "na.action") is that the 'model'
component is optional, and with model.frame(ModelObject) that if the
'model' component has been omitted it will try to recreate the model
frame from the currently visible objects of the name originally used.
(Because that is error-prone, we switched to model=TRUE as the default.)

In earlier versions of R, GLM$na.action is the copy you want.

However, I think if you care about omitted rows, you should use
na.action=na.exclude, for then most auxiliary functions will give you
results for all the rows.

On Tue, 26 Aug 2008, Marc Schwartz wrote:

> on 08/26/2008 07:31 PM (Ted Harding) wrote:
>> On 26-Aug-08 23:49:37, hadley wickham wrote:
>>> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi Folks,
>>>> This tip is probably lurking somewhere already, but I've just
>>>> discovered it the hard way, so it is probably worth passing
>>>> on for the benefit of those who might otherwise hack their
>>>> way along the same path.
>>>>
>>>> Say (for example) you want to do a logistic regression of a
>>>> binary response Y on variables X1, X2, X3, X4:
>>>>
>>>>   GLM <- glm(Y ~ X1 + X2 + X3 + X4)
>>>>
>>>> Say there are 1000 cases in the data. Because of missing values
>>>> (NAs) in the variables, the number of complete cases retained
>>>> for the regression is, say, 600. glm() does this automatically.
>>>>
>>>> QUESTION: Which cases are they?
>>>>
>>>> You can of course find out "by hand" on the lines of
>>>>
>>>>   ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>>>
>>>> but one feels that GLM already knows -- so how to get it to talk?
>>>>
>>>> ANSWER: (e.g.)
>>>>
>>>>   ix <- as.integer(names(GLM$fit))

This is a partial match to 'fitted', and will only work if default row
names were used.

>>> Alternatively, you can use:
>>>
>>>   attr(GLM$model, "na.action")
>>>
>>> Hadley
>>
>> Thanks! I can see that it works -- though understanding how
>> requires a deeper knowledge of "R internals". However, since
>> you've approached it from that direction, simply
>>
>>   GLM$model
>>
>> is a dataframe of the retained cases (with corresponding
>> row-names), all variables at once, and that is possibly an
>> even simpler approach!
>
> Or just use:
>
>   model.frame(ModelObject)
>
> as the extractor function... :-)
>
> Another 'a priori' approach would be to use na.omit() or one of its
> brethren on the data frame before creating the model. Which function is
> used depends upon how 'na.action' is set.
>
> The returned value, or more specifically the 'na.action' attribute as
> appropriate, would yield information similar to Hadley's approach
> relative to which records were excluded.
>
> For example, using the simple data frame in ?na.omit:
>
> DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
>
> > DF
>   x  y
> 1 1  0
> 2 2 10
> 3 3 NA
>
> DF.na <- na.omit(DF)
>
> > DF.na
>   x  y
> 1 1  0
> 2 2 10
>
> > attr(DF.na, "na.action")
> 3
> 3
> attr(,"class")
> [1] "omit"
>
> So you can see that record 3 was removed from the original data frame
> due to the NA for 'y'.
>
> HTH,
>
> Marc Schwartz

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK
Fax: +44 1865 272595

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
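The points above about na.action() and the optional 'model' component can be checked with a small sketch. It is only an illustration under stated assumptions: the built-in airquality data set stands in for the poster's data, and the object names (fit, fit_nomodel, dropped) are made up for this example.

```r
# Illustrative data: airquality ships with R and has NAs in Ozone and Solar.R.
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# na.action() as the extractor: it returns the dropped rows
# as an object of class "omit".
dropped <- na.action(fit)
class(dropped)
head(dropped)

# Fit again without storing the model frame (model = FALSE).
fit_nomodel <- lm(Ozone ~ Solar.R + Wind, data = airquality, model = FALSE)

# The 'model' component is now absent, so attr(fit_nomodel$model, "na.action")
# has nothing to read ...
is.null(fit_nomodel$model)

# ... but the na.action component of the fit is still there.
identical(fit_nomodel$na.action, dropped)
```

Reading the na.action component (or calling na.action()) avoids any dependence on whether the model frame was stored with the fit.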
Re: [R] A Tip: lm, glm, and retained cases
Marc Schwartz wrote:
> on 08/26/2008 07:31 PM (Ted Harding) wrote:
>> On 26-Aug-08 23:49:37, hadley wickham wrote:
>>> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi Folks,
>>>> This tip is probably lurking somewhere already, but I've just
>>>> discovered it the hard way, so it is probably worth passing
>>>> on for the benefit of those who might otherwise hack their
>>>> way along the same path.
>>>>
>>>> Say (for example) you want to do a logistic regression of a
>>>> binary response Y on variables X1, X2, X3, X4:
>>>>
>>>>   GLM <- glm(Y ~ X1 + X2 + X3 + X4)
>>>>
>>>> Say there are 1000 cases in the data. Because of missing values
>>>> (NAs) in the variables, the number of complete cases retained
>>>> for the regression is, say, 600. glm() does this automatically.
>>>>
>>>> QUESTION: Which cases are they?
>>>>
>>>> You can of course find out "by hand" on the lines of
>>>>
>>>>   ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>>>
>>>> but one feels that GLM already knows -- so how to get it to talk?
>>>>
>>>> ANSWER: (e.g.)
>>>>
>>>>   ix <- as.integer(names(GLM$fit))
>>>
>>> Alternatively, you can use:
>>>
>>>   attr(GLM$model, "na.action")
>>>
>>> Hadley
>>
>> Thanks! I can see that it works -- though understanding how
>> requires a deeper knowledge of "R internals". However, since
>> you've approached it from that direction, simply
>>
>>   GLM$model
>>
>> is a dataframe of the retained cases (with corresponding
>> row-names), all variables at once, and that is possibly an
>> even simpler approach!
>
> Or just use:
>
>   model.frame(ModelObject)
>
> as the extractor function... :-)
>
> Another 'a priori' approach would be to use na.omit() or one of its
> brethren on the data frame before creating the model. Which function is
> used depends upon how 'na.action' is set.
>
> The returned value, or more specifically the 'na.action' attribute as
> appropriate, would yield information similar to Hadley's approach
> relative to which records were excluded.
>
> For example, using the simple data frame in ?na.omit:
>
> DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
>
> > DF
>   x  y
> 1 1  0
> 2 2 10
> 3 3 NA
>
> DF.na <- na.omit(DF)
>
> > DF.na
>   x  y
> 1 1  0
> 2 2 10
>
> > attr(DF.na, "na.action")
> 3
> 3
> attr(,"class")
> [1] "omit"
>
> So you can see that record 3 was removed from the original data frame
> due to the NA for 'y'.

Also notice the possibility of

  (g)lm(., na.action=na.exclude)

as in

  library(ISwR); attach(thuesen)
  fit <- lm(short.velocity ~ blood.glucose, na.action=na.exclude)
  which(is.na(fitted(fit))) # 16

This is often recommendable anyway, e.g. in case you want to plot
residuals against original predictors.

--
Peter Dalgaard
Øster Farimagsgade 5, Entr.B
Dept. of Biostatistics, PO Box 2099, 1014 Cph. K
University of Copenhagen, Denmark
Ph: (+45) 35327918
([EMAIL PROTECTED])
FAX: (+45) 35327907
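For readers without the ISwR package, the na.exclude behaviour described above can be sketched with the built-in airquality data. The data set and the object names (fit_omit, fit_exclude, res_df) are assumptions made for this illustration and do not come from the thread.

```r
# Default behaviour (na.omit): rows with NAs are dropped, and residuals()
# and fitted() return one value per retained case only.
fit_omit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# na.exclude: the same rows are dropped for fitting, but residuals()
# and fitted() are padded with NA so they match the original rows.
fit_exclude <- lm(Ozone ~ Solar.R + Wind, data = airquality,
                  na.action = na.exclude)

length(residuals(fit_omit))     # number of retained cases
length(residuals(fit_exclude))  # number of rows in airquality

# Rows that did not enter the fit.
which(is.na(fitted(fit_exclude)))

# Because the lengths match, residuals can be paired directly with the
# original predictor, e.g. for a residual plot.
res_df <- data.frame(Wind = airquality$Wind,
                     resid = residuals(fit_exclude))
head(res_df)
```

The coefficients of the two fits are the same; only the shape of the per-case results differs.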
Re: [R] A Tip: lm, glm, and retained cases
on 08/26/2008 07:31 PM (Ted Harding) wrote:
> On 26-Aug-08 23:49:37, hadley wickham wrote:
>> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
>> <[EMAIL PROTECTED]> wrote:
>>> Hi Folks,
>>> This tip is probably lurking somewhere already, but I've just
>>> discovered it the hard way, so it is probably worth passing
>>> on for the benefit of those who might otherwise hack their
>>> way along the same path.
>>>
>>> Say (for example) you want to do a logistic regression of a
>>> binary response Y on variables X1, X2, X3, X4:
>>>
>>>   GLM <- glm(Y ~ X1 + X2 + X3 + X4)
>>>
>>> Say there are 1000 cases in the data. Because of missing values
>>> (NAs) in the variables, the number of complete cases retained
>>> for the regression is, say, 600. glm() does this automatically.
>>>
>>> QUESTION: Which cases are they?
>>>
>>> You can of course find out "by hand" on the lines of
>>>
>>>   ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>>
>>> but one feels that GLM already knows -- so how to get it to talk?
>>>
>>> ANSWER: (e.g.)
>>>
>>>   ix <- as.integer(names(GLM$fit))
>>
>> Alternatively, you can use:
>>
>>   attr(GLM$model, "na.action")
>>
>> Hadley
>
> Thanks! I can see that it works -- though understanding how
> requires a deeper knowledge of "R internals". However, since
> you've approached it from that direction, simply
>
>   GLM$model
>
> is a dataframe of the retained cases (with corresponding
> row-names), all variables at once, and that is possibly an
> even simpler approach!

Or just use:

  model.frame(ModelObject)

as the extractor function... :-)

Another 'a priori' approach would be to use na.omit() or one of its
brethren on the data frame before creating the model. Which function is
used depends upon how 'na.action' is set.

The returned value, or more specifically the 'na.action' attribute as
appropriate, would yield information similar to Hadley's approach
relative to which records were excluded.

For example, using the simple data frame in ?na.omit:

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

> DF
  x  y
1 1  0
2 2 10
3 3 NA

DF.na <- na.omit(DF)

> DF.na
  x  y
1 1  0
2 2 10

> attr(DF.na, "na.action")
3
3
attr(,"class")
[1] "omit"

So you can see that record 3 was removed from the original data frame
due to the NA for 'y'.

HTH,

Marc Schwartz
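A short sketch of the 'a priori' route next to model.frame(), again on the built-in airquality data (an assumption for illustration, as are the names vars, aq_complete, removed and fit_raw). One detail worth noting: na.omit() on a whole data frame also drops rows whose NAs sit in columns the model never uses, so the sketch restricts the data frame to the model variables first.

```r
# Variables that appear in the model formula (illustrative choice).
vars <- c("Ozone", "Solar.R", "Wind")

# 'A priori' approach: remove incomplete rows before fitting.
aq_complete <- na.omit(airquality[, vars])
removed <- attr(aq_complete, "na.action")  # rows dropped by na.omit()
fit <- lm(Ozone ~ Solar.R + Wind, data = aq_complete)

# Nothing was left for lm() to drop, so the fit records no na.action.
is.null(na.action(fit))

# Letting lm() handle the NAs retains the same rows, as the row names
# of the extracted model frame show.
fit_raw <- lm(Ozone ~ Solar.R + Wind, data = airquality)
identical(rownames(model.frame(fit_raw)), rownames(aq_complete))
```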
Re: [R] A Tip: lm, glm, and retained cases
On 26-Aug-08 23:49:37, hadley wickham wrote:
> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
> <[EMAIL PROTECTED]> wrote:
>> Hi Folks,
>> This tip is probably lurking somewhere already, but I've just
>> discovered it the hard way, so it is probably worth passing
>> on for the benefit of those who might otherwise hack their
>> way along the same path.
>>
>> Say (for example) you want to do a logistic regression of a
>> binary response Y on variables X1, X2, X3, X4:
>>
>>   GLM <- glm(Y ~ X1 + X2 + X3 + X4)
>>
>> Say there are 1000 cases in the data. Because of missing values
>> (NAs) in the variables, the number of complete cases retained
>> for the regression is, say, 600. glm() does this automatically.
>>
>> QUESTION: Which cases are they?
>>
>> You can of course find out "by hand" on the lines of
>>
>>   ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>
>> but one feels that GLM already knows -- so how to get it to talk?
>>
>> ANSWER: (e.g.)
>>
>>   ix <- as.integer(names(GLM$fit))
>
> Alternatively, you can use:
>
>   attr(GLM$model, "na.action")
>
> Hadley

Thanks! I can see that it works -- though understanding how
requires a deeper knowledge of "R internals". However, since
you've approached it from that direction, simply

  GLM$model

is a dataframe of the retained cases (with corresponding
row-names), all variables at once, and that is possibly an
even simpler approach!

Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 27-Aug-08  Time: 01:31:46
-- XFMail --
Re: [R] A Tip: lm, glm, and retained cases
On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
<[EMAIL PROTECTED]> wrote:
> Hi Folks,
> This tip is probably lurking somewhere already, but I've just
> discovered it the hard way, so it is probably worth passing
> on for the benefit of those who might otherwise hack their
> way along the same path.
>
> Say (for example) you want to do a logistic regression of a
> binary response Y on variables X1, X2, X3, X4:
>
>   GLM <- glm(Y ~ X1 + X2 + X3 + X4)
>
> Say there are 1000 cases in the data. Because of missing values
> (NAs) in the variables, the number of complete cases retained
> for the regression is, say, 600. glm() does this automatically.
>
> QUESTION: Which cases are they?
>
> You can of course find out "by hand" on the lines of
>
>   ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>
> but one feels that GLM already knows -- so how to get it to talk?
>
> ANSWER: (e.g.)
>
>   ix <- as.integer(names(GLM$fit))

Alternatively, you can use:

  attr(GLM$model, "na.action")

Hadley

--
http://had.co.nz/
[R] A Tip: lm, glm, and retained cases
Hi Folks,
This tip is probably lurking somewhere already, but I've just
discovered it the hard way, so it is probably worth passing
on for the benefit of those who might otherwise hack their
way along the same path.

Say (for example) you want to do a logistic regression of a
binary response Y on variables X1, X2, X3, X4:

  GLM <- glm(Y ~ X1 + X2 + X3 + X4)

Say there are 1000 cases in the data. Because of missing values
(NAs) in the variables, the number of complete cases retained
for the regression is, say, 600. glm() does this automatically.

QUESTION: Which cases are they?

You can of course find out "by hand" on the lines of

  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

but one feels that GLM already knows -- so how to get it to talk?

ANSWER: (e.g.)

  ix <- as.integer(names(GLM$fit))

Reason: When glm(Y~X1+...) picks up the data passed to it, it
assigns[*] to each element of Y a name which is its integer
position in the variable, expressed as a character string
("1", "2", "3", ... ).

[*] Assuming (as is usually the case) that the elements didn't
    have names in the first place. Otherwise these names are
    used; modify the above approach accordingly.

These names are retained during the computation, and when
incomplete cases are dropped the retained complete cases retain
their original names. Thus, any per-case series of computed
values (such as $fit) has the names of the retained cases
the values correspond to. These can be discovered by

  names(GLM$fit)

but you don't want them as character strings, so convert them
to integers:

  as.integer(names(GLM$fit))

Done!

I hope this helps some people.
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 27-Aug-08  Time: 00:45:47
-- XFMail --
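The tip can be tried end to end on simulated data. Everything in this sketch is an assumption made for illustration: the data frame dat, the coefficients used to simulate Y, and the number of values set to NA. Two details differ from the post on purpose: family = binomial is passed, because glm() without a family fits a Gaussian model rather than a logistic regression, and fitted(fit) is spelled out instead of relying on $fit partially matching 'fitted.values'.

```r
# Simulated data; all names and numbers here are illustrative.
set.seed(1)
n <- 1000
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
dat$Y <- rbinom(n, 1, plogis(0.5 * dat$X1 - 0.25 * dat$X2))

# Set some values to NA at random so that glm() has rows to drop.
for (v in names(dat)) dat[[v]][sample(n, 80)] <- NA

# Logistic regression; incomplete cases are dropped automatically.
fit <- glm(Y ~ X1 + X2 + X3 + X4, family = binomial, data = dat)

# The tip: fitted values carry the row names of the retained cases.
# This relies on dat having default row names ("1", "2", ...).
kept <- as.integer(names(fitted(fit)))

# The "by hand" answer, for comparison.
by_hand <- which(complete.cases(dat[, c("Y", "X1", "X2", "X3", "X4")]))

identical(kept, by_hand)
length(kept)
```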