Re: [R-pkg-devel] Determine subset from glm object
On second thoughts it may be better to preserve the original data and na.action in the call to glm. So then you might combine the idea of a dummy model frame with evaluating the subset, e.g.

    mfcall <- call("model.frame", reformulate(all.vars(f)), data = data)
    mf <- eval(mfcall, parent.frame())
    mf$id <- seq_len(nrow(mf))
    subset <- mf$id %in% model.frame(~ id, data = mf, subset = subset)$id

This will give the subset as a logical vector whether it was originally supplied as logical, numeric or character. Then you might combine this with the logical vector based on the first glm as follows:

    subset[subset] <- linearity

On Mon, Jul 9, 2018, at 10:14 PM, Heather Turner wrote:
> Good point. In that case a solution might be to create a model frame
> based on the named variables [...]
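A minimal sketch of the two steps above on toy data. The data frame `d`, the helper `as_logical_subset()` and the values of `linearity` are made up for illustration, and it assumes that na.action removes no rows, so that `linearity` has exactly one entry per row selected by the subset.

```r
# Toy data with non-default row names; every name here is illustrative
d <- data.frame(x = 1:6, row.names = letters[1:6])
d$id <- seq_len(nrow(d))

# Turn a logical, numeric or character subset into a logical vector by
# letting model.frame() do the subsetting and checking which ids survive
as_logical_subset <- function(subset) {
  kept <- model.frame(~ id, data = d, subset = subset)$id
  d$id %in% kept
}

from_logical   <- as_logical_subset(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
from_numeric   <- as_logical_subset(c(1, 3, 4, 6))
from_character <- as_logical_subset(c("a", "c", "d", "f"))

# Hypothetical result of the first fit: one entry per row that was used
linearity <- c(TRUE, FALSE, TRUE, TRUE)

# Narrow the subset: only rows that were used *and* flagged stay TRUE
new_subset <- from_logical
new_subset[new_subset] <- linearity
new_subset
```

If na.action does drop rows in the first fit, `linearity` will be shorter than the number of TRUE entries and the last assignment would need the dropped rows accounted for first.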
Re: [R-pkg-devel] Determine subset from glm object
Good point. In that case a solution might be to create a model frame based on the named variables, e.g.

    # general formula
    f <- ~ log(x) + ns(v, df = 2)
    # model frame based on "bare" variables; deal with user-supplied subset,
    # data, na.action, etc
    mfcall <- call("model.frame", reformulate(all.vars(f)), subset = subset,
                   data = data, na.action = na.action)
    mf <- eval(mfcall, parent.frame())

Then `mf` can be passed as the data argument to `glm` without any subset argument for the first model and with the new subset argument for the second model.

On Mon, Jul 9, 2018, at 5:06 PM, Ben Bolker wrote:
> From painful experience: model.frame() does *NOT* necessarily return a
> data frame that can be successfully used as the data= argument for models.
> [...]
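A rough sketch of the bare-variable model frame on simulated data. The data frame `d`, the response `y` and the subset condition are assumptions made for illustration, and `model.frame()` is called directly here rather than through a constructed call.

```r
library(splines)  # for ns()

# Simulated data; all names and values are illustrative
set.seed(42)
d <- data.frame(x = runif(20, 1, 10), v = rnorm(20), y = rbinom(20, 1, 0.5))
d$x[3] <- NA

# Two-sided version of the general formula, so the response is picked up too
f <- y ~ log(x) + ns(v, df = 2)

# all.vars() keeps only the variable names, dropping log() and ns()
bare <- reformulate(all.vars(f))   # ~y + x + v

# Model frame on the bare variables, with subset and na.action applied once
mf <- model.frame(bare, data = d, subset = v > -1, na.action = na.omit)
names(mf)

# mf still holds the untransformed x and v, so the original formula can be
# refitted against it (default gaussian family, just to show that it runs)
fit <- glm(f, data = mf)
```

Because `mf` holds `x` and `v` rather than `log(x)` and the spline basis, the transformations are recomputed at each fit instead of being frozen in the data.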
Re: [R-pkg-devel] Determine subset from glm object
From painful experience: model.frame() does *NOT* necessarily return a data frame that can be successfully used as the data= argument for models.

 - transformed variables (e.g. log(x)) will be in the model frame rather than the original variables, so when model.frame() is called again within glm(), it won't find the original variables
 - variables with data-dependent bases (poly(), ns(), etc.) get computed and stuck in the model frame - again, the original variables are inaccessible

On 2018-07-09 11:20 AM, Heather Turner wrote:
> As you are calling stats::glm, you can use `model.frame` to get the data used
> to fit the model after applying subset and na.action. So then you can do:
>
>     call.glm$subset <- linearity
>     call.glm$data <- model.frame(gout)
>
> I think this is what you are after?

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
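The first bullet can be seen on a small example. The data frame `d` and the model are invented for illustration.

```r
# Toy data; names are illustrative
set.seed(1)
d <- data.frame(x = 1:10, y = rnorm(10))

fit <- glm(y ~ log(x), data = d)
mf <- model.frame(fit)

# The model frame stores the transformed column under the name "log(x)";
# there is no column called "x" any more
names(mf)

# Refitting against mf therefore cannot find x in the data. Unless some
# other x happens to be visible from the formula's environment, this fails.
refit <- try(glm(y ~ log(x), data = mf), silent = TRUE)
```

The same thing happens with poly() or ns(): the model frame holds the computed basis matrix, not the variable it was computed from.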
Re: [R-pkg-devel] Determine subset from glm object
On Sun, Jul 8, 2018, at 8:25 PM, Charles Geyer wrote:
> I spoke too soon. The problem isn't that I don't know how to get the
> subset argument. I am just calling glm (via eval) with (mostly) the
> same arguments as the call to my function, so subset is (if not
> missing) an argument to my function too. So I can just use it.
>
> The problem is that I then want to call glm again fitting a subset of
> the original subset (if there was one). And when I do that glm will
> refer to the original data wherever it is, and I don't have that.
>
> if this isn't clear, here is the code as it stands now
> https://github.com/cjgeyer/glmdr/blob/master/package/glmdr/R/glmdr.R.
>
> The issue is with the lines (very near the end)
>
>     subset.lcm <- as.integer(rownames(modmat))
>     subset.lcm <- subset.lcm[linearity]
>     # call glm again
>     call.glm$subset <- subset.lcm
>     gout.lcm <- eval(call.glm, parent.frame())
>
> I can see from what Duncan said that I really don't want the
> as.integer around rownames. But it is not clear what would be better.
>
> I just had another thought that I could get the original data with
> another call to glm with subset removed from the call and method =
> "model.frame" added. And I think (maybe, have to try it) that it
> would have NA's removed or whatever na.action says to do.
> But that seems redundant.

As you are calling stats::glm, you can use `model.frame` to get the data used to fit the model after applying subset and na.action. So then you can do:

    call.glm$subset <- linearity
    call.glm$data <- model.frame(gout)

I think this is what you are after?

Heather
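A small sketch of this suggestion for the simple case where the formula uses only bare variables. The data, the call and the values of `linearity` are made up for illustration.

```r
# Toy data and a stored call, standing in for the call built inside the function
d <- data.frame(x = 1:8, y = c(0, 1, 0, 1, 1, 0, 1, 0))
call.glm <- quote(glm(y ~ x, data = d, subset = x > 2))
gout <- eval(call.glm)

# Hypothetical logical vector with one entry per row used in the first fit
linearity <- rep(c(TRUE, FALSE), 3)

# Refit on the rows actually used, narrowed further by linearity
call.glm$data <- model.frame(gout)
call.glm$subset <- linearity
gout.lcm <- eval(call.glm)

rownames(model.frame(gout.lcm))
```

The row names of the second model frame are still those of the original data, so the rows can be traced back to `d`.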
Re: [R-pkg-devel] Determine subset from glm object
If there might be NA's in the response or predictors so na.exclude or na.omit would remove some rows as well, then using the row.names might be an easier way to match up rows in the original data with rows in gout$x.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
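A brief sketch of matching by row names when both subset and na.action remove rows. The data frame `d` and its column `w` are invented for illustration.

```r
# Toy data with missing values in a predictor and in the response
d <- data.frame(w = 1:8,
                x = c(1, 2, NA, 4, 5, 6, 7, 8),
                y = c(0, 1, 1, 0, 1, NA, 0, 1))

# Row 1 is excluded by subset; rows 3 and 6 are dropped by the default na.omit
gout <- glm(y ~ x, data = d, subset = w > 1, x = TRUE)

# Row labels of the model matrix identify the rows that were actually used
used <- rownames(gout$x)
in_fit <- rownames(d) %in% used
in_fit
```

Here `in_fit` is a logical vector over the rows of the original data, whichever mechanism removed a row.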
Re: [R-pkg-devel] Determine subset from glm object
I think your second option sounds better because this is all happening inside one function I'm writing so users won't be able to mess with the glm object. Many thanks.

On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch wrote:

> On 08/07/2018 11:48 AM, Charles Geyer wrote:
> > I need to find out from an object returned by R function glm with argument
> > x = TRUE
> > what the subsetting was. It appears that if gout is that object, then
> >
> >     as.integer(rownames(gout$x))
> >
> > is a subset vector equivalent to the one actually used.
>
> You don't want the "as.integer". If the dataframe had rownames to start
> with, the x component of the fit will have row labels consisting of
> those labels, so as.integer may fail. Even if it doesn't, the rownames
> aren't necessarily sequential integers. You can index the dataframe by
> the character versions of the default numbers, so simply
> rownames(gout$x) should always work.
>
> More generally, I'm not sure your question is well posed. What do you
> mean by "the subsetting"? If you have something like
>
>     df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
>     df1 <- subset(df, letters > "b" & letters < "y")
>     gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
>
> the rownames(gout$x) are going to be numbers for rows of df, because df1
> will get a subset of those as row labels.
>
> > I do also have the call to glm (as a call object) so can determine the
> > actual subset argument, but this seems to be not so useful because I don't
> > know the length of the original variables before subsetting.
>
> You should be able to evaluate the subset expression in the environment
> of the formula, i.e.
>
>     eval(gout$call$subset, envir = environment(gout$formula))
>
> This may give incorrect results if the variables used in subsetting
> aren't in the dataframe and have changed since glm() was called.
>
> > So now my questions. Is this idea above (using rownames) OK even though I
> > cannot find where (if anywhere) it is documented? Is there a better way?
> > One more guaranteed to be correct in the future?
>
> I would trust evaluating the subset more than grabbing row labels from
> gout$x, but I don't know for sure it is likely to be more robust.
>
> Duncan Murdoch
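Duncan's example can be run to see why the row labels should be used as character strings rather than converted to integers. The seed and the names `by_name` and `by_pos` are additions for illustration.

```r
set.seed(1)  # only so that the simulated response is reproducible
df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
df1 <- subset(df, letters > "b" & letters < "y")
gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)

# The labels are row labels inherited from df, not positions within df1
labs <- rownames(gout$x)

# Indexing by label recovers the rows that were used in the fit
by_name <- df1[labs, ]

# Converting to integer indexes by position and silently picks other rows
by_pos <- df1[as.integer(labs), ]
```

With character row names such as "a", "b", ... the as.integer() conversion would give NA instead, which is the other failure he describes.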
[R-pkg-devel] Determine subset from glm object
I need to find out from an object returned by R function glm with argument x = TRUE what the subsetting was. It appears that if gout is that object, then

    as.integer(rownames(gout$x))

is a subset vector equivalent to the one actually used.

I do also have the call to glm (as a call object) so can determine the actual subset argument, but this seems to be not so useful because I don't know the length of the original variables before subsetting.

So now my questions. Is this idea above (using rownames) OK even though I cannot find where (if anywhere) it is documented? Is there a better way? One more guaranteed to be correct in the future?

--
Charles Geyer
Professor, School of Statistics
Resident Fellow, Minnesota Center for Philosophy of Science
University of Minnesota
char...@stat.umn.edu