On second thoughts it may be better to preserve the original data and na.action in the call to glm. So then you might combine the idea of a dummy model frame with evaluating the subset, e.g.
mfcall <- call("model.frame", reformulate(all.vars(f)), data = data)
mf <- eval(mfcall, parent.frame())
mf$id <- seq_len(nrow(mf))
subset <- mf$id %in% model.frame( ~ id, data = mf, subset = subset)$id

This will give the subset as a logical vector whether it was originally supplied as logical, numeric or character. Then you might combine this with the logical vector based on the first glm as follows:

subset[subset] <- linearity

On Mon, Jul 9, 2018, at 10:14 PM, Heather Turner wrote:
> Good point. In that case a solution might be to create a model frame
> based on the named variables, e.g.
>
> # general formula
> f <- ~ log(x) + ns(v, df = 2)
> # model frame based on "bare" variables; deal with user-supplied subset,
> # data, na.action, etc
> mfcall <- call("model.frame", reformulate(all.vars(f)), subset = subset,
>                data = data, na.action = na.action)
> mf <- eval(mfcall, parent.frame())
>
> Then `mf` can be passed as the data argument to `glm` without any subset
> argument for the first model and with the new subset argument for the
> second model.
>
>
> On Mon, Jul 9, 2018, at 5:06 PM, Ben Bolker wrote:
> >
> > From painful experience: model.frame() does *NOT* necessarily return a
> > data frame that can be successfully used as the data= argument for models.
> >
> > - transformed variables (e.g. log(x)) will be in the model frame
> >   rather than the original variables, so when model.frame() is called
> >   again within glm(), it won't find the original variables
> > - variables with data-dependent bases (poly(), ns(), etc.) get
> >   computed and stuck in the model frame - again, the original variables
> >   are inaccessible
> >
> >
> > On 2018-07-09 11:20 AM, Heather Turner wrote:
> > >
> > >
> > > On Sun, Jul 8, 2018, at 8:25 PM, Charles Geyer wrote:
> > >> I spoke too soon. The problem isn't that I don't know how to get the
> > >> subset argument.
> > >> I am just calling glm (via eval) with (mostly) the
> > >> same arguments as the call to my function, so subset is (if not
> > >> missing) an argument to my function too. So I can just use it.
> > >>
> > >> The problem is that I then want to call glm again fitting a subset of
> > >> the original subset (if there was one). And when I do that glm will
> > >> refer to the original data wherever it is, and I don't have that.
> > >>
> > >> If this isn't clear, here is the code as it stands now
> > >> https://github.com/cjgeyer/glmdr/blob/master/package/glmdr/R/glmdr.R.
> > >>
> > >> The issue is with the lines (very near the end)
> > >>
> > >> subset.lcm <- as.integer(rownames(modmat))
> > >> subset.lcm <- subset.lcm[linearity]
> > >> # call glm again
> > >> call.glm$subset <- subset.lcm
> > >> gout.lcm <- eval(call.glm, parent.frame())
> > >>
> > >> I can see from what Duncan said that I really don't want the
> > >> as.integer around rownames. But it is not clear what would be better.
> > >>
> > >> I just had another thought that I could get the original data with
> > >> another call to glm with subset removed from the call and method =
> > >> "model.frame" added. And I think (maybe, have to try it) that it
> > >> would have NA's removed or whatever na.action says to do.
> > >> But that seems redundant.
> > >>
> > >>
> > > As you are calling stats::glm, you can use `model.frame` to get the data
> > > used to fit the model after applying subset and na.action. So then you
> > > can do:
> > >
> > > call.glm$subset <- linearity
> > > call.glm$data <- model.frame(gout)
> > >
> > > I think this is what you are after?
> > >
> > > Heather
> > >
> > >>
> > >> On Sun, Jul 8, 2018, 1:04 PM Charles Geyer <char...@stat.umn.edu> wrote:
> > >>>
> > >>> I think your second option sounds better because this is all happening
> > >>> inside one function I'm writing so users won't be able to mess with the
> > >>> glm object. Many thanks.
> > >>>
> > >>> On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch <murdoch.dun...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>> On 08/07/2018 11:48 AM, Charles Geyer wrote:
> > >>>>> I need to find out from an object returned by R function glm with
> > >>>>> argument
> > >>>>> x = TRUE
> > >>>>> what the subsetting was. It appears that if gout is that object, then
> > >>>>>
> > >>>>> as.integer(rownames(gout$x))
> > >>>>>
> > >>>>> is a subset vector equivalent to the one actually used.
> > >>>>
> > >>>> You don't want the "as.integer". If the dataframe had rownames to
> > >>>> start with, the x component of the fit will have row labels consisting
> > >>>> of those labels, so as.integer may fail. Even if it doesn't, the
> > >>>> rownames aren't necessarily sequential integers. You can index the
> > >>>> dataframe by the character versions of the default numbers, so simply
> > >>>> rownames(gout$x) should always work.
> > >>>>
> > >>>> More generally, I'm not sure your question is well posed. What do you
> > >>>> mean by "the subsetting"? If you have something like
> > >>>>
> > >>>> df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
> > >>>>
> > >>>> df1 <- subset(df, letters > "b" & letters < "y")
> > >>>>
> > >>>> gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
> > >>>>
> > >>>> the rownames(gout$x) are going to be numbers for rows of df, because
> > >>>> df1 will get a subset of those as row labels.
> > >>>>
> > >>>>
> > >>>>> I do also have the call to glm (as a call object) so can determine the
> > >>>>> actual subset argument, but this seems to be not so useful because I
> > >>>>> don't know the length of the original variables before subsetting.
> > >>>>
> > >>>> You should be able to evaluate the subset expression in the environment
> > >>>> of the formula, i.e.
> > >>>>
> > >>>> eval(gout$call$subset, envir = environment(gout$formula))
> > >>>>
> > >>>> This may give incorrect results if the variables used in subsetting
> > >>>> aren't in the dataframe and have changed since glm() was called.
> > >>>>
> > >>>>
> > >>>>> So now my questions. Is this idea above (using rownames) OK even
> > >>>>> though I cannot find where (if anywhere) it is documented? Is there
> > >>>>> a better way? One more guaranteed to be correct in the future?
> > >>>>>
> > >>>>
> > >>>> I would trust evaluating the subset more than grabbing row labels from
> > >>>> gout$x, but I don't know for sure it is likely to be more robust.
> > >>>>
> > >>>> Duncan Murdoch
> > >>
> > >> ______________________________________________
> > >> R-package-devel@r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
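
For illustration only, here is a minimal, self-contained sketch of the approach suggested at the top of this message: build a dummy model frame on the bare variables, turn a user-supplied subset into a logical vector, then narrow it with the logical vector from the first fit. The helper name `resolve_subset`, the toy data frame and the `linearity` values are all hypothetical. Passing `na.action = na.pass` is an assumption added here so that the dummy model frame keeps one row per row of `data`; the toy data has no missing values, so how NAs should interact with `linearity` is not addressed.

```r
# Hypothetical helper (not from the thread): resolve a subset supplied as a
# logical, numeric or character vector into a logical vector aligned with
# the rows of `data`.
resolve_subset <- function(f, data, subset = NULL) {
  # Dummy model frame on the "bare" variables of the formula. na.pass is an
  # assumption of this sketch: it keeps every row so indices line up with data.
  mf <- model.frame(reformulate(all.vars(f)), data = data, na.action = na.pass)
  mf$id <- seq_len(nrow(mf))
  if (is.null(subset)) return(rep(TRUE, nrow(mf)))
  # Let model.frame() apply the subset, then see which ids survived.
  kept <- model.frame(~ id, data = mf, subset = subset, na.action = na.pass)$id
  mf$id %in% kept
}

# Toy data with character row names, so a character subset is meaningful.
set.seed(1)
df <- data.frame(x = 1:10, y = rbinom(10, 1, 0.5), row.names = letters[1:10])
f <- y ~ log(x)

# The same three rows selected in three different ways.
s_num <- resolve_subset(f, df, subset = c(2, 4, 6))
s_lgl <- resolve_subset(f, df, subset = df$x %% 2 == 0 & df$x <= 6)
s_chr <- resolve_subset(f, df, subset = c("b", "d", "f"))

# Stand-in for the logical vector derived from the first glm fit; it has one
# entry per row used in that fit (here, the three subset rows).
linearity <- c(TRUE, FALSE, TRUE)

# Narrow the original subset to the rows flagged by `linearity`.
subset2 <- s_num
subset2[subset2] <- linearity
which(subset2)  # rows 2 and 6
```

Under these assumptions, `subset2` has the same length as the original data, so it could be passed as the `subset` argument of the second glm call together with the original `data`.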