Re: [R-pkg-devel] Determine subset from glm object

2018-07-09 Thread Heather Turner
On second thoughts it may be better to preserve the original data and na.action 
in the call to glm. So then you might combine the idea of a dummy model frame 
with evaluating the subset, e.g.

mfcall <- call("model.frame", reformulate(all.vars(f)), data = data)
mf <- eval(mfcall, parent.frame())
mf$id <- seq_len(nrow(mf))
subset <- mf$id %in% model.frame( ~ id, data = mf, subset = subset)$id

This will give the subset as a logical vector whether it was originally 
supplied as logical, numeric or character. Then you might combine this with the 
logical vector based on the first glm as follows:

subset[subset] <- linearity
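
As a rough end-to-end sketch of the steps above (not the glmdr code): the data frame `dat`, the numeric `user_subset` and the `linearity` vector below are made up for illustration, and the data are assumed to contain no missing values, so that `linearity` has exactly one entry per TRUE in the subset.

```r
# Hypothetical data with no missing values (an assumption of this sketch).
dat <- data.frame(
  x = 1:10,
  y = c(0.3, 1.1, 0.9, 1.6, 1.4, 2.2, 1.8, 2.5, 2.1, 2.9)
)
f <- y ~ log(x)
user_subset <- c(2, 4, 6, 8, 10)  # could equally be logical or character

# Dummy model frame on the bare variables, plus a row id.
mf <- model.frame(reformulate(all.vars(f)), data = dat)
mf$id <- seq_len(nrow(mf))

# Normalise the subset to a logical vector of length nrow(dat).
keep <- mf$id %in% model.frame(~ id, data = mf, subset = user_subset)$id

# First fit on the user-supplied subset.
gout <- glm(f, data = dat, subset = keep)

# Hypothetical stand-in for `linearity`: one logical per fitted row.
linearity <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Narrow the subset and refit against the original data.
keep2 <- keep
keep2[keep] <- linearity
gout2 <- glm(f, data = dat, subset = keep2)
which(keep2)
```

If na.action drops rows in the first fit, `linearity` will be shorter than `sum(keep)` and the last assignment no longer lines up, so that case would need extra handling.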

Re: [R-pkg-devel] Determine subset from glm object

2018-07-09 Thread Heather Turner
Good point. In that case a solution might be to create a model frame based on 
the named variables, e.g.

# general formula
f <- ~ log(x) + ns(v, df = 2)
# model frame based on "bare" variables; deal with user-supplied subset,
# data, na.action, etc.
mfcall <- call("model.frame", reformulate(all.vars(f)), subset = subset,
               data = data, na.action = na.action)
mf <- eval(mfcall, parent.frame())

Then `mf` can be passed as the data argument to `glm` without any subset 
argument for the first model and with the new subset argument for the second 
model.
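
A small sketch of how this might look in practice. The data frame `dat`, the two-sided formula (so that the response is carried along) and the two subset conditions are assumptions made for illustration only.

```r
library(splines)  # provides ns(); needed once the model is actually fitted

# Hypothetical data.
dat <- data.frame(x = 1:20, v = seq(0, 1, length.out = 20), y = sin(1:20))
f <- y ~ log(x) + ns(v, df = 2)

# all.vars() strips the transformations, so the frame holds raw columns.
all.vars(f)
mf <- model.frame(reformulate(all.vars(f)), data = dat, subset = x > 5)
names(mf)

# First model: mf as data, no subset argument.
fit1 <- glm(f, data = mf)

# Second model: same data, with a new subset expressed on mf's rows.
fit2 <- glm(f, data = mf, subset = x <= 15)
```

Because `mf` contains `x` and `v` rather than `log(x)` and the spline basis, glm() can rebuild the transformed terms itself on each call.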


Re: [R-pkg-devel] Determine subset from glm object

2018-07-09 Thread Ben Bolker


  From painful experience: model.frame() does *NOT* necessarily return a
data frame that can be successfully used as the data= argument for models.

  - transformed variables (e.g. log(x)) will be in the model frame
rather than the original variables,  so when model.frame() is called
again within glm(), it won't find the original variables
  - variables with data-dependent bases (poly(), ns(), etc.) get
computed and stuck in the model frame - again, the original variables
are inaccessible
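
The first pitfall can be reproduced with a tiny example. The data frame `dat` is made up, and the sketch assumes there is no object called `x` in the calling environment (otherwise the refit would silently pick that up instead of failing).

```r
# Hypothetical data.
dat <- data.frame(
  x = 1:10,
  y = c(0.3, 1.1, 0.9, 1.6, 1.4, 2.2, 1.8, 2.5, 2.1, 2.9)
)
gout <- glm(y ~ log(x), data = dat)

# The model frame stores the transformed column, not the raw variable.
mf <- model.frame(gout)
names(mf)

# Refitting with the model frame as data cannot find `x` any more.
refit <- try(glm(y ~ log(x), data = mf), silent = TRUE)
inherits(refit, "try-error")
```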



Re: [R-pkg-devel] Determine subset from glm object

2018-07-09 Thread Heather Turner



On Sun, Jul 8, 2018, at 8:25 PM, Charles Geyer wrote:
> I spoke too soon.  The problem isn't that I don't know how to get the
> subset argument. I am just calling glm (via eval) with (mostly) the
> same arguments as the call to my function, so subset is (if not
> missing) an argument to my function too.  So I can just use it.
> 
> The problem is that I then want to call glm again fitting a subset of
> the original subset (if there was one).  And when I do that glm will
> refer to the original data wherever it is, and I don't have that.
> 
> If this isn't clear, here is the code as it stands now:
> https://github.com/cjgeyer/glmdr/blob/master/package/glmdr/R/glmdr.R
> 
> The issue is with the lines (very near the end)
> 
> subset.lcm <- as.integer(rownames(modmat))
> subset.lcm <- subset.lcm[linearity]
> # call glm again
> call.glm$subset <- subset.lcm
> gout.lcm <- eval(call.glm, parent.frame())
> 
> I can see from what Duncan said that I really don't want the
> as.integer around rownames.  But it is not clear what would be better.
> 
> I just had another thought that I could get the original data with
> another call to glm with subset removed from the call and method =
> "model.frame" added.  And I think (maybe, have to try it) that it
> would have NA's removed or whatever na.action says to do.
> But that seems redundant.
> 
> 
As you are calling stats::glm, you can use `model.frame` to get the data used 
to fit the model after applying subset and na.action. So then you can do:

call.glm$subset <- linearity
call.glm$data <- model.frame(gout)

I think this is what you are after?

Heather
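
A minimal sketch of this suggestion for a formula that uses only untransformed variables. The data frame `dat` and the `linearity` vector are hypothetical, and the call is evaluated at top level here rather than in parent.frame().

```r
# Hypothetical data and first fit.
dat <- data.frame(
  x = 1:10,
  y = c(0.3, 1.1, 0.9, 1.6, 1.4, 2.2, 1.8, 2.5, 2.1, 2.9)
)
gout <- glm(y ~ x, data = dat, subset = x > 2)

# Hypothetical `linearity`: one logical per row actually used in gout.
linearity <- rep(c(TRUE, FALSE), 4)

# Reuse the original call, swapping in the model frame and the new subset.
call.glm <- gout$call
call.glm$subset <- linearity
call.glm$data <- model.frame(gout)
gout.lcm <- eval(call.glm)

# Row labels still refer to rows of the original data.
rownames(model.frame(gout.lcm))
```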

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] Determine subset from glm object

2018-07-08 Thread William Dunlap
If there might be NA's in the response or predictors, so that na.exclude or
na.omit would remove some rows as well, then using the row.names might be an
easier way to match up rows in the original data with rows in gout$x.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
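
To illustrate the point about missing values, here is a small sketch with made-up data in which both the subset argument and na.omit remove rows. The column `g` exists only to give the subset something to act on.

```r
# Hypothetical data: row 2 has a missing response, row 3 a missing predictor.
dat <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(1.2, NA, 3.1, 3.9, 5.2, 6.1),
  g = c("a", "a", "a", "a", "b", "a")
)

# subset drops row 5; the default na.action (na.omit) then drops rows 2 and 3.
gout <- glm(y ~ x, data = dat, subset = g == "a", x = TRUE)

# The row names of the model matrix identify the rows that survived both steps.
rownames(gout$x)

# They can be used directly to index the original data frame.
used <- dat[rownames(gout$x), ]
```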


Re: [R-pkg-devel] Determine subset from glm object

2018-07-08 Thread Charles Geyer
I think your second option sounds better because this is all happening
inside one function I'm writing so users won't be able to mess with the glm
object. Many thanks.

On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch 
wrote:

> On 08/07/2018 11:48 AM, Charles Geyer wrote:
> > I need to find out from an object returned by R function glm with argument
> > x = TRUE
> > what the subsetting was.  It appears that if gout is that object, then
> >
> > as.integer(rownames(gout$x))
> >
> > is a subset vector equivalent to the one actually used.
>
> You don't want the "as.integer".  If the dataframe had rownames to start
> with, the x component of the fit will have row labels consisting of
> those labels, so as.integer may fail.  Even if it doesn't, the rownames
> aren't necessarily sequential integers.   You can index the dataframe by
> the character versions of the default numbers, so simply
> rownames(gout$x) should always work.
>
> More generally, I'm not sure your question is well posed.  What do you
> mean by "the subsetting"?  If you have something like
>
> df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
>
> df1 <- subset(df, letters > "b" & letters < "y")
>
> gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
>
> the rownames(gout$x) are going to be numbers for rows of df, because df1
> will get a subset of those as row labels.
>
>
> > I do also have the call to glm (as a call object) so can determine the
> > actual subset argument, but this seems to be not so useful because I don't
> > know the length of the original variables before subsetting.
>
> You should be able to evaluate the subset expression in the environment
> of the formula, i.e.
>
> eval(gout$call$subset, envir = environment(gout$formula))
>
> This may give incorrect results if the variables used in subsetting
> aren't in the dataframe and have changed since glm() was called.
>
>
> > So now my questions.  Is this idea above (using rownames) OK even though I
> > cannot find where (if anywhere) it is documented?  Is there a better way?
> > One more guaranteed to be correct in the future?
> >
>
> I would trust evaluating the subset more than grabbing row labels from
> gout$x, but I don't know for sure it is likely to be more robust.
>
> Duncan Murdoch
>
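
The points in the quoted reply can be checked with a short sketch. It reuses the quoted `df`/`df1` example with a fixed seed; the `named` data frame is made up, and evaluating the subset inside `gout$data` (rather than only in the formula's environment) is a variation assumed here so that column names such as `letters` resolve to the data.

```r
set.seed(1)
df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
df1 <- subset(df, letters > "b" & letters < "y")
gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)

# Row labels refer to rows of df, not to positions within df1.
rownames(gout$x)

# Evaluate the stored subset expression in the stored data, falling back
# to the formula's environment for anything not found there.
sub_in_data <- eval(gout$call$subset, envir = gout$data,
                    enclos = environment(gout$formula))

# With non-numeric row names, as.integer() yields only NAs.
named <- data.frame(x = 1:4, y = c(1.1, 1.9, 3.2, 3.8),
                    row.names = c("a", "b", "c", "d"))
g2 <- glm(y ~ x, data = named, subset = x > 1, x = TRUE)
as_int <- suppressWarnings(as.integer(rownames(g2$x)))
```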



[R-pkg-devel] Determine subset from glm object

2018-07-08 Thread Charles Geyer
I need to find out from an object returned by R function glm with argument
x = TRUE what the subsetting was.  It appears that if gout is that object, then

as.integer(rownames(gout$x))

is a subset vector equivalent to the one actually used.

I do also have the call to glm (as a call object) so can determine the
actual subset argument, but this seems to be not so useful because I don't
know the length of the original variables before subsetting.

So now my questions.  Is this idea above (using rownames) OK even though I
cannot find where (if anywhere) it is documented?  Is there a better way?
One more guaranteed to be correct in the future?

-- 
Charles Geyer
Professor, School of Statistics
Resident Fellow, Minnesota Center for Philosophy of Science
University of Minnesota
char...@stat.umn.edu
