[Rd] Model formulas with explicit references

2018-07-20 Thread Lenth, Russell V
Dear R-Devel,

I seem to no longer be able to access the bug-reporting system, so am doing 
this by e-mail.

My report concerns models where variables are explicitly referenced (or is it 
"dereferenced"?), such as:

cars.lm <- lm(mtcars[[1]] ~ factor(mtcars$cyl) + mtcars[["disp"]])

I have found that it is not possible to predict such models with new data. For 
example:

> predict(cars.lm, newdata = mtcars[1:5, ])
       1        2        3        4        5        6        7        8
20.37954 20.37954 26.58543 17.70329 14.91157 18.60448 14.91157 25.52859
       9       10       11       12       13       14       15       16
25.68971 20.17199 20.17199 17.21096 17.21096 17.21096 11.85300 12.18071
      17       18       19       20       21       22       23       24
12.72688 27.38558 27.46750 27.59312 26.25500 16.05853 16.44085 15.18466
      25       26       27       28       29       30       31       32
13.81922 27.37738 26.24954 26.93772 15.15735 20.78917 16.52278 26.23042
Warning message:
'newdata' had 5 rows but variables found have 32 rows 

Instead of returning 5 predictions, it returns the 32 original predicted 
values. There is a warning message suggesting that something went wrong. This 
tickled my curiosity, hence this result:

> predict(cars.lm, newdata = data.frame(x = 1:32))
       1        2        3        4        5        6        7        8
20.37954 20.37954 26.58543 17.70329 14.91157 18.60448 14.91157 25.52859
       9       10       11       12       13       14       15       16
25.68971 20.17199 20.17199 17.21096 17.21096 17.21096 11.85300 12.18071
      17       18       19       20       21       22       23       24
12.72688 27.38558 27.46750 27.59312 26.25500 16.05853 16.44085 15.18466
      25       26       27       28       29       30       31       32
13.81922 27.37738 26.24954 26.93772 15.15735 20.78917 16.52278 26.23042

Again, the new data are ignored, but this time there is no warning message, 
because the previous warning was based only on a discrepancy between the number 
of rows in newdata and the number of predictions. Indeed, the new data set 
makes no sense at all in the context of this model.

At the root of this behavior is the fact that the model.frame function ignores 
its data argument with such models. So instead of constructing a new frame 
based on the new data, it just returns the original model frame.
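
The behavior described above can be reproduced directly with model.frame(). 
What follows is a minimal sketch, not a definitive diagnosis: the data frame 
'bogus' is a made-up name for illustration, and the call mirrors what I 
understand predict.lm() to do internally (drop the response from the terms, 
then build a model frame from the new data).

```r
# Fit the model with explicit references; no 'data' argument is used.
cars.lm <- lm(mtcars[[1]] ~ factor(mtcars$cyl) + mtcars[["disp"]])

# Terms without the response, as used when predicting.
tt <- delete.response(terms(cars.lm))

# 'bogus' (hypothetical) contains none of the objects named in the formula,
# so every variable is looked up in the formula's environment instead.
bogus <- data.frame(x = 1:5)
mf <- model.frame(tt, bogus)

nrow(bogus)  # rows supplied
nrow(mf)     # rows in the model frame that comes back
```

Comparing the two row counts shows that the frame is built from the original 
mtcars data rather than from the data supplied.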

I am not really suggesting that you try to make these things work with models 
when the formula is like this. Instead, I am hoping for an actual error rather 
than just a warning, and for a check that is a little more sophisticated than 
merely comparing the number of rows. Both predict() with newdata provided, and 
model.frame() with a data argument, should return an informative error message 
that says that model formulas like this are not supported with new data. Here 
is what appears to be an easy way to check:

> get_all_vars(terms(cars.lm))
Error in eval(inp, data, env) : object 'cyl' not found


Thanks

Russ

Russell V. Lenth  -  Professor Emeritus
Department of Statistics and Actuarial Science
The University of Iowa  -  Iowa City, IA 52242  USA 
Voice (319)335-0712 (Dept. office)  -  FAX (319)335-3017

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Model formulas with explicit references

2018-07-21 Thread Berry, Charles



> On Jul 20, 2018, at 3:05 PM, Lenth, Russell V wrote:
> 
> At the root of this behavior is the fact that the model.frame function 
> ignores its data argument with such models. So instead of constructing a new 
> frame based on the new data, it just returns the original model frame.
> 

This produces what I think you intended:

> predict(cars.lm, newdata = list(mtcars=mtcars[1:5,]) )
       1        2        3        4        5
20.37954 20.37954 26.58543 17.70329 14.91157
>


> I am not really suggesting that you try to make these things work with models 
> when the formula is like this. Instead, I am hoping that it throws an actual 
> error message rather than just a warning, and that you be a little bit more 
> sophisticated than merely checking the number of rows. Both predict() with 
> newdata provided, and model.frame() with a data argument, should return an 
> informative error message that says that model formulas like this are not 
> supported with new data.

As you can see they are supported, but you have to make sure that the objects 
in the formula can be found in the newdata arg. If this is puzzling, try 

debugonce(predict.lm)
predict(cars.lm, newdata = mtcars[1:5, ])

and inspect the newdata object and terms(object). You should see why the terms 
in the formula are not found in newdata.
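
For readers who prefer not to step through the debugger, the same point can be 
seen by comparing the symbols in the formula with the names that each form of 
newdata offers. This is only a rough sketch; 'needed', 'nd_frame' and 
'nd_list' are names made up for illustration.

```r
cars.lm <- lm(mtcars[[1]] ~ factor(mtcars$cyl) + mtcars[["disp"]])
tt <- delete.response(terms(cars.lm))

# Symbols that the predictor side of the formula needs to have evaluated.
needed <- all.vars(tt)

# A plain data frame offers column names, but no object called 'mtcars'.
nd_frame <- mtcars[1:5, ]
# A list with one element named 'mtcars' does offer that object.
nd_list <- list(mtcars = mtcars[1:5, ])

needed
"mtcars" %in% names(nd_frame)
"mtcars" %in% names(nd_list)

# Number of predictions returned for each form of newdata.
length(suppressWarnings(predict(cars.lm, newdata = nd_frame)))
length(predict(cars.lm, newdata = nd_list))
```

When 'mtcars' is not among the names in newdata, the lookup falls through to 
the environment of the formula, which is where the full data set is found.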

If you think that something like your idiom for the formula is required, maybe 
you should repost on r-help and say what you are trying to do with it. I expect 
you'll get some advice on how to reformulate your call.

HTH,

Chuck


Re: [Rd] Model formulas with explicit references

2018-07-22 Thread Lenth, Russell V
Chuck,

Thanks for your response. I am impressed to find out how to get 'newdata' to 
work with a model like this. I couldn't have figured that out from the 
documentation. The context is trying to write code for my emmeans package that 
supports whatever model the user gives it, even ones like my example. Since 
emmeans uses predictions, it stumbles on this kind of model specification. I 
may be able to work around it with the help you provided.

However, the real gist of my question is in the last part, where the call 
'predict(cars.lm, newdata = data.frame(x = 1:32))' does NOT throw an error, or 
even a warning, even though the supplied 'newdata' is clearly inappropriate for 
this model. This is not the way 'predict' behaves if the model has plain names:

> nicecars.lm <- lm(mpg ~ factor(cyl) + disp, data = mtcars)

> predict(nicecars.lm, newdata = data.frame(x = 1:32))
Error in factor(cyl) : object 'cyl' not found

I still suggest that predict(..., newdata = ...) should *always* return an 
error when newdata does not provide new values of the model data.
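
To make the suggestion concrete, here is a rough sketch of the kind of check I 
have in mind. It is a heuristic under stated assumptions, not a proposed patch: 
'check_newdata' is a hypothetical helper that is not part of R, and it simply 
requires every predictor expression in the model to mention at least one name 
that newdata supplies.

```r
# Hypothetical helper (not part of base R): stop unless every predictor
# expression in the model refers to at least one name found in 'newdata'.
check_newdata <- function(object, newdata) {
  tt <- delete.response(terms(object))
  exprs <- as.list(attr(tt, "variables"))[-1L]  # drop the leading `list`
  found <- vapply(exprs,
                  function(e) any(all.vars(e) %in% names(newdata)),
                  logical(1))
  if (!all(found)) {
    bad <- vapply(exprs[!found],
                  function(e) paste(deparse(e), collapse = ""),
                  character(1))
    stop("'newdata' supplies no variables for: ",
         paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

cars.lm <- lm(mtcars[[1]] ~ factor(mtcars$cyl) + mtcars[["disp"]])
nicecars.lm <- lm(mpg ~ factor(cyl) + disp, data = mtcars)

# Both of these pass the check.
check_newdata(nicecars.lm, mtcars[1:5, ])
check_newdata(cars.lm, list(mtcars = mtcars[1:5, ]))

# Both of these are rejected; try() keeps the script running.
r1 <- try(check_newdata(cars.lm, data.frame(x = 1:32)), silent = TRUE)
r2 <- try(check_newdata(cars.lm, mtcars[1:5, ]), silent = TRUE)
```

A check along these lines would not depend on the number of rows, so it would 
also catch the case where newdata happens to have 32 rows. It would wrongly 
reject a predictor expression that legitimately contains no variables at all, 
so it is meant only to illustrate the idea.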

Russ
