Re: [Rd] stringsAsFactors

Duncan Murdoch Mon, 11 Feb 2013 09:51:18 -0800

On 11/02/2013 12:13 PM, William Dunlap wrote:

Note that changing this does not just mean getting rid of "silly warnings".
Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.


   > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17, 28.1,28.8,30.1))
   > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
   Warning message:
   In model.matrix.default(mt, mf, contrasts) :
     variable 'f' converted to a factor
   > predict(fit_ab, newdata=d)
    1  2  3  4  5  6  7  8  9 10
    1  2  3  4 25 26 27  8  9 10
   Warning messages:
   1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
     variable 'f' converted to a factor
   2: In predict.lm(fit_ab, newdata = d) :
     prediction from a rank-deficient fit may be misleading

fit_ab is not rank-deficient, and predict() should report
    1 2 3 4 NA NA NA 28 29 30

In R-devel, the two warnings about factor conversions are no longer given, but the predictions are the same and the warning about rank deficiency still shows up. If f is set to be a factor, an error is generated:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor f has new levels B

I think both the warning and error are somewhat reasonable responses. The fit is rank deficient relative to the model that includes f == "B", because the column of the design matrix corresponding to f level B would be completely zero. In this particular model, we could still do predictions for the other levels, but it also seems reasonable to quit, given that clearly something has gone wrong.
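To make the rank-deficiency point concrete, here is a minimal sketch (not code from the thread). It builds the design matrix with all three levels of f and keeps only the rows used in the fit; the names `keep` and `X` are just illustrative.

```r
# Same data as in the example above, with f stored as a factor.
d <- data.frame(x = 1:10,
                f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1))

# Rows that survive subset = f != "B".
keep <- d$f != "B"

# Design matrix for the full three-level model, restricted to the fitted rows.
# Columns are (Intercept), x, fB, fC.
X <- model.matrix(~ x + f, data = d)[keep, ]

# Count of non-zero entries per column: the fB column has none.
colSums(X != 0)

# Numerical rank of X compared with its number of columns.
qr(X)$rank
ncol(X)
```

With the fB column identically zero, the rank comes out one less than the number of columns, which is the sense in which the fit is rank deficient relative to the three-level model.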

I do think that it's unfortunate that we don't get the same result in both cases, and I'd like to have gotten the predictions you suggested, but I don't think that's going to happen. The reason for the difference is that the subsetting is done before the conversion to a factor, but I think that is unavoidable without really big changes.
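For anyone who wants NA predictions for levels the fit never saw, one possible user-level workaround is sketched below. It is only an assumption about how one might do this, not something predict.lm() does itself: f is converted to a factor up front, and the levels missing from the fitted model frame are set to NA in a copy of the new data (`seen` and `nd` are illustrative names).

```r
d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                stringsAsFactors = FALSE)

# Convert explicitly, so the result does not depend on how the modelling
# code treats character columns.
d$f <- factor(d$f)
fit_ab <- lm(y ~ x + f, data = d, subset = f != "B")

# Levels of f actually present in the rows used for fitting.
seen <- unique(as.character(model.frame(fit_ab)$f))

# Illustrative masking step: set levels the fit never saw to NA in a copy
# of the new data, so those rows are predicted as NA.
nd <- d
nd$f[!(nd$f %in% seen)] <- NA

pred <- predict(fit_ab, newdata = nd)
is.na(pred)
```

This relies on the default na.action of predict.lm() (na.pass), so rows with a missing f come back as NA rather than being dropped.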


Duncan Murdoch


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf
> Of Terry Therneau
> Sent: Monday, February 11, 2013 5:50 AM
> To: r-devel@r-project.org; Duncan Murdoch
> Subject: Re: [Rd] stringsAsFactors
>
> I think your idea to remove the warnings is excellent, and a good compromise.
> Characters already work fine in modeling functions except for the silly warning.
>
> It is interesting how often the defaults for a program reflect the data sets in use at the
> time the defaults were chosen.  There are some such in my own survival package whose
> proper value is no longer as "obvious" as it was when I chose them.  Factors are very
> handy for variables which have only a few levels and will be used in modeling.  Every
> character variable of every dataset in "Statistical Models in S", which introduced
> factors, is of this type so auto-transformation made a lot of sense.  The "solder" data
> set there is one for which Helmert contrasts are proper so guess what the default
> contrast option was?  (I think there are only a few data sets in the world for which
> Helmert makes sense, however, and R eventually changed the default.)
>
> For character variables that should not be factors, such as a street address,
> stringsAsFactors can be a real PITA, and I expect that people's preference for the option
> depends almost entirely on how often these arise in their own work.  As long as there is
> an option that can be overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
> because the current value is a tripwire in the hallway that eventually catches every new
> user.
>
> Terry Therneau
>
> On 02/11/2013 05:00 AM, r-devel-requ...@r-project.org wrote:
> > Both of these were discussed by R Core.  I think it's unlikely the
> > default for stringsAsFactors will be changed (some R Core members like
> > the current behaviour), but it's fairly likely the show.signif.stars
> > default will change.  (That's if someone gets around to it:  I
> > personally don't care about that one.  P-values are commonly used
> > statistics, and the stars are just a simple graphical display of them.
> > I find some p-values to be useful, and the display to be harmless.)
> >
> > I think it's really unlikely the more extreme changes (i.e. dropping
> > show.signif.stars completely, or dropping p-values) will happen.
> >
> > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
> > I'll let the people who like it defend it.  What I will likely do is
> > make a few changes so that character vectors are automatically changed
> > to factors in modelling functions, so that operating with
> > stringsAsFactors=FALSE doesn't trigger silly warnings.
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


