Re: [Rd] stringsAsFactors

Duncan Murdoch Mon, 11 Feb 2013 12:17:59 -0800

On 11/02/2013 2:34 PM, Terry Therneau wrote:

The root of this problem is that the .getXlevels function does not return the levels
for character variables.

Thanks, that looks easy to fix (not by changing .getXlevels, but by having model.frame
convert the character variables, instead of waiting for model.matrix).


Duncan

Future predictions depend on that information.
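
As a minimal sketch of the level information a fit keeps for later predictions, the
example below reuses the data from the quoted thread but makes f an explicit factor,
since the handling of character columns depends on the R version in use. The names
stored_levels and recomputed are illustrative only.

```r
# Sketch: which factor levels does a fit remember for later predictions?
d <- data.frame(
  x = 1:10,
  f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),  # explicit factor, not character
  y = c(1:4, 15:17, 28.1, 28.8, 30.1)
)

# Fit without the "B" rows; lm() drops the unused level before fitting.
fit_ab <- lm(y ~ x + f, data = d, subset = f != "B")

# Levels stored on the fitted object.
stored_levels <- fit_ab$xlevels
print(stored_levels)

# The same information recomputed from the model frame kept on the fit.
recomputed <- .getXlevels(terms(fit_ab), model.frame(fit_ab))
print(recomputed)
```

Here only the levels "A" and "C" are recorded, which is what predict() later checks
new data against.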

On 02/11/2013 11:50 AM, Duncan Murdoch wrote:
> On 11/02/2013 12:13 PM, William Dunlap wrote:
>> Note that changing this does not just mean getting rid of "silly warnings".
>> Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.
>>
>> > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17, 28.1,28.8,30.1))
>> > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>>    Warning message:
>>    In model.matrix.default(mt, mf, contrasts) :
>>      variable 'f' converted to a factor
>> > predict(fit_ab, newdata=d)
>>     1  2  3  4  5  6  7  8  9 10
>>     1  2  3  4 25 26 27  8  9 10
>>    Warning messages:
>>    1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>>      variable 'f' converted to a factor
>>    2: In predict.lm(fit_ab, newdata = d) :
>>      prediction from a rank-deficient fit may be misleading
>>
>> fit_ab is not rank-deficient and the predict should report
>>     1 2 3 4 NA NA NA 28 29 30
>
> In R-devel, the two warnings about factor conversions are no longer given, but the
> predictions are the same and the warning about rank deficiency still shows up.  If f is
> set to be a factor, an error is generated:
>
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> object$xlevels) :
>   factor f has new levels B
>
> I think both the warning and error are somewhat reasonable responses.  The fit is rank
> deficient relative to the model that includes f == "B",  because the column of the
> design matrix corresponding to f level B would be completely zero.  In this particular
> model, we could still do predictions for the other levels, but it also seems reasonable
> to quit, given that clearly something has gone wrong.
>
> I do think that it's unfortunate that we don't get the same result in both cases, and
> I'd like to have gotten the predictions you suggested, but I don't think that's going to
> happen.  The reason for the difference is that the subsetting is done before the
> conversion to a factor, but I think that is unavoidable without really big changes.
>
> Duncan Murdoch
>
>
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>> > -----Original Message-----
>> > From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf
>> > Of Terry Therneau
>> > Sent: Monday, February 11, 2013 5:50 AM
>> > To: r-devel@r-project.org; Duncan Murdoch
>> > Subject: Re: [Rd] stringsAsFactors
>> >
>> > I think your idea to remove the warnings is excellent, and a good compromise.  Characters
>> > already work fine in modeling functions except for the silly warning.
>> >
>> > It is interesting how often the defaults for a program reflect the data sets in use at the
>> > time the defaults were chosen.  There are some such in my own survival package whose
>> > proper value is no longer as "obvious" as it was when I chose them.  Factors are very
>> > handy for variables which have only a few levels and will be used in modeling.  Every
>> > character variable of every dataset in "Statistical Models in S", which introduced
>> > factors, is of this type so auto-transformation made a lot of sense.  The "solder" data
>> > set there is one for which Helmert contrasts are proper so guess what the default contrast
>> > option was?  (I think there are only a few data sets in the world for which Helmert makes
>> > sense, however, and R eventually changed the default.)
>> >
>> > For character variables that should not be factors, such as a street address,
>> > stringsAsFactors can be a real PITA, and I expect that people's preference for the option
>> > depends almost entirely on how often these arise in their own work.  As long as there is
>> > an option that can be overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
>> > because the current value is a tripwire in the hallway that eventually catches every new
>> > user.
>> >
>> > Terry Therneau
>> >
>> > On 02/11/2013 05:00 AM, r-devel-requ...@r-project.org wrote:
>> > > Both of these were discussed by R Core.  I think it's unlikely the
>> > > default for stringsAsFactors will be changed (some R Core members like
>> > > the current behaviour), but it's fairly likely the show.signif.stars
>> > > default will change.  (That's if someone gets around to it:  I
>> > > personally don't care about that one.  P-values are commonly used
>> > > statistics, and the stars are just a simple graphical display of them.
>> > > I find some p-values to be useful, and the display to be harmless.)
>> > >
>> > > I think it's really unlikely the more extreme changes (i.e. dropping
>> > > show.signif.stars completely, or dropping p-values) will happen.
>> > >
>> > > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
>> > > I'll let the people who like it defend it.  What I will likely do is
>> > > make a few changes so that character vectors are automatically changed
>> > > to factors in modelling functions, so that operating with
>> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
>> >
>> > ______________________________________________
>> > R-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
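
For the subset example quoted above, a hedged sketch of two points from the discussion:
the all-zero design-matrix column for level B, and one possible way to get NA for rows
whose level was not seen at fit time. f is made an explicit factor here, and
predict_seen_levels is a hypothetical helper written for this illustration, not part
of R.

```r
d <- data.frame(
  x = 1:10,
  f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),
  y = c(1:4, 15:17, 28.1, 28.8, 30.1)
)

# Design matrix on the subset while the factor still carries level "B".
mm <- model.matrix(~ x + f, data = d[d$f != "B", ])
zero_col <- all(mm[, "fB"] == 0)  # TRUE when the "B" column is entirely zero
mm_rank <- qr(mm)$rank            # compare with ncol(mm)

# lm() drops the unused level, so the fit itself has full rank.
fit_ab <- lm(y ~ x + f, data = d, subset = f != "B")

# Hypothetical helper: predict only for rows whose level of `var` was seen
# when fitting, and return NA for every other row.
predict_seen_levels <- function(fit, newdata, var) {
  seen <- as.character(newdata[[var]]) %in% fit$xlevels[[var]]
  out <- rep(NA_real_, nrow(newdata))
  out[seen] <- predict(fit, newdata = droplevels(newdata[seen, , drop = FALSE]))
  out
}

pred <- predict_seen_levels(fit_ab, d, "f")
print(pred)
```

Up to floating-point rounding, pred should match the 1 2 3 4 NA NA NA 28 29 30 suggested
in the quoted message. Calling predict(fit_ab, newdata = d) directly with a factor f
stops with the "new levels" error instead.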


