Re: [Rd] stringsAsFactors

Nicholas Crookston Wed, 13 Feb 2013 07:46:04 -0800

I would like to see the default for stringsAsFactors set to FALSE. As for
posting descriptions of all the horrors, let's also post descriptions of the
ensuing increases in productivity that many will see once they no longer have
to go back and fix code that fails because they forgot to add as.is=TRUE
or stringsAsFactors=FALSE.  I have often spent a lot of time fixing
"reusable code" that was developed using data coded as numeric (and
therefore not automatically converted to factors) and then applied to new
data containing character columns, which were automatically converted.
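
A minimal sketch of this kind of failure. The helper relabel() and the toy
data frames are hypothetical, made up purely for illustration, and
stringsAsFactors is passed explicitly so the result does not depend on the
session default:

```r
# Hypothetical helper, written and tested against numeric codes:
# replace one code value with another in the 'code' column.
relabel <- function(df, from, to) {
  df$code[df$code == from] <- to
  df
}

# Numeric codes: behaves as intended.
num <- data.frame(code = c(1, 2, 3))
relabel(num, 3, 30)$code

# Character codes kept as character: still behaves as intended.
chr <- data.frame(code = c("A", "B", "C"), stringsAsFactors = FALSE)
relabel(chr, "C", "Z")$code

# Character codes converted to a factor: "Z" is not an existing level,
# so the assignment yields NA (normally with an "invalid factor level"
# warning, suppressed here to keep the output short).
fct <- data.frame(code = c("A", "B", "C"), stringsAsFactors = TRUE)
res <- suppressWarnings(relabel(fct, "C", "Z"))
is.na(res$code[3])
```

The same function body gives a usable result for numeric and character
columns and a missing value for the factor column, which is the sort of
after-the-fact repair described above.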


In short, I think the default is causing problems that will go away once it
is changed. Let's report those cases as well!

Duncan's judgement on these issues has been quite wise in the past. I'm
going to take him up on his suggestion and change my default. Let's see!

Nick Crookston




On Wed, Feb 13, 2013 at 5:39 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> On 13-02-13 8:30 AM, Milan Bouchet-Valat wrote:
>
>> On Wednesday 13 February 2013 at 12:33 +0000, Michael Dewey wrote:
>>
>>> At 18:01 11/02/2013, Ista Zahn wrote:
>>>
>>>> FWIW my view is that for data cleaning and organizing factors just get
>>>> in the way. For modeling I like them because they make it easier to
>>>> understand what is happening. For example I can look at the levels()
>>>> to see what the reference group will be. With characters one has to
>>>> know a) that levels are created in alphabetical order and b) the
>>>> alphabetical order of the unique values in the character vector.
>>>> Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>>>> convert to factors before modeling (I also use factors to change the
>>>> order in which things are displayed in tables and graphs, another
>>>> place where converting to factors myself is useful but creating
>>>> them in alphabetical order by default is not).
>>>>
>>>> All this is to say that I would like options(stringsAsFactors=FALSE) to
>>>> be the default, but I like the warning about converting to factors in
>>>> modeling functions because it reminds me that I forgot to convert them,
>>>> which I like to do anyway...
>>>>
>>>
>>> I seem to be one of the few people who find the current default
>>> helpful. When I read in a dataset I am nearly always going to follow
>>> it with one or more of the modelling functions and so I do want to
>>> treat the categorical variables as factors. I cannot off-hand think
>>> of an example where I have had to convert them to characters.
>>>
>> If the changes to modeling functions that are discussed in this thread
>> can finally be applied (i.e. a solution is found), characters would be
>> converted to factors automatically, so you would not notice the
>> difference. And if you need to change the order of levels of your
>> factors, calling factor(myVar, levels=c(...)) is the same, be myVar a
>> character or a factor.
>>
>
> I think most of the changes *have* been applied.  Please try R-devel, and
> point out problems.
>
> The only change that I would like to apply but haven't (and probably
> won't) is to change the default for stringsAsFactors to FALSE.  Users who
> think that is a bad idea can bolster their cases by setting
> options(stringsAsFactors=FALSE), and posting descriptions of all the
> horrors that ensue.
>
> Duncan Murdoch
>
>
>
>>> Incidentally xkcd has, while this discussion has been going on,
>>> posted something relevant
>>> http://www.xkcd.com/1172/
>>>
>> Truly hilarious, indeed. But beware, it sounds like an argument in favor
>> of the change, while you are lobbying against it. :-p
>>
>>
>> Regards
>>
>>
>>
>>
>>>
>>>> Best,
>>>> Ista
>>>>
>>>> On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
>>>> <murdoch.dun...@gmail.com> wrote:
>>>>
>>>>> On 11/02/2013 12:13 PM, William Dunlap wrote:
>>>>>
>>>>>>
>>>>>> Note that changing this does not just mean getting rid of "silly
>>>>>> warnings". Currently, predict.lm() can give wrong answers when
>>>>>> stringsAsFactors is FALSE.
>>>>>>
>>>>>>     > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
>>>>>>     +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
>>>>>>     > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>>>>>>     Warning message:
>>>>>>     In model.matrix.default(mt, mf, contrasts) :
>>>>>>       variable 'f' converted to a factor
>>>>>>     > predict(fit_ab, newdata=d)
>>>>>>      1  2  3  4  5  6  7  8  9 10
>>>>>>      1  2  3  4 25 26 27  8  9 10
>>>>>>     Warning messages:
>>>>>>     1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>>>>>>       variable 'f' converted to a factor
>>>>>>     2: In predict.lm(fit_ab, newdata = d) :
>>>>>>       prediction from a rank-deficient fit may be misleading
>>>>>>
>>>>>> fit_ab is not rank-deficient and the predict should report
>>>>>>      1  2  3  4 NA NA NA 28 29 30
>>>>>>
>>>>>
>>>>>
>>>>> In R-devel, the two warnings about factor conversions are no longer
>>>>> given, but the predictions are the same and the warning about rank
>>>>> deficiency still shows up.  If f is set to be a factor, an error is
>>>>> generated:
>>>>>
>>>>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
>>>>>    factor f has new levels B
>>>>>
>>>>> I think both the warning and error are somewhat reasonable responses.
>>>>> The fit is rank deficient relative to the model that includes
>>>>> f == "B", because the column of the design matrix corresponding to
>>>>> f level B would be completely zero.  In this particular model, we
>>>>> could still do predictions for the other levels, but it also seems
>>>>> reasonable to quit, given that clearly something has gone wrong.
>>>>>
>>>>> I do think that it's unfortunate that we don't get the same result in
>>>>> both cases, and I'd like to have gotten the predictions you suggested,
>>>>> but I don't think that's going to happen.  The reason for the
>>>>> difference is that the subsetting is done before the conversion to a
>>>>> factor, but I think that is unavoidable without really big changes.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Bill Dunlap
>>>>>> Spotfire, TIBCO Software
>>>>>> wdunlap tibco.com
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: r-devel-boun...@r-project.org
>>>>>>> [mailto:r-devel-bounces@r-project.org] On Behalf Of Terry Therneau
>>>>>>> Sent: Monday, February 11, 2013 5:50 AM
>>>>>>> To: r-devel@r-project.org; Duncan Murdoch
>>>>>>> Subject: Re: [Rd] stringsAsFactors
>>>>>>>
>>>>>>> I think your idea to remove the warnings is excellent, and a good
>>>>>>> compromise.  Characters already work fine in modeling functions
>>>>>>> except for the silly warning.
>>>>>>>
>>>>>>> It is interesting how often the defaults for a program reflect the
>>>>>>> data sets in use at the time the defaults were chosen.  There are
>>>>>>> some such in my own survival package whose proper value is no longer
>>>>>>> as "obvious" as it was when I chose them.  Factors are very handy for
>>>>>>> variables which have only a few levels and will be used in modeling.
>>>>>>> Every character variable of every dataset in "Statistical Models in
>>>>>>> S", which introduced factors, is of this type so auto-transformation
>>>>>>> made a lot of sense.  The "solder" data set there is one for which
>>>>>>> Helmert contrasts are proper so guess what the default contrast
>>>>>>> option was?  (I think there are only a few data sets in the world for
>>>>>>> which Helmert makes sense, however, and R eventually changed the
>>>>>>> default.)
>>>>>>>
>>>>>>> For character variables that should not be factors, such as a street
>>>>>>> address, stringsAsFactors can be a real PITA, and I expect that
>>>>>>> people's preference for the option depends almost entirely on how
>>>>>>> often these arise in their own work.  As long as there is an option
>>>>>>> that can be overridden I'm okay.  Yes, I'd prefer FALSE as the
>>>>>>> default, partly because the current value is a tripwire in the
>>>>>>> hallway that eventually catches every new user.
>>>>>>>
>>>>>>> Terry Therneau
>>>>>>>
>>>>>>> On 02/11/2013 05:00 AM, r-devel-requ...@r-project.org wrote:
>>>>>>>
>>>>>>>> Both of these were discussed by R Core.  I think it's unlikely the
>>>>>>>> default for stringsAsFactors will be changed (some R Core members
>>>>>>>> like the current behaviour), but it's fairly likely the
>>>>>>>> show.signif.stars default will change.  (That's if someone gets
>>>>>>>> around to it:  I personally don't care about that one.  P-values
>>>>>>>> are commonly used statistics, and the stars are just a simple
>>>>>>>> graphical display of them.  I find some p-values to be useful, and
>>>>>>>> the display to be harmless.)
>>>>>>>>
>>>>>>>> I think it's really unlikely the more extreme changes (i.e. dropping
>>>>>>>> show.signif.stars completely, or dropping p-values) will happen.
>>>>>>>>
>>>>>>>> Regarding stringsAsFactors:  I'm not going to defend keeping it as
>>>>>>>> is, I'll let the people who like it defend it.  What I will likely
>>>>>>>> do is make a few changes so that character vectors are
>>>>>>>> automatically changed to factors in modelling functions, so that
>>>>>>>> operating with stringsAsFactors=FALSE doesn't trigger silly
>>>>>>>> warnings.
>>>>>>>>
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-devel@r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> Michael Dewey
>>> i...@aghmed.fsnet.co.uk
>>> http://www.aghmed.fsnet.co.uk/home.html
>>>
>>
>>


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel