On Wed, Feb 13, 2013 at 7:33 AM, Michael Dewey <i...@aghmed.fsnet.co.uk> wrote:
> At 18:01 11/02/2013, Ista Zahn wrote:
>>
>> FWIW my view is that for data cleaning and organizing factors just get
>> in the way. For modeling I like them because they make it easier to
>> understand what is happening. For example I can look at the levels()
>> to see what the reference group will be. With characters one has to
>> know a) that levels are created in alphabetical order and b) the
>> alphabetical order of the unique values in the character vector.
>> Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>> convert to factors before modeling (I also use factors to change the
>> order in which things are displayed in tables and graphs, another
>> place where converting to factors myself is useful but creating
>> them in alphabetical order by default is not)
>>
>> All this is to say that I would like options(stringsAsFactors=FALSE) to
>> be the default, but I like the warning about converting to factors in
>> modeling functions because it reminds me that I forgot to convert them,
>> which I like to do anyway...
>
>
> I seem to be one of the few people who find the current default helpful.
> When I read in a dataset I am nearly always going to follow it with one or
> more of the modelling functions and so I do want to treat the categorical
> variables as factors. I cannot off-hand think of an example where I have had
> to convert them to characters.
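
A minimal sketch of the levels() point quoted above, assuming the default treatment contrasts (where the first level is the reference group). The data and the names arm, f_default, f_explicit and f_releveled are made up for illustration.

```r
# Hypothetical data; the variable names are illustrative only.
arm <- c("treated", "control", "placebo", "treated")

# Default: levels are the sorted unique values, so "control" comes first
# and would be the reference group under treatment contrasts.
f_default <- factor(arm)
levels(f_default)

# Explicit: choose the order (and hence the reference group) yourself.
f_explicit <- factor(arm, levels = c("placebo", "control", "treated"))
levels(f_explicit)

# relevel() moves one level to the front and keeps the others in order.
f_releveled <- relevel(f_default, ref = "placebo")
levels(f_releveled)
```

Converting explicitly, as described in the quoted message, puts the choice of reference group in the code rather than leaving it to the alphabet.
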

Your data must reach you in a much better state than mine reaches me. I
spend most of my time organizing, combining, fixing typos, reshaping,
merging and so on. Then I see the dreaded warning

"In `[<-.factor`(`*tmp*`, 6, value = "z") :
  invalid factor level, NAs generated"

which reminds me that I've forgotten to set stringsAsFactors=FALSE.

However, I'm not saying I don't like factors. Once the data is cleaned
up they are very useful. But often I find that when I'm trying to
clean up a messy data set they just get in the way. And since that is
what I spend most of my time doing, factors get in the way most of the
time for me.

>
> Incidentally xkcd has, while this discussion has been going on, posted
> something relevant
> http://www.xkcd.com/1172/
>
>
>
>
>> Best,
>> Ista
>>
>> On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
>> <murdoch.dun...@gmail.com> wrote:
>> > On 11/02/2013 12:13 PM, William Dunlap wrote:
>> >>
>> >> Note that changing this does not just mean getting rid of "silly
>> >> warnings".
>> >> Currently, predict.lm() can give wrong answers when stringsAsFactors is
>> >> FALSE.
>> >>
>> >> > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17, 28.1,28.8,30.1))
>> >> > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>> >> Warning message:
>> >> In model.matrix.default(mt, mf, contrasts) :
>> >>   variable 'f' converted to a factor
>> >> > predict(fit_ab, newdata=d)
>> >>  1  2  3  4  5  6  7  8  9 10
>> >>  1  2  3  4 25 26 27  8  9 10
>> >> Warning messages:
>> >> 1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>> >>   variable 'f' converted to a factor
>> >> 2: In predict.lm(fit_ab, newdata = d) :
>> >>   prediction from a rank-deficient fit may be misleading
>> >>
>> >> fit_ab is not rank-deficient and the predict should report
>> >>  1  2  3  4 NA NA NA 28 29 30
>> >
>> >
>> > In R-devel, the two warnings about factor conversions are no longer
>> > given, but the predictions are the same and the warning about rank
>> > deficiency still shows up. If f is set to be a factor, an error is
>> > generated:
>> >
>> > Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
>> >   factor f has new levels B
>> >
>> > I think both the warning and error are somewhat reasonable responses.
>> > The fit is rank deficient relative to the model that includes
>> > f == "B", because the column of the design matrix corresponding to f
>> > level B would be completely zero. In this particular model, we could
>> > still do predictions for the other levels, but it also seems
>> > reasonable to quit, given that clearly something has gone wrong.
>> >
>> > I do think that it's unfortunate that we don't get the same result in
>> > both cases, and I'd like to have gotten the predictions you suggested,
>> > but I don't think that's going to happen. The reason for the
>> > difference is that the subsetting is done before the conversion to a
>> > factor, but I think that is unavoidable without really big changes.
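
One possible user-side workaround for the example quoted above, sketched under assumptions: convert f to a factor explicitly before fitting, then recode the new data with only the levels seen in the fit, so that rows with an unseen level become NA instead of raising an error or being silently miscoded. This is not a change to predict.lm(); the names train, newd and pred are introduced here for illustration.

```r
# Same data as in the quoted example; stringsAsFactors is passed
# explicitly so the sketch does not depend on the session default.
d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                stringsAsFactors = FALSE)

# Subset first, then convert: the factor only has the levels "A" and "C".
train <- d[d$f != "B", ]
train$f <- factor(train$f)
fit_ab <- lm(y ~ x + f, data = train)

# Recode the new data with the training levels; "B" has no matching
# level, so factor() turns those entries into NA.
newd <- d
newd$f <- factor(newd$f, levels = levels(train$f))

# predict.lm() uses na.action = na.pass by default, so rows with a
# missing predictor should come back as NA predictions.
pred <- predict(fit_ab, newdata = newd)
pred
```

With this recoding the rows where f is "B" should be reported as NA, which is the shape of result suggested in the quoted message. The cost is that the level bookkeeping is done by hand.
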
>> >
>> > Duncan Murdoch
>> >
>> >
>> >
>> >>
>> >> Bill Dunlap
>> >> Spotfire, TIBCO Software
>> >> wdunlap tibco.com
>> >>
>> >> > -----Original Message-----
>> >> > From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Terry Therneau
>> >> > Sent: Monday, February 11, 2013 5:50 AM
>> >> > To: r-devel@r-project.org; Duncan Murdoch
>> >> > Subject: Re: [Rd] stringsAsFactors
>> >> >
>> >> > I think your idea to remove the warnings is excellent, and a good
>> >> > compromise. Characters already work fine in modeling functions
>> >> > except for the silly warning.
>> >> >
>> >> > It is interesting how often the defaults for a program reflect the
>> >> > data sets in use at the time the defaults were chosen. There are
>> >> > some such in my own survival package whose proper value is no
>> >> > longer as "obvious" as it was when I chose them. Factors are very
>> >> > handy for variables which have only a few levels and will be used
>> >> > in modeling. Every character variable of every dataset in
>> >> > "Statistical Models in S", which introduced factors, is of this
>> >> > type so auto-transformation made a lot of sense. The "solder" data
>> >> > set there is one for which Helmert contrasts are proper so guess
>> >> > what the default contrast option was? (I think there are only a
>> >> > few data sets in the world for which Helmert makes sense, however,
>> >> > and R eventually changed the default.)
>> >> >
>> >> > For character variables that should not be factors, such as a
>> >> > street address, stringsAsFactors can be a real PITA, and I expect
>> >> > that people's preference for the option depends almost entirely on
>> >> > how often these arise in their own work. As long as there is an
>> >> > option that can be overridden I'm okay.
>> >> > Yes, I'd prefer FALSE as the default, partly because the current
>> >> > value is a tripwire in the hallway that eventually catches every
>> >> > new user.
>> >> >
>> >> > Terry Therneau
>> >> >
>> >> > On 02/11/2013 05:00 AM, r-devel-requ...@r-project.org wrote:
>> >> > > Both of these were discussed by R Core. I think it's unlikely the
>> >> > > default for stringsAsFactors will be changed (some R Core members
>> >> > > like the current behaviour), but it's fairly likely the
>> >> > > show.signif.stars default will change. (That's if someone gets
>> >> > > around to it: I personally don't care about that one. P-values
>> >> > > are commonly used statistics, and the stars are just a simple
>> >> > > graphical display of them. I find some p-values to be useful,
>> >> > > and the display to be harmless.)
>> >> > >
>> >> > > I think it's really unlikely the more extreme changes (i.e.
>> >> > > dropping show.signif.stars completely, or dropping p-values)
>> >> > > will happen.
>> >> > >
>> >> > > Regarding stringsAsFactors: I'm not going to defend keeping it
>> >> > > as is, I'll let the people who like it defend it. What I will
>> >> > > likely do is make a few changes so that character vectors are
>> >> > > automatically changed to factors in modelling functions, so that
>> >> > > operating with stringsAsFactors=FALSE doesn't trigger silly
>> >> > > warnings.
>> >> >
>> >> > ______________________________________________
>> >> > R-devel@r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
> Michael Dewey
> i...@aghmed.fsnet.co.uk
> http://www.aghmed.fsnet.co.uk/home.html
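
For readers who have not met the "invalid factor level" warning mentioned in the reply at the top of this message, here is a small sketch of how it arises during data cleaning and how keeping the column as character avoids it. The data frames messy and clean and the helper variable msg are hypothetical; the exact wording of the warning may differ between R versions, so the code only captures it rather than assuming the text.

```r
# Column read in as a factor: its levels are fixed at "x" and "y".
messy <- data.frame(id = 1:3, grp = c("x", "y", "x"),
                    stringsAsFactors = TRUE)

# Fixing a "typo" with a value that is not an existing level triggers
# the warning and stores NA instead. The handler records the message
# and muffles the warning so the script keeps running.
msg <- NULL
withCallingHandlers(
  {
    messy$grp[2] <- "z"
  },
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
msg
is.na(messy$grp[2])

# Same edit on a character column: no warning, and the value is kept.
clean <- data.frame(id = 1:3, grp = c("x", "y", "x"),
                    stringsAsFactors = FALSE)
clean$grp[2] <- "z"

# Convert to a factor only once the cleaning is done.
clean$grp <- factor(clean$grp)
levels(clean$grp)
```

This mirrors the workflow described earlier in the thread: clean with character vectors, then convert to factors explicitly before modeling.
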