> A place where factors really are a pain is when the patient id is a character
> string. When, for instance, you subset the data to do an analysis of only
> the females, having the data set `remember' all of the male id's (the original
> levels) is non-productive in dozens of ways. For other variables factors
> work well and have some nice properties. In general, I've found in my work
> (medical research) that factors are beneficial for about 1/5 of the character
> variables, a PITA for 1/4, and a wash for the rest; so prefer to do any
> transformations myself.
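
For concreteness, here is a minimal sketch of the behaviour described above. The data frame, the id values and the variable names are all made up for illustration; only base R functions are used.

```r
# Hypothetical toy data: patient ids and sex, both stored as factors
patients <- data.frame(
  id  = factor(c("p01", "p02", "p03", "p04")),
  sex = factor(c("F", "M", "F", "M"))
)

# Subset to the females only
females <- patients[patients$sex == "F", ]

# The subset still "remembers" every original id, including the males
kept_levels <- levels(females$id)
print(kept_levels)        # "p01" "p02" "p03" "p04"
print(table(females$id))  # the male ids appear with a count of 0

# Re-creating the factor drops the unused levels
# (recent versions of R also provide droplevels() for this)
females$id <- factor(females$id)
dropped_levels <- levels(females$id)
print(dropped_levels)     # "p01" "p03"

# Alternatively, keep ids as plain character strings from the start
char_ids <- as.character(patients$id)
```

Whether the extra levels help or hurt depends on the variable: for `sex` you may well want empty cells to show up in a table, whereas for `id` they are almost never wanted.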
It seems to me that the most important difference between factors and character vectors is that factors also store the range of the variable, i.e. the full set of levels. You could imagine doing something similar for continuous variables. This would have the interesting property that plots of subsets would have the same range as plots of the original data. I'd imagine, just as with factors, this would be useful and frustrating in equal parts.

In terms of which should be the default, I can imagine two arguments:

* keep to the original format of the data as closely as possible: character vectors should be the default

* maintain as much information about the original data as possible: factors should be the default.

Hadley