On Feb 12, 2013, at 11:05 AM, Brian Lee Yung Rowe wrote:

> 
> I thought that the default was the way it was for performance reasons. For 
> large data.frames or repeated applications, using factors should be faster 
> for non-trivial strings.
> 
>> fs <- c('apple','peach','watermelon','spinach','persimmon','potato','kale')
>> n <- 1000000
>> 
>> a1 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), 
>> stringsAsFactors=TRUE)
>> a2 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), 
>> stringsAsFactors=FALSE)
>> 
>> fn <- function(i,x) x[x$f %in% c('kale','spinach'),]
>> system.time(z <- sapply(1:100, fn, a1))
>   user  system elapsed 
> 19.614   4.037  24.649 
>> system.time(z <- sapply(1:100, fn, a2))
>   user  system elapsed 
> 19.726   7.715  36.761 
> 

Not really:

> system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed 
 13.780   0.444  14.229 
> rm(z)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  182113  9.8     407500   21.8    337655   18.1
Vcells 5789638 44.2  133982285 1022.3 163019778 1243.8
> system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed 
 13.201   0.668  13.873 


But your test is bogus, because %in% uses match(), which converts factors to 
character vectors anyway. In your case you're just measuring noise in your 
system; if anything, character vectors are always faster in your example.
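
To make that concrete, here is a small sketch (not part of the original 
benchmark; the names keep, by_string and by_code are made up for illustration). 
It prints the definition of %in% so the match() call is visible, and shows that 
a filter only exploits the factor representation if it compares the integer 
codes rather than the labels:

```r
# Illustrative sketch; variable names are assumptions, not from the original test.
fs <- c("apple", "peach", "watermelon", "spinach", "persimmon", "potato", "kale")
set.seed(1)
f <- factor(sample(fs, 1000, replace = TRUE), levels = fs)
keep <- c("kale", "spinach")

# %in% is defined in terms of match(); printing it shows the definition.
print(`%in%`)

# Label-based filter: per ?match, the factor is converted to character first.
by_string <- f %in% keep

# Code-based filter: look up the level positions once, then compare integers.
keep_codes <- match(keep, levels(f))
by_code <- as.integer(f) %in% keep_codes

# Both filters select the same elements.
stopifnot(identical(by_string, by_code))
```

Whether the code-based variant is actually faster would have to be measured on 
the machine in question; the sketch only shows that the two are equivalent.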

The reason is that in R strings are hashed, so character vectors are technically 
very similar to factors, just with faster access (because they don't need to go 
through the integer indirection). On 32-bit platforms strings are in theory 
always faster than factors. On 64-bit platforms they use double the size 
(8-byte pointers versus 4-byte integer codes), so they may or may not be faster 
depending on how you hit the cache etc. Anyway, in modern R versions you're 
much better off using character vectors than factors for any processing, so 
stringsAsFactors=FALSE is what I use exclusively.
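
The size difference can be checked roughly with object.size(). This is only a 
sketch under assumed settings (the vector length and the names chr_bytes and 
fac_bytes are arbitrary), and the numbers depend on the platform and R build:

```r
# Rough size comparison; vector length and names are illustrative assumptions.
fs <- c("apple", "peach", "watermelon", "spinach", "persimmon", "potato", "kale")
set.seed(1)
x <- sample(fs, 1e5, replace = TRUE)   # character vector
f <- factor(x)                         # same data as integer codes plus levels

chr_bytes <- as.numeric(object.size(x))
fac_bytes <- as.numeric(object.size(f))

# sizeof.pointer is 8 on a 64-bit build and 4 on a 32-bit build.
cat("pointer size (bytes):", .Machine$sizeof.pointer, "\n")
cat("character vector:    ", chr_bytes, "bytes\n")
cat("factor:              ", fac_bytes, "bytes\n")
```

Memory footprint is only one side of the trade-off; access through the factor 
still pays for the extra lookup from integer code to level.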

Cheers,
Simon

> 
> On Feb 12, 2013, at 10:40 AM, Ben Bolker <bbol...@gmail.com> wrote:
>> 
>> Thanks, Uwe.
>> Now let me go one step farther.
>> 
>> Can you (or anyone) give a good argument **other than backward
>> compatibility** for keeping the stringsAsFactors=TRUE argument on
>> data.frame()?
>> 
>> I appreciate your distinction between data.frame() and read.table()'s
>> use of stringsAsFactors, and I can see that there is some point for
>> quick-and-dirty interactive use in setting all non-numeric variables to
>> factors (arguing that wanting non-numerics as factors is somewhat more
>> common than wanting them as strings).
>> 
>> It might be nice to add an optional stringsAsFactors (and check.names)
>> argument to transform(): I've had to write my own Transform() function
>> to allow the defaults to be overridden, since transform() calls
>> data.frame() with the defaults.  (Setting the stringsAsFactors option
>> globally would work, although not for check.names.)
>> 
>> Ben Bolker
>> 
>>> 
>>>> 
>>>>> What I will likely do is
>>>>> make a few changes so that character vectors are automatically changed
>>>>> to factors in modelling functions, so that operating with
>>>>> stringsAsFactors=FALSE doesn't trigger silly warnings.
>>>>> 
>>>>> Duncan Murdoch
>>>>> 
>>>> 
>>>> [apologies for snipping context: "gmane made me do it"]
>>>> 
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>> 
