On 29/02/2012 13:41, Duncan Murdoch wrote:
On 12-02-29 8:16 AM, R. Michael Weylandt wrote:
Factors are internally stored as integers (enums if you have used
other programming languages) with a special label set -- it's more
memory efficient than storing the whole string over and over.

That was one of the original justifications, but character vectors are
just as memory efficient these days.

No, not really. Character vectors (STRSXPs) store a pointer for each string entry, and factors store an integer. On most current systems pointers are twice the size of integers, so on a 64-bit system:

> a <- rep(letters[1:10], each = 1000)
> object.size(a)
80520 bytes
> object.size(as.factor(a))
41008 bytes
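
A rough way to see where that factor of two comes from is to compare the pointer size of the running R build with the 4-byte integer used for factor codes. This is only a sketch: the names n, chr, fac and size_ratio are made up for illustration, and the exact byte counts depend on the R version and platform.

```r
# Illustrative names only; byte counts vary by R version and platform.
n   <- 1e5
chr <- rep(letters[1:10], each = n / 10)   # character vector, 10 distinct values
fac <- factor(chr)                         # the same data stored as a factor

ptr_bytes <- .Machine$sizeof.pointer       # 8 on a 64-bit build of R
int_bytes <- 4L                            # R integers are 32-bit

stopifnot(typeof(fac) == "integer")        # a factor holds integer codes

size_chr <- as.numeric(object.size(chr))
size_fac <- as.numeric(object.size(fac))

# For a long vector the fixed overhead (header, levels, class attribute)
# is small, so the ratio should be close to ptr_bytes / int_bytes.
size_ratio <- size_chr / size_fac
print(c(ptr_bytes = ptr_bytes, int_bytes = int_bytes, ratio = size_ratio))
```

On a 32-bit build pointers and integers are the same size, so the ratio should come out close to 1 there.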


The other justifications are still valid: sometimes you have a vector
which only takes on a subset of the possible values it could take, and
when you tabulate it, you'd like to see those zero counts. You may also
want to control the display order, and a factor allows that.

For example:

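x <- c("a", "a", "b")
table(x)    # character vector: only the observed values, a = 2 and b = 1
x <- factor(x, levels=c("c", "b", "a"))
table(x)    # factor: levels in the given order, c = 0, b = 1, a = 2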

Duncan Murdoch

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
