[R] splitting a factor column into binary columns for each factor

Chuck White Tue, 26 Jan 2010 12:15:37 -0800

Yesterday I posted the following question (my apologies for omitting the 
subject line):


=================question======================
Hello -- I would like to know of a more efficient way of writing the following 
piece of code. Thanks. 

options(stringsAsFactors=FALSE) 
orig <- c(rep('11111111', 100000), rep('22222222', 200000),
          rep('33333333', 300000), rep('44444444', 400000))
orig.unique <- unique(orig) 
system.time(df <- as.data.frame(sapply(orig.unique,
                                       function(x) ifelse(orig == x, 1, 0))))
============================================

I received a response via e-mail which was **extremely** useful.

=================answer======================
Using sapply instead of lapply here is wasteful.  sapply() calls lapply(), which 
returns a list; sapply() then turns that list into a matrix by making each list 
element a column of the matrix.  as.data.frame(matrix) then makes a list again 
from the columns of the matrix.

The one thing that sapply gives you and lapply doesn't is column names.  If you 
attach names to orig.unique then lapply's output will have them. 
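
A minimal sketch of the point about names, using a made-up four-element vector 
(not the original data) so it runs instantly.  The names m, cols and df are 
only for illustration:

```r
# Toy input (hypothetical, much smaller than the original problem)
orig <- c("a", "b", "b", "c")
orig.unique <- unique(orig)

# sapply() simplifies to a matrix and takes column names from the character input
m <- sapply(orig.unique, function(x) as.numeric(orig == x))
is.matrix(m)
colnames(m)

# lapply() on an unnamed vector returns an unnamed list ...
names(lapply(orig.unique, function(x) as.numeric(orig == x)))

# ... but naming the input first carries the names through to the result
names(orig.unique) <- orig.unique
cols <- lapply(orig.unique, function(x) as.numeric(orig == x))
names(cols)

# check.names = FALSE stops data.frame() from altering names such as "11111111"
df <- data.frame(check.names = FALSE, cols)
names(df)
```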
 
Also, ifelse(orig==x,1,0) is slower than the equivalent as.numeric(orig==x).  I 
wrote functions g0 (containing your code), g1 (using lapply), and g2 
(ifelse->as.numeric).  I parameterized them by the number of '11111111' elements, 
and they each return the data.frame created and the time it took to create it:

> g0
function(n = 1e+05) { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n),
            rep("33333333", 3*n), rep("44444444", 4*n))
  orig.unique <- unique(orig) 
  time <- system.time(df <- as.data.frame(sapply(orig.unique,
            function(x) ifelse(orig == x, 1, 0))))
  list(time = time, df = df) 
} 

> g1
function (n = 1e+05) { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n),
            rep("33333333", 3*n), rep("44444444", 4*n))
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique 
  time <- system.time(df <- data.frame(check.names=FALSE,
            lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))))
  list(time = time, df = df) 
}

> g2 
function (n = 1e+05) { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n),
            rep("33333333", 3*n), rep("44444444", 4*n))
  orig.unique <- unique(orig) 
  names(orig.unique) <- orig.unique 
  time <- system.time(df <- data.frame(check.names=FALSE,
            lapply(orig.unique, function(x) as.numeric(orig == x))))
  list(time = time, df = df) 
} 
 
For n=10^5 the times were 
> g0(1e5)$time 
   user  system elapsed 
  20.65    0.41   20.64 
> g1(1e5)$time 
   user  system elapsed 
   2.35    0.05    2.36 
> g2(1e5)$time 
   user  system elapsed 
   0.73    0.10    0.77 
and the data.frames each produced were identical. 
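
As a rough sketch of how such a check could be done on a small problem, the 
g0-style and g2-style constructions can be compared directly.  The input below 
and the names df0, df2 and named are made up for illustration:

```r
# Small, made-up input so the check runs instantly
orig <- rep(c("11111111", "22222222", "33333333"), times = c(2, 3, 4))
orig.unique <- unique(orig)

# g0-style: sapply() + ifelse(), then convert the resulting matrix
df0 <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))

# g2-style: named input, lapply() + as.numeric()
named <- orig.unique
names(named) <- named
df2 <- data.frame(check.names = FALSE,
                  lapply(named, function(x) as.numeric(orig == x)))

# all.equal() returns TRUE or a description of the differences it finds
all.equal(df0, df2)
```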
 
Another approach is to use outer() to make a matrix that gets passed to 
data.frame().  It seems slightly slower than g2, but small changes might make 
it faster.
 
> g3 
function (n = 1e+05) { 
    orig <- c(rep("11111111", n), rep("22222222", 2 * n),
              rep("33333333", 3 * n), rep("44444444", 4 * n))
    orig.unique <- unique(orig) 
    names(orig.unique) <- orig.unique 
    time <- system.time(df <- data.frame(check.names=FALSE,
              outer(orig, orig.unique, function(x, y) as.numeric(x==y))))
    list(time = time, df = df) 
}

> g3(1e5)$time 
   user  system elapsed 
   1.02    0.00    0.97 
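
One possible "small change" to the outer() approach, offered only as an 
untimed sketch: pass "==" to outer() directly and convert the logical matrix 
to numeric afterwards.  Whether this is actually faster than g3 or g2 would 
have to be measured; the toy input and the names m and df3 are assumptions 
for illustration.

```r
# Small, made-up input
orig <- rep(c("11111111", "22222222", "33333333"), times = c(2, 3, 4))
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique

# Compare every element of orig with every unique value in one vectorized call;
# the column names come from names(orig.unique)
m <- outer(orig, orig.unique, "==")

# Convert TRUE/FALSE to 1/0, keeping the matrix shape and column names
storage.mode(m) <- "double"

df3 <- data.frame(check.names = FALSE, m)
```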
 
When you want to optimize code, it is often handy to write functions like these 
to do the timing for various problem sizes.  You can quickly experiment with 
small versions of the problem to make sure the results are correct and the time 
looks reasonable, and later see whether the times scale up as hoped to your 
desired problem size.
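
A sketch of that workflow, reusing g2 from above so the example is 
self-contained.  The problem sizes and the names sizes and elapsed are 
arbitrary choices for illustration, and the timings will depend on the machine:

```r
# g2 as given earlier in this message
g2 <- function(n = 1e+05) {
  orig <- c(rep("11111111", n), rep("22222222", 2 * n),
            rep("33333333", 3 * n), rep("44444444", 4 * n))
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique
  time <- system.time(df <- data.frame(check.names = FALSE,
            lapply(orig.unique, function(x) as.numeric(orig == x))))
  list(time = time, df = df)
}

# Problem sizes to try (hypothetical; start small, then scale up)
sizes <- c(1e2, 1e3, 1e4)

# Collect the elapsed seconds for each size
elapsed <- sapply(sizes, function(n) g2(n)$time[["elapsed"]])

# One row per problem size, to see how the time grows with n
data.frame(n = sizes, elapsed = elapsed)
```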

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
