Re: [R] Speed Advice for R --- avoid data frames

Uwe Ligges Sat, 02 Jul 2011 11:59:33 -0700

Some comments:

The comparison of matrix rows vs. matrix columns is incorrect. Note that R has lazy evaluation: the matrix is constructed during the timing for the rows and is already constructed by the time the columns are timed. Hence you want to use:


 M <- matrix( rnorm(C*R), nrow=R )
 D <- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
 example(M)
 example(D)
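
To make the lazy-evaluation point concrete, here is a minimal sketch. The names `peek` and `evaluated` are made up for illustration and do not come from the original post. The argument expression sets a flag when it runs, and the flag only flips once the function body first touches `m`:

```r
evaluated <- FALSE

# Hypothetical helper: records whether the argument has been evaluated
# before and after its first use inside the function body.
peek <- function(m) {
  before <- evaluated   # argument is still an unevaluated promise here
  m[1, 1]               # first use forces the promise: the matrix is built now
  after <- evaluated
  c(before = before, after = after)
}

# The braces are the argument expression; it runs only when 'm' is first used.
res <- peek({ evaluated <- TRUE; matrix(rnorm(4), nrow = 2) })
print(res)
```

In the original benchmark the same thing happens inside the first system.time() call, so the first loop is charged for building the matrix as well. Assigning M and D beforehand, as above, removes that cost from the measurement.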

Further on, you are correct with your statement that data.frame indexing is much slower, but if you can store your data in matrix form, just go on as it is.

I doubt anybody is really going to perform the index operation you cited within a loop. And with a data.frame, I can live with the speed of vectorized replacements:


> system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
   user  system elapsed
   0.01    0.00    0.01

> system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
   user  system elapsed
   0.51    0.00    0.52

OK, it would be nice to do that faster, but this is not easy. I think R Core is happy to see contributions to make it faster without breaking existing features.
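
If an element-by-element loop really cannot be avoided, one possible workaround (a sketch under assumptions, not something measured in this thread) is to pull the column out of the data frame once, loop over the plain vector, and write it back once. The names `x`, `noise` and `orig` are illustrative, and the noise is drawn up front only so that the result can be checked:

```r
set.seed(1)
D <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))
orig  <- D[[20]]                # kept only to verify the result below
noise <- rnorm(nrow(D))

x <- D[[20]]                    # one extraction from the data frame
for (r in seq_along(x)) {
  x[r] <- sqrt(abs(x[r])) + noise[r]   # indexing a plain numeric vector
}
D[[20]] <- x                    # one assignment back into the data frame

# Same result as the vectorized replacement
print(all.equal(D[[20]], sqrt(abs(orig)) + noise))
```

The loop then never touches the data frame itself, so the data.frame replacement method is called once instead of once per element.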




Best wishes,
Uwe



On 02.07.2011 20:35, ivo welch wrote:

This email is intended for R users that are not that familiar with R
internals and are searching google about how to speed up R.

Despite common misperception, R is not slow when it comes to iterative
access.  R is fast when it comes to matrices.  R is very slow when it
comes to iterative access into data frames.  Such access occurs when a
user uses "data$varname[index]", which is a very common operation.  To
illustrate, run the following program:

R <- 1000; C <- 1000

example <- function(m) {
   cat("rows: ")
   cat(system.time(for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1)), "\n")
   cat("columns: ")
   cat(system.time(for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1)), "\n")
   if (is.data.frame(m)) {
      cat("df: columns as names: ")
      cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
   }
}

cat("\n**** Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n**** Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )


The following are the timings reported under R 2.12.0 on a Mac Pro 3,1
with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column
access, and 300 times slower than matrix row access.  R's data frame
operational speed is an amazing 40 data accesses per second.  I have
not seen access numbers this low for decades.


How to avoid it?  Not easy.  One way is to create multiple matrices,
and group them as an object.  Of course, this loses a lot of features
of R.  Another way is to copy all data used in calculations out of the
data frame into a matrix, do the operations, and then copy them back.
Not ideal, either.
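
The copy-out, compute, copy-back approach can be sketched as follows. This is a minimal illustration that assumes every column is numeric; as.matrix() on a data frame with mixed column types coerces everything to a common type, typically character. The names `M`, `D2` and `noise` are illustrative:

```r
set.seed(42)
D <- as.data.frame(matrix(rnorm(1000 * 1000), nrow = 1000))
noise <- rnorm(ncol(D))

M <- as.matrix(D)                        # copy out: numeric matrix, column names kept
M[20, ] <- sqrt(abs(M[20, ])) + noise    # row operation on the matrix
D2 <- as.data.frame(M)                   # copy back into a data frame

print(dim(D2))
```

The two conversions each copy the whole data set, so this is only worthwhile when enough work is done on the matrix in between.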

In my opinion, this is an R design flaw.  Data frames are the
fundamental unit of much statistical analysis, and should be fast.  I
think R lacks any indexing into data frames.  Turning on indexing of
data frames should at least be an optional feature.


I hope this post helps others.

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

