Some comments:

The comparison of matrix rows vs. matrix columns is incorrect. R uses lazy evaluation, so the matrix is constructed inside the timing for the rows and already exists by the time the columns are timed. Hence you want to construct the objects first and then pass them in:

 M <- matrix( rnorm(C*R), nrow=R )
 D <- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
 example(M)
 example(D)
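
A minimal sketch of why this matters, assuming two hypothetical helpers (make_matrix and f) and a hypothetical events vector that records the order of evaluation. The argument is only evaluated when the function body first touches it, so whatever is timed first also pays for the construction:

```r
# Hypothetical bookkeeping vector: records the order in which things happen.
events <- character(0)

# Hypothetical constructor that logs when it actually runs.
make_matrix <- function() {
  events <<- c(events, "matrix constructed")
  matrix(rnorm(4), nrow = 2)
}

# Hypothetical function: 'm' is a promise until it is first used.
f <- function(m) {
  events <<- c(events, "entered function")
  m[1, 1]                      # first use forces the promise
  events <<- c(events, "first access done")
  m[2, 2]                      # promise already forced, no reconstruction
  invisible(NULL)
}

f(make_matrix())
print(events)
```

The construction is logged after the function has been entered, which is the situation in the original timing of the rows.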

Furthermore, you are correct that data.frame indexing is much slower, but if you can store your data in matrix form, just keep using a matrix.

I doubt anybody is really going to perform the index operation you cited within a loop. With vectorized replacements, I can live with the data.frame timings:

> system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
   user  system elapsed
   0.01    0.00    0.01

> system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
   user  system elapsed
   0.51    0.00    0.52

OK, it would be nice to do that faster, but this is not easy. I think R Core is happy to see contributions to make it faster without breaking existing features.
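
A small sketch, assuming a smaller data frame than in the timings and a pre-drawn noise vector (both are my assumptions, chosen so the two variants can be compared), showing that the vectorized column replacement gives the same result as the element-by-element loop:

```r
set.seed(1)
D <- as.data.frame(matrix(rnorm(100 * 50), nrow = 100))
noise <- rnorm(nrow(D))        # assumption: noise drawn once so both variants match

# Element-by-element replacement, as in the loop from the original post
D_loop <- D
for (r in seq_len(nrow(D_loop))) {
  D_loop[r, 20] <- sqrt(abs(D_loop[r, 20])) + noise[r]
}

# Vectorized replacement of the whole column in one call
D_vec <- D
D_vec[, 20] <- sqrt(abs(D_vec[, 20])) + noise

same <- isTRUE(all.equal(D_loop[[20]], D_vec[[20]]))
print(same)
```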



Best wishes,
Uwe



On 02.07.2011 20:35, ivo welch wrote:
This email is intended for R users that are not that familiar with R
internals and are searching google about how to speed up R.

Despite common misperception, R is not slow when it comes to iterative
access.  R is fast when it comes to matrices.  R is very slow when it
comes to iterative access into data frames.  Such access occurs when a
user uses "data$varname[index]", which is a very common operation.  To
illustrate, run the following program:

R<- 1000; C<- 1000

example <- function(m) {
   cat("rows: ")
   cat(system.time(for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1)), "\n")
   cat("columns: ")
   cat(system.time(for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1)), "\n")
   if (is.data.frame(m)) {
      cat("df: columns as names: ")
      cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
   }
}

cat("\n**** Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n**** Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )


The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column
access, and 300 times slower than matrix row access.  R's data frame
operational speed is an amazing 40 data accesses per second.  I have
not seen access numbers this low for decades.
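
The ratios above can be rechecked from the reported timings. This is only a sketch of the arithmetic; counting two accesses (one read, one write) per loop iteration is my assumption about how the figure of 40 was obtained:

```r
# Reported timings in seconds, copied from the list above
df_columns  <- 53
mat_columns <- 0.01
mat_rows    <- 0.175
iterations  <- 1000            # R = C = 1000 in the example program

ratio_vs_mat_columns <- df_columns / mat_columns   # about 5300
ratio_vs_mat_rows    <- df_columns / mat_rows      # about 303

# Assumption: each iteration performs one read and one write
accesses_per_second <- 2 * iterations / df_columns # about 38

print(c(ratio_vs_mat_columns, ratio_vs_mat_rows, accesses_per_second))
```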


How to avoid it?  Not easy.  One way is to create multiple matrices,
and group them as an object.  Of course, this loses a lot of features
of R.  Another way is to copy all data used in calculations out of the
data frame into a matrix, do the operations, and then copy them back.
Not ideal, either.
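
The second workaround can be sketched as follows. This is a minimal illustration, assuming an all-numeric data frame (so as.matrix yields a numeric matrix) and a pre-drawn noise vector so the result can be compared with the direct data frame version:

```r
set.seed(1)
D <- as.data.frame(matrix(rnorm(200 * 30), nrow = 200))
noise <- rnorm(ncol(D))        # assumption: noise drawn once for comparability

# Copy out: all columns are numeric, so this is a numeric matrix
M <- as.matrix(D)

# Do the iterative work on the matrix
for (j in seq_len(ncol(M))) {
  M[20, j] <- sqrt(abs(M[20, j])) + noise[j]
}

# Copy back into a data frame
D2 <- as.data.frame(M)

# Reference: the same loop run directly on the data frame
D_ref <- D
for (j in seq_len(ncol(D_ref))) {
  D_ref[20, j] <- sqrt(abs(D_ref[20, j])) + noise[j]
}

same <- isTRUE(all.equal(unlist(D2[20, ], use.names = FALSE),
                         unlist(D_ref[20, ], use.names = FALSE)))
print(same)
```

With columns of mixed types, as.matrix would coerce everything to a common type, so this only applies to the numeric part of a data frame.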

In my opinion, this is an R design flaw.  Data frames are the
fundamental unit of much statistical analysis, and should be fast.  I
think R lacks any indexing into data frames.  Turning on indexing of
data frames should at least be an optional feature.


I hope this post helps others.

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
