Hi Uwe, thanks for the clarification. Of course, my example should always be done in vectorized form; I only used it to show how iterative access compares in the simplest possible fashion. Fewer than 100 accesses per second is REALLY slow, though.
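(For the record, a minimal sketch of the vectorized form of the same toy update, in which one whole-column operation replaces the 1000 scalar accesses; the exact timings will of course vary by machine:)

```r
## vectorized form of the toy example: one whole-column update
## instead of 1000 individual scalar accesses into the structure
R <- 1000; C <- 1000
m <- matrix(rnorm(C * R), nrow = R)
D <- as.data.frame(m)

system.time(m[, 20] <- sqrt(abs(m[, 20])) + rnorm(R))  # matrix: effectively instant
system.time(D[, 20] <- sqrt(abs(D[, 20])) + rnorm(R))  # data frame: also fast when vectorized
```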
I don't know R internals, and the learning curve would be steep. Moreover, there is no guarantee that changes I made would be accepted, so I cannot do this. However, for an R expert, this should not be too difficult. Conceptually, if the data frame element access primitives in the code are create/write/read/destroy, then it's truly trivial: just add a matrix (same dim as the data frame) of byte pointers into the storage, maintained at creation/change time. This would be quick and dirty. Out of curiosity, do you know which source file has the data frame internals? Maybe I will get tempted anyway if it is simple enough. (A more efficient but more involved way would be to store a data frame internally always as a matrix of data pointers, but that would probably require more surgery.)

It is also not as important for me as it is for others: to give a good impression to those who are not aware of the tradeoffs, which is most people considering adopting R.

/iaw

----
Ivo Welch (ivo.we...@gmail.com)

2011/7/2 Uwe Ligges <lig...@statistik.tu-dortmund.de>

> Some comments:
>
> The comparison matrix rows vs. matrix columns is incorrect: note that R has
> lazy evaluation, hence you construct your matrix in the timing for the rows,
> and it is already constructed in the timing for the columns. Hence you want
> to use:
>
> M <- matrix( rnorm(C*R), nrow=R )
> D <- as.data.frame( matrix( rnorm(C*R), nrow=R ) )
> example(M)
> example(D)
>
> Further on, you are correct with your statement that data.frame indexing is
> much slower, but if you can store your data in matrix form, just go on as
> it is.
>
> I doubt anybody is really going to use the index operation you cited
> within a loop.
> Then, with a data.frame, I can live with many vectorized
> replacements again:
>
> > system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
>    user  system elapsed
>    0.01    0.00    0.01
>
> > system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
>    user  system elapsed
>    0.51    0.00    0.52
>
> OK, it would be nice to do that faster, but this is not easy. I think R
> Core is happy to see contributions to make it faster without breaking
> existing features.
>
> Best wishes,
> Uwe
>
> On 02.07.2011 20:35, ivo welch wrote:
>
>> This email is intended for R users who are not that familiar with R
>> internals and are searching Google for how to speed up R.
>>
>> Despite a common misperception, R is not slow at iterative access per
>> se. R is fast when it comes to matrices. R is very slow when it
>> comes to iterative access into data frames. Such access occurs whenever
>> a user writes "data$varname[index]", which is a very common operation.
>> To illustrate, run the following program:
>>
>> R <- 1000; C <- 1000
>>
>> example <- function(m) {
>>   cat("rows: ")
>>   cat(system.time( for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1) ), "\n")
>>   cat("columns: ")
>>   cat(system.time( for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1) ), "\n")
>>   if (is.data.frame(m)) {
>>     cat("df: columns as names: ")
>>     cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n")
>>   }
>> }
>>
>> cat("\n**** Now as matrix\n")
>> example( matrix( rnorm(C*R), nrow=R ) )
>>
>> cat("\n**** Now as data frame\n")
>> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
>>
>> The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
>> with ample RAM:
>>
>>   matrix, columns:     0.01s
>>   matrix, rows:        0.175s
>>   data frame, columns: 53s
>>   data frame, rows:    56s
>>   data frame, names:   58s
>>
>> Data frame access is about 5,000 times slower than matrix column
>> access, and 300 times slower than matrix row access.
>> R's data frame
>> operational speed is an amazing 40 data accesses per second. I have
>> not seen access numbers this low in decades.
>>
>> How to avoid it? Not easy. One way is to create multiple matrices
>> and group them as an object; of course, this loses a lot of the
>> features of R. Another way is to copy all data used in calculations
>> out of the data frame into a matrix, do the operations, and then copy
>> them back. Not ideal, either.
>>
>> In my opinion, this is an R design flaw. Data frames are the
>> fundamental unit of much statistical analysis, and they should be
>> fast. I believe R lacks any indexing into data frames. Turning on
>> indexing of data frames should at least be an optional feature.
>>
>> I hope this post helps others.
>>
>> /iaw
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>> http://www.ivo-welch.info/
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
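(A minimal sketch of the copy-out/copy-back workaround described in the quoted message, assuming the same 1000x1000 example frame: the iterative work happens on a plain numeric vector, where element access is cheap, with a single write back into the data frame at the end.)

```r
## copy-out/copy-back workaround: extract the column as a plain
## vector, do the slow iterative work there, then write back once
n <- 1000
D <- as.data.frame(matrix(rnorm(n * n), nrow = n))

v <- D[[20]]                                       # plain numeric vector: fast scalar access
for (r in 1:n) v[r] <- sqrt(abs(v[r])) + rnorm(1)  # the iterative update from the example
D[[20]] <- v                                       # one single write back into the frame
```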