Re: [R] Speed Advice for R --- avoid data frames

Uwe Ligges Sun, 03 Jul 2011 09:20:01 -0700


On 02.07.2011 21:35, ivo welch wrote:

hi uwe---thanks for the clarification.  of course, my example should always
be done in vectorized form.  I only used it to show how iterative access
compares in the simplest possible fashion.  <100 accesses per second is
REALLY slow, though.

I don't know R internals and the learning curve would be steep.  moreover,
there is no guarantee that changes I would make would be accepted.  so, I
cannot do this.

however, for an R expert, this should not be too difficult.  conceptually,
if data frame element access primitives are create/write/read/destroy in the
code, then it's truly trivial.  just add a matrix (dim the same as the data
frame) of byte pointers to point at the storage upon creation/change time.
  this would be quick-and-dirty.  for curiosity, do you know which source
file has the data frame internals?  maybe I will get tempted anyway if it is
simple enough.

I think you should start by looking at the mechanisms used to construct
data.frames (such as data.frame) and learn that data.frames are special
lists. Then you may want to look at the differences between the
.Primitive("[") and .Primitive("[<-") used for vectors (including vectors
with dim attributes such as matrices) and the corresponding methods for
data.frames: "[<-.data.frame" and "[.data.frame".
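
The distinction can be checked directly in an R session. This is only a minimal sketch: the toy objects m and df are made up for illustration and do not come from the thread.

```r
# Hypothetical toy objects, for illustration only
m  <- matrix(1:6, nrow = 2)
df <- as.data.frame(m)

is.list(df)     # TRUE: a data.frame is a list of column vectors
is.list(m)      # FALSE: a matrix is an atomic vector with a dim attribute
length(df)      # 3: the length of a data.frame is its number of columns

# Default indexing on vectors and matrices is a primitive implemented in C
is.primitive(`[`)
is.primitive(`[<-`)

# The data.frame methods are ordinary R functions (closures), so every
# call runs interpreted R code before any element is touched
is.function(`[.data.frame`)
is.primitive(`[<-.data.frame`)   # FALSE
```

Printing `[<-.data.frame` at the prompt shows the amount of R-level bookkeeping that each single-element assignment has to go through.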

After that, I doubt you will want to take this any further. Note also that
data.frames can be pretty large and you really do not want to store a
matrix of pointers as large as the data.frame. People working with large
data.frames won't be happy with such a suggestion.
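
As a rough back-of-the-envelope sketch of that memory cost: the dimensions below are hypothetical, and 8-byte pointers (a 64-bit build) are an assumption.

```r
n <- 1e6; p <- 100            # hypothetical data frame: 1e6 rows, 100 columns
pointer_bytes <- 8            # assumption: 64-bit pointers
cell_bytes    <- 8            # a numeric (double) value takes 8 bytes

data_bytes  <- n * p * cell_bytes      # storage for the numeric data itself
index_bytes <- n * p * pointer_bytes   # storage for one pointer per cell

index_bytes / 2^20            # size of the pointer matrix in MiB
index_bytes / data_bytes      # 1: the pointer matrix is as large as the data
```

Under these assumptions an all-numeric data frame would need about twice the memory, and more than twice for integer or logical columns, whose cells are smaller than a pointer.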

If you want to follow up, I'd suggest moving the thread to R-devel, where
it seems to be more appropriate.


Best,
Uwe


(a more efficient but more involved way to do this would be to store a data
frame internally always as a matrix of data pointers, but this would
probably require more surgery.)

It is also not as important for me as it is for others...to give a good
impression to those that are not aware of the tradeoffs---which is most
people considering adopting R.

/iaw


----
Ivo Welch (ivo.we...@gmail.com)




2011/7/2 Uwe Ligges<lig...@statistik.tu-dortmund.de>

Some comments:

The comparison of matrix rows vs. matrix columns is incorrect: R has lazy
evaluation, so your matrix is constructed inside the timing for the rows
and is already constructed in the timing for the columns. Hence you want
to use:

  M<- matrix( rnorm(C*R), nrow=R )
  D<- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
  example(M)
  example(D)
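
The effect of lazy evaluation on such timings can be sketched as follows. The helper use_twice and the counter environment are hypothetical names introduced only for this illustration.

```r
# Hypothetical counter that records how often the argument is evaluated
counter <- new.env()
counter$n <- 0

use_twice <- function(x) {
  before <- counter$n   # the argument has not been evaluated yet
  x                     # first use forces the promise: the work happens here
  x                     # second use reuses the already computed value
  c(before = before, after = counter$n)
}

# The argument expression bumps the counter whenever it is evaluated
res <- use_twice({ counter$n <- counter$n + 1; rnorm(10) })
print(res)              # before = 0, after = 1
```

The expression passed as an argument is evaluated once, at its first use inside the function, so whichever timed block touches the argument first also pays for constructing it.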

Further on, you are correct with your statement that data.frame indexing is
much slower, but if you can store your data in matrix form, just go on as it
is.

I doubt anybody is really going to perform the index operation you cited
within a loop. With a data.frame, I can live with vectorized replacements
instead:

system.time(D[,20]<- sqrt(abs(D[,20])) + rnorm(1000))

   user  system elapsed
   0.01    0.00    0.01

system.time(D[20,]<- sqrt(abs(D[20,])) + rnorm(1000))

   user  system elapsed
   0.51    0.00    0.52

OK, it would be nice to do that faster, but this is not easy. I think R
Core is happy to see contributions to make it faster without breaking
existing features.



Best wishes,
Uwe




On 02.07.2011 20:35, ivo welch wrote:

This email is intended for R users that are not that familiar with R
internals and are searching google about how to speed up R.

Despite common misperception, R is not slow when it comes to iterative
access.  R is fast when it comes to matrices.  R is very slow when it
comes to iterative access into data frames.  Such access occurs when a
user uses "data$varname[index]", which is a very common operation.  To
illustrate, run the following program:

R<- 1000; C<- 1000

example <- function(m) {
   # "rows": walk down column 20, one row at a time
   cat("rows: ")
   cat(system.time(for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1)), "\n")
   # "columns": walk across row 20, one column at a time
   cat("columns: ")
   cat(system.time(for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1)), "\n")
   if (is.data.frame(m)) {
      # the same walk across row 20, using list-style column access
      cat("df: columns as names: ")
      cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
   }
}

cat("\n**** Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n**** Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )


The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column
access, and 300 times slower than matrix row access.  R's data frame
operational speed is an amazing 40 data accesses per second.  I have
not seen access numbers this low for decades.


How to avoid it?  Not easy.  One way is to create multiple matrices,
and group them as an object.  Of course, this loses a lot of features
of R.  Another way is to copy all data used in calculations out of the
data frame into a matrix, do the operations, and then copy them back.
Not ideal, either.
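
The second workaround can be sketched as follows. This is a minimal illustration that assumes all columns are numeric; the object names, the small sizes, and the deterministic update (a constant in place of rnorm(1), so that both results can be compared) are assumptions made for the sketch.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
R <- 200; C <- 5                  # small toy dimensions
d <- as.data.frame(matrix(rnorm(R * C), nrow = R))

# Slow path: element-wise loop directly on the data frame
d_direct <- d
for (r in seq_len(R)) d_direct[r, 3] <- sqrt(abs(d_direct[r, 3])) + 1

# Workaround: copy out to a matrix, loop there, copy back
m <- as.matrix(d)                 # safe here because every column is numeric
for (r in seq_len(R)) m[r, 3] <- sqrt(abs(m[r, 3])) + 1
d_via_matrix <- as.data.frame(m)

all.equal(d_direct, d_via_matrix) # both paths give the same data frame
```

With mixed column types, as.matrix() would coerce everything to a common type (typically character), so in that case only the numeric columns should be copied out.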

In my opinion, this is an R design flaw.  Data frames are the
fundamental unit of much statistical analysis, and should be fast.  I
think R lacks any indexing into data frames.  Turning on indexing of
data frames should at least be an optional feature.


I hope this message post helps others.

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

