Hi Uwe, thanks for the clarification. Of course, my example should always be done in vectorized form; I only used it to show how iterative access compares in the simplest possible fashion. Fewer than 100 accesses per second is REALLY slow, though.
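(For the record, a minimal sketch of the vectorized form of the same toy update, in which one whole-column operation replaces the 1000 scalar accesses; the exact timings will of course vary by machine:)

```r
## vectorized form of the toy example: one whole-column update
## instead of 1000 individual scalar accesses into the structure
R <- 1000; C <- 1000
m <- matrix(rnorm(C * R), nrow = R)
D <- as.data.frame(m)

system.time(m[, 20] <- sqrt(abs(m[, 20])) + rnorm(R))  # matrix: effectively instant
system.time(D[, 20] <- sqrt(abs(D[, 20])) + rnorm(R))  # data frame: also fast when vectorized
```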
I don't know R internals, and the learning curve would be steep. Moreover, there is no guarantee that changes I made would be accepted, so I cannot do this. However, for an R expert, this should not be too difficult. Conceptually, if the data frame element access primitives in the code are create/write/read/destroy, then it's truly trivial: just add a matrix (same dim as the data frame) of byte pointers into the storage, maintained at creation/change time. This would be quick and dirty. Out of curiosity, do you know which source file has the data frame internals? Maybe I will get tempted anyway if it is simple enough. (A more efficient but more involved way would be to store a data frame internally always as a matrix of data pointers, but that would probably require more surgery.)

It is also not as important for me as it is for others: to give a good impression to those who are not aware of the tradeoffs, which is most people considering adopting R.

/iaw

----
Ivo Welch (ivo.we...@gmail.com)

2011/7/2 Uwe Ligges <lig...@statistik.tu-dortmund.de>

> Some comments:
>
> The comparison matrix rows vs. matrix columns is incorrect: note that R has
> lazy evaluation, hence you construct your matrix in the timing for the rows,
> and it is already constructed in the timing for the columns. Hence you want
> to use:
>
> M <- matrix( rnorm(C*R), nrow=R )
> D <- as.data.frame( matrix( rnorm(C*R), nrow=R ) )
> example(M)
> example(D)
>
> Further on, you are correct with your statement that data.frame indexing is
> much slower, but if you can store your data in matrix form, just go on as
> it is.
>
> I doubt anybody is really going to use the index operation you cited
> within a loop.
> Then, with a data.frame, I can live with many vectorized
> replacements again:
>
> > system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
>    user  system elapsed
>    0.01    0.00    0.01
>
> > system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
>    user  system elapsed
>    0.51    0.00    0.52
>
> OK, it would be nice to do that faster, but this is not easy. I think R
> Core is happy to see contributions to make it faster without breaking
> existing features.
>
> Best wishes,
> Uwe
>
> On 02.07.2011 20:35, ivo welch wrote:
>
>> This email is intended for R users who are not that familiar with R
>> internals and are searching Google for how to speed up R.
>>
>> Despite a common misperception, R is not slow at iterative access per
>> se. R is fast when it comes to matrices. R is very slow when it
>> comes to iterative access into data frames. Such access occurs whenever
>> a user writes "data$varname[index]", which is a very common operation.
>> To illustrate, run the following program:
>>
>> R <- 1000; C <- 1000
>>
>> example <- function(m) {
>>   cat("rows: ")
>>   cat(system.time( for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1) ), "\n")
>>   cat("columns: ")
>>   cat(system.time( for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1) ), "\n")
>>   if (is.data.frame(m)) {
>>     cat("df: columns as names: ")
>>     cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n")
>>   }
>> }
>>
>> cat("\n**** Now as matrix\n")
>> example( matrix( rnorm(C*R), nrow=R ) )
>>
>> cat("\n**** Now as data frame\n")
>> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
>>
>> The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
>> with ample RAM:
>>
>>   matrix, columns:     0.01s
>>   matrix, rows:        0.175s
>>   data frame, columns: 53s
>>   data frame, rows:    56s
>>   data frame, names:   58s
>>
>> Data frame access is about 5,000 times slower than matrix column
>> access, and 300 times slower than matrix row access.
>> R's data frame
>> operational speed is an amazing 40 data accesses per second. I have
>> not seen access numbers this low in decades.
>>
>> How to avoid it? Not easy. One way is to create multiple matrices
>> and group them as an object; of course, this loses a lot of the
>> features of R. Another way is to copy all data used in calculations
>> out of the data frame into a matrix, do the operations, and then copy
>> them back. Not ideal, either.
>>
>> In my opinion, this is an R design flaw. Data frames are the
>> fundamental unit of much statistical analysis, and they should be
>> fast. I believe R lacks any indexing into data frames. Turning on
>> indexing of data frames should at least be an optional feature.
>>
>> I hope this post helps others.
>>
>> /iaw
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>> http://www.ivo-welch.info/
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
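(A minimal sketch of the copy-out/copy-back workaround described in the quoted message, assuming the same 1000x1000 example frame: the iterative work happens on a plain numeric vector, where element access is cheap, with a single write back into the data frame at the end.)

```r
## copy-out/copy-back workaround: extract the column as a plain
## vector, do the slow iterative work there, then write back once
n <- 1000
D <- as.data.frame(matrix(rnorm(n * n), nrow = n))

v <- D[[20]]                                       # plain numeric vector: fast scalar access
for (r in 1:n) v[r] <- sqrt(abs(v[r])) + rnorm(1)  # the iterative update from the example
D[[20]] <- v                                       # one single write back into the frame
```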