This is just a quick, incomplete response, but the main misconception here is 
really the use of data.frames. If you don't need the elaborate machinery of 
data frames, in particular the management of row names, then they are 
definitely the wrong tool, because most of the overhead goes into exactly 
that row-name bookkeeping and you pay a substantial penalty for it. Drop that 
one feature and you get timings similar to a matrix:

> m=matrix(rnorm(C*R), nrow=R)
> example(m)
rows: 0.015 0 0.015 0 0 
columns: 0.01 0 0.01 0 0 

> l=as.list(as.data.frame( matrix( rnorm(C*R), nrow=R ) ))
> example(l)
rows: 0.015 0 0.016 0 0 
columns: 0.012 0 0.011 0 0 

(with example modified to use m[[y]][x] instead of m[x,y])
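
For reference, here is a sketch of what that modified example looks like for 
the list case (my reconstruction, the function name is just illustrative; it 
reuses R and C from the script below):

example_list <- function(m) {
  # m is a plain list of column vectors; element (x, y) is addressed as
  # m[[y]][x], i.e. pick the column first, then index into it
  cat("rows: ")
  cat(system.time( for (r in 1:R) m[[20]][r] <- sqrt(abs(m[[20]][r])) + rnorm(1) ), "\n")
  cat("columns: ")
  cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n")
}
example_list(l)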

I would not be surprised if many people use data.frames simply for the 
convenience of the matrix-style subsetting/subassignment operators and don't 
really care about the row names; for all those uses data.frames are the wrong 
tool. (Just look at `[.data.frame` and `[<-.data.frame`.)
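
To see what I mean, just print those methods; every d[i, j] or d[i, j] <- v 
has to run through all of that R-level code on each call:

# printing the standard S3 methods shows how much R code runs per access
`[.data.frame`
`[<-.data.frame`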

As a side note, it's a bit pointless to compare the performance to matrices, 
as they impose a much more rigid structure (all columns have the same type); 
if you use data frames in such special (rare) cases, it's really your own 
fault ;). So the bottom line is to educate users not to use data frames where 
they are not needed and/or to provide alternatives (and there may be some 
things coming up there, too).
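
One user-level alternative along those lines (just a sketch of the pattern, 
nothing that ships with R; it hard-codes column 20 to mirror the benchmark): 
do the per-element work on a plain list of columns and only touch the data 
frame once at the end:

update_col20 <- function(d) {
  l <- as.list(d)                # plain list of column vectors, no row-name bookkeeping
  for (r in seq_len(nrow(d)))    # the element-wise loop now runs on ordinary vectors
    l[[20]][r] <- sqrt(abs(l[[20]][r])) + rnorm(1)
  d[[20]] <- l[[20]]             # write the modified column back in a single assignment
  d
}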

And as I said, this is just a quick note, so carry on and comment on the 
original question ;).

Cheers,
Simon


On Jul 2, 2011, at 2:23 PM, ivo welch wrote:

> Dear R developers: R is supposed to be slow for iterative
> calculations.  Actually, it isn't.  Matrix operations are fast.  It is
> data frame operations that are slow.
> 
> R <- 1000
> C <- 1000
> 
> example <- function(m) {
>   cat("rows: ")
>   cat(system.time( for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1) ), "\n")
>   cat("columns: ")
>   cat(system.time( for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1) ), "\n")
>   if (is.data.frame(m)) {
>     cat("df: columns as names: ")
>     cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n")
>   }
> }
> 
> cat("\n**** Now as matrix\n")
> example( matrix( rnorm(C*R), nrow=R ) )
> 
> cat("\n**** Now as data frame\n")
> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
> 
> 
> When m is a data frame, the operation is about 300 times slower than
> when m is a matrix.  The program is basically accessing 1000
> numbers.  When m is a data frame, the speed of R is about 20 accesses
> per second on a Mac Pro.  This is pretty pathetic.
> 
> I do not know the R internals, so the following is pure speculation.
> I understand that an index calculation is faster than a vector lookup
> for arbitrary-size objects, but it seems as if R relies on search to
> find its elements.  Maybe there isn't even a basic vector lookup table.
> A vector lookup table should be possible at least along the dimension
> of consecutive storage.  Another possible improvement would be to add
> an operation that attaches an attribute to the data frame containing a
> full index table for quick lookup.  (If the index table is there, it
> could be used; otherwise, R could simply fall back to the existing
> internal mechanism.)
> 
> I think faster data frame access would significantly improve the
> impression that R makes on novices.  Just my 5 cents.
> 
> /iaw
> ----
> Ivo Welch (ivo.we...@gmail.com)
> http://www.ivo-welch.info/
> J. Fred Weston Professor of Finance
> Anderson School at UCLA, C519
> 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
