Duncan's analysis suggests another way to do this: extract the x vector, operate on it in a loop, then insert the result back into the data.frame. I added a "quicker" option to your df argument and made the test dataset deterministic so we could verify that the algorithms produce the same result:
dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
    if (identical(df, "quicker")) {
        x <- dfr$x
        for(i in 2:length(x)) {
            x[i] <- x[i-1]
        }
        dfr$x <- x
    } else if (df){
        for (i in 2:NROW(dfr)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    } else {
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
    dfr
}

Timings for n = 10^4, 2*10^4, and 4*10^4 show that the time is quadratic in n for the df=TRUE case and close to linear in the other two, with the new method taking about 60% of the time of the matrix method:

> n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
> sapply(n, function(n) system.time(dumkoll(n, df=FALSE))[1:3])
           10k  20k  40k
user.self 0.11 0.22 0.43
sys.self  0.02 0.00 0.00
elapsed   0.12 0.22 0.44
> sapply(n, function(n) system.time(dumkoll(n, df=TRUE))[1:3])
           10k   20k   40k
user.self 3.59 14.74 78.37
sys.self  0.00  0.11  0.16
elapsed   3.59 14.91 78.81
> sapply(n, function(n) system.time(dumkoll(n, df="quicker"))[1:3])
           10k  20k  40k
user.self 0.06 0.12 0.26
sys.self  0.00 0.00 0.00
elapsed   0.07 0.13 0.27

I also timed the two faster cases for n = 10^6; the time still looks linear in n, with the vector approach again taking about 60% of the time of the matrix approach:

> system.time(dumkoll(n=10^6, df=FALSE))
   user  system elapsed
  11.65    0.12   11.82
> system.time(dumkoll(n=10^6, df="quicker"))
   user  system elapsed
   6.79    0.08    6.91

The results from each method are identical:

> identical(dumkoll(100, df=FALSE), dumkoll(100, df=TRUE))
[1] TRUE
> identical(dumkoll(100, df=FALSE), dumkoll(100, df="quicker"))
[1] TRUE

If your data.frame has columns of various types, then as.matrix will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.
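Since that coercion happens silently, here is a minimal sketch (mine, not part of the original timings) of what as.matrix does to a mixed-type data.frame:

```r
## as.matrix() on a mixed-type data.frame coerces every column to a
## common type -- here character -- without any warning.
dfr <- data.frame(x = 1:3, y = c("a", "b", "c"))
m <- as.matrix(dfr)
typeof(m)    # "character": the numeric column x is now text
m[1, "x"]    # "1", a string, so arithmetic on it would fail
```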
Bill Dunlap
TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf
> Of Duncan Murdoch
> Sent: Sunday, March 16, 2014 3:56 PM
> To: Göran Broström; r-help@r-project.org
> Subject: Re: [R] data frame vs. matrix
>
> On 14-03-16 2:57 PM, Göran Broström wrote:
> > I have always known that "matrices are faster than data frames", for
> > instance this function:
> >
> > dumkoll <- function(n = 1000, df = TRUE){
> >     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
> >     if (df){
> >         for (i in 2:NROW(dfr)){
> >             if (!(i %% 100)) cat("i = ", i, "\n")
> >             dfr$x[i] <- dfr$x[i-1]
> >         }
> >     }else{
> >         dm <- as.matrix(dfr)
> >         for (i in 2:NROW(dm)){
> >             if (!(i %% 100)) cat("i = ", i, "\n")
> >             dm[i, 1] <- dm[i-1, 1]
> >         }
> >         dfr$x <- dm[, 1]
> >     }
> > }
> >
> > --------------------
> > > system.time(dumkoll())
> >
> >    user  system elapsed
> >   0.046   0.000   0.045
> >
> > > system.time(dumkoll(df = FALSE))
> >
> >    user  system elapsed
> >   0.007   0.000   0.008
> > ----------------------
> >
> > OK, no big deal, but I stumbled over a data frame with one million
> > records. Then, with df = TRUE,
> > ----------------------------
> >       user     system    elapsed
> >  44677.141   1271.544  46016.754
> > ----------------------------
> > This is around 12 hours.
> >
> > With df = FALSE, it took only six seconds! About 7500 times faster.
> >
> > I was really surprised by the huge difference, and I wonder if this is
> > to be expected, or if it is some peculiarity with my installation: I'm
> > running Ubuntu 13.10 on a MacBook Pro with 8 Gb memory, R-3.0.3.
>
> I don't find it surprising. The line
>
> dfr$x[i] <- dfr$x[i-1]
>
> will be executed about a million times. It does the following:
>
> 1. Get a pointer to the x element of dfr. This requires R to look
> through all the names of dfr to figure out which one is "x".
>
> 2. Extract the i-1 element from it. Not particularly slow.
>
> 3.
> Get a pointer to the x element of dfr again. (R doesn't cache these
> things.)
>
> 4. Set the i element of it to a new value. This could require the
> entire column or even the entire dataframe to be copied, if R hasn't
> kept track of the fact that it is really being changed in place. In a
> complex assignment like that, I wouldn't be surprised if that took
> place. (In the matrix equivalent, it would be easier to recognize that
> it is safe to change the existing value.)
>
> Luke Tierney is making some changes in R-devel that might help a lot in
> cases like this, but I expect the matrix code will always be faster.
>
> Duncan Murdoch
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
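As a footnote to Duncan's point 4 above: R's copy-on-modify behaviour can be watched with tracemem(), which prints a line whenever R duplicates the traced object (it needs an R build with memory profiling enabled; the CRAN binaries have it). The sketch below is mine, not from the thread; it illustrates why the extract-loop-reassign pattern copies the column at most once:

```r
## Copy-on-modify in action (a sketch, not from the thread).
dfr <- data.frame(x = c(1, 2, 3))
x <- dfr$x                # extract the column: x shares dfr's data, no copy yet
invisible(tracemem(x))    # ask R to report duplications of x
x[2] <- x[1]              # first write: R duplicates x, since dfr still shares it
x[3] <- x[2]              # later writes modify the now-private vector in place
dfr$x[2]                  # still 2: the data.frame was never touched
dfr$x <- x                # a single reassignment puts the result back
```

This is why the "quicker" variant scales linearly: inside its loop only a plain, unshared vector is updated in place, and the data.frame itself is touched just twice (one extraction, one reassignment).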