Re: [R] slow computation of functions over large datasets

David Winsemius Wed, 03 Aug 2011 08:11:30 -0700


On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:

Dear Caroline,

Here is a faster and more elegant solution.
n <- 10000
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace= TRUE), itemPrice = rpois(n, 10))
library(plyr)
system.time({
+       ddply(exampledata, .(orderID), function(x){
+ data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+       })
+ })
  user  system elapsed
  1.67    0.00    1.69
exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
system.time(for (i in 2:length(exampledata[,1]))
+ {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
+     exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
+     exampledata[i,"itemPrice"])})
  user  system elapsed
 11.94    0.02   11.97

I tried running this method on the "large dataset" (2MM rows) the OP offered, and eventually needed to interrupt it so I could get my console back:


> system.time({
+       ddply(exampledata2, .(orderID), function(x){
+ data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+       })
+  })

Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.
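
The ave() call itself is not shown in this message. A minimal sketch of what such a call might look like, assuming the column names from the original post and cumsum as the per-group function (both are assumptions, not the exact code that was timed):

```r
# Hypothetical reconstruction: the exact ave() call was not given in the post.
# Small example data copied from the original question.
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# ave() splits itemPrice by orderID, applies cumsum() within each group,
# and returns the results in the original row order.
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID,
                               FUN = cumsum)

print(exampledata)
```

Like ddply(), this groups every row that shares an orderID, whether or not the rows are adjacent.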

--
david.

Best regards,

Thierry
-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
On Behalf Of Caroline Faisst
Sent: Wednesday, 3 August 2011 15:26
To: r-help@r-project.org
Subject: [R] slow computation of functions over large datasets

Hello there,
I'm computing the total value of an order from the prices of the order items using a "for" loop and the "ifelse" function. I do this on a large data frame (close to 1m rows). The computation is painfully slow: in 1 min only about 90 rows are calculated.
The computation time for a given number of rows increases with the size of the dataset; see the example with my function below:


# small dataset: function performs well

exampledata <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4),
                          itemPrice=c(10,17,9,12,25,10,1,9,7))
exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]

system.time(for (i in 2:length(exampledata[,1]))
{exampledata[i,"orderAmount"]<-
ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],
       exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],
       exampledata[i,"itemPrice"])})

# large dataset: the very same computational task takes much longer

exampledata2 <- data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),
                           itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))

exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]

system.time(for (i in 2:9)
{exampledata2[i,"orderAmount"]<-
ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],
       exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],
       exampledata2[i,"itemPrice"])})

Does someone know a way to increase the speed?
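
For reference, the loop above restarts the running total whenever orderID differs from the previous row, so it works on runs of adjacent rows. A sketch of a loop-free way to get the same numbers might look like the following. The function name run_cumsum is hypothetical, and this is a swapped-in technique (one global cumsum() minus a per-run offset), not something proposed in the thread; it assumes at least one row and no missing values.

```r
# Hypothetical helper, not from the thread: running total that restarts
# whenever `id` changes from one row to the next.
# Assumes length(id) == length(price) >= 1 and no NA values.
run_cumsum <- function(id, price) {
  price <- as.numeric(price)            # avoid integer overflow in cumsum()
  # TRUE at the first row and wherever id differs from the previous row
  new_run <- c(TRUE, id[-1] != id[-length(id)])
  run_id  <- cumsum(new_run)            # 1, 2, 3, ... label for each run
  total   <- cumsum(price)              # running total over all rows
  start   <- which(new_run)             # row index where each run begins
  # running total just before each run starts (0 for the first run)
  offset  <- c(0, total[start[-1] - 1])
  total - offset[run_id]
}

exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)
exampledata$orderAmount <- run_cumsum(exampledata$orderID,
                                      exampledata$itemPrice)
print(exampledata)
```

Every step operates on whole vectors, so no data frame cell is assigned inside a loop. If rows of the same order are not adjacent, the result differs from a grouped cumsum, exactly as the original loop's result does.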


Thank you very much!

Caroline


David Winsemius, MD
West Hartford, CT

