Hi Bill

Thanks for your answer and the explanations. I tried using garbage collection, but I'm still not satisfied with the result. Maybe the question was not stated clearly enough: I want to test the speed of reading/loading data into R when a 'fresh' R session is started (or even after a restart of the computer).
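One way to get closer to a 'fresh' session without rebooting might be to spawn a new R process for every measurement, so that nothing survives in R's own workspace between reads. A minimal sketch (my own, untested on your data; it assumes a Unix-alike with Rscript on the PATH, the example data and repetition count are placeholders, and note that the operating system's file cache is still warm after the first read, so only the first read after a reboot is truly 'cold'):

```r
# Sketch: time readRDS() in a brand-new R process each time, so nothing
# is cached inside R between measurements.
# NOTE: this does NOT empty the OS file cache; only the very first read
# after a reboot is a truly 'cold' read.
f <- tempfile(fileext = ".rds")
saveRDS(runif(1e6), f)  # placeholder data standing in for the real file

expr  <- sprintf("cat(system.time(readRDS('%s'))['elapsed'])", f)
times <- replicate(5, as.numeric(
  system2("Rscript", c("-e", shQuote(expr)), stdout = TRUE)))
print(times)  # elapsed seconds, one fresh R session per value
```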
To understand what really happens, I tried:

    r1 <- sapply(1:10000, function(x) {
      gc()
      t <- system.time(n <- readRDS('file.rds'))[3]
      rm(n)
      gc()
      return(t)
    })

and found a behaviour similar to yours: every now and then the time is much larger, but the times are not as stable as in your example. The highest values are up to 50 times larger than most of the others (8 s vs 0.15 s), even with garbage collection. I assume that with the code above the time spent on garbage collection isn't measured. However, the first iteration always takes the longest. I'm wondering if I should take the first value as the best guess.

Cheers
Raphael

From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Tuesday, 22 August 2017 19:13
To: Felber Raphael Agroscope <raphael.fel...@agroscope.admin.ch>
Cc: r-help@r-project.org
Subject: Re: [R] How to benchmark speed of load/readRDS correctly

Note that if you force a garbage collection on each iteration the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.

    mb_gc <- microbenchmark::microbenchmark(
      gc(),
      { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
      times = 1000,
      control = list(order = "inorder"))
    with(mb_gc, plot(time[expr != "gc()"]))
    with(mb_gc, quantile(1e-6 * time[expr != "gc()"], c(0, .5, .75, .9, .95, .99, 1)))
    #       0%      50%      75%      90%      95%      99%     100%
    # 59.33450 61.33954 63.43457 66.23331 68.93746 74.45629 158.09799

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdun...@tibco.com> wrote:

The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage.
I get a very large time every 102 or 51 iterations, and a moderately large time more often.

    mb <- microbenchmark::microbenchmark(
      { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
      times = 1000)
    plot(mb$time)
    quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
    #        0%       50%       75%       90%       95%       99%      100%
    #  59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
    diff(which(mb$time > quantile(mb$time, .99)))
    # [1] 102  51 102 102 102 102 102 102  51
    diff(which(mb$time > quantile(mb$time, .95)))
    # [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
    #[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 5:53 AM, <raphael.fel...@agroscope.admin.ch> wrote:

Dear all

I was thinking about efficient reading of data into R and tried several ways to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data; the first was created with save(d, file = 'file.Rdata', compress = FALSE) and the second with saveRDS(d, 'file.rds', compress = FALSE).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:

    > library(microbenchmark)
    > microbenchmark(
    +   n <- readRDS('file.rds'),
    +   load('file.Rdata')
    + )
    Unit: milliseconds
                  expr      min       lq     mean   median       uq       max neval
     n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
             load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier.
So I tried:

SECOND TEST:

    > sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
    elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
      10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
    > sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
    elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
       1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: the first time the data are loaded takes much longer than the following times. I suspect this has something to do with how the data are assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains: how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster. But this holds only for a large number of neval; if I set times = 1, then reading the *.Rdata file would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards
Raphael

------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs, Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.fel...@agroscope.admin.ch
www.agroscope.ch

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal,
self-contained, reproducible code.