Hi Bill

Thanks for your answer and the explanations. I tried using garbage collection, but I'm still not satisfied with the result. Maybe the question was not stated clearly enough: I want to test the speed of reading/loading data into R when a 'fresh' R session is started (or even after a restart of the computer).
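One way to get closer to a 'fresh' session without rebooting might be to spawn a new R process for every measurement, so that nothing survives in R's own workspace between reads. A minimal sketch (my own, untested on your data; it assumes a Unix-alike with Rscript on the PATH, the example data and repetition count are placeholders, and note that the operating system's file cache is still warm after the first read, so only the first read after a reboot is truly 'cold'):

```r
# Sketch: time readRDS() in a brand-new R process each time, so nothing
# is cached inside R between measurements.
# NOTE: this does NOT empty the OS file cache; only the very first read
# after a reboot is a truly 'cold' read.
f <- tempfile(fileext = ".rds")
saveRDS(runif(1e6), f)  # placeholder data standing in for the real file

expr  <- sprintf("cat(system.time(readRDS('%s'))['elapsed'])", f)
times <- replicate(5, as.numeric(
  system2("Rscript", c("-e", shQuote(expr)), stdout = TRUE)))
print(times)  # elapsed seconds, one fresh R session per value
```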
To understand what really happens, I tried:

    r1 <- sapply(1:10000, function(x) {
      gc()
      t <- system.time(n <- readRDS('file.rds'))[3]
      rm(n)
      gc()
      return(t)
    })

and found a behaviour similar to yours: every now and then the time is much larger, but the times are not as stable as in your example. The highest values are up to 50 times larger than most of the others (8 s vs 0.15 s), even with garbage collection. I assume that with the code above the time spent on garbage collection isn't measured. However, the first iteration always takes the longest. I'm wondering if I should take the first value as the best guess.

Cheers
Raphael

From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Tuesday, 22 August 2017 19:13
To: Felber Raphael Agroscope <raphael.fel...@agroscope.admin.ch>
Cc: r-help@r-project.org
Subject: Re: [R] How to benchmark speed of load/readRDS correctly

Note that if you force a garbage collection on each iteration the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.

    mb_gc <- microbenchmark::microbenchmark(
      gc(),
      { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
      times = 1000,
      control = list(order = "inorder"))
    with(mb_gc, plot(time[expr != "gc()"]))
    with(mb_gc, quantile(1e-6 * time[expr != "gc()"], c(0, .5, .75, .9, .95, .99, 1)))
    #       0%      50%      75%      90%      95%      99%     100%
    # 59.33450 61.33954 63.43457 66.23331 68.93746 74.45629 158.09799

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdun...@tibco.com> wrote:

The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage.
I get a very large time every 102 or 51 iterations, and a moderately large time more often.

    mb <- microbenchmark::microbenchmark(
      { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
      times = 1000)
    plot(mb$time)
    quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
    #        0%       50%       75%       90%       95%       99%      100%
    #  59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
    diff(which(mb$time > quantile(mb$time, .99)))
    # [1] 102  51 102 102 102 102 102 102  51
    diff(which(mb$time > quantile(mb$time, .95)))
    # [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
    #[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 5:53 AM, <raphael.fel...@agroscope.admin.ch> wrote:

Dear all

I was thinking about efficient reading of data into R and tried several ways to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data; the first was created with save(d, file = 'file.Rdata', compress = FALSE) and the second with saveRDS(d, 'file.rds', compress = FALSE).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:

    > library(microbenchmark)
    > microbenchmark(
    +   n <- readRDS('file.rds'),
    +   load('file.Rdata')
    + )
    Unit: milliseconds
                  expr      min       lq     mean   median       uq       max neval
     n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
             load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier.
So I tried:

SECOND TEST:

    > sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
    elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
      10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
    > sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
    elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
       1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: the first time the data are loaded takes much longer than the following times. I suspect this has something to do with how the data are assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains: how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster. But this holds only for a large number of neval; if I set times = 1, then reading the *.Rdata file would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards
Raphael

------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs, Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.fel...@agroscope.admin.ch
www.agroscope.ch

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal,
self-contained, reproducible code.