Hi all,

I am trying to understand the performance of functions applied to integer sequences. Consider the following:
### begin example ###

library(lobstr)
library(microbenchmark)

x <- sample(1e6)
obj_size(x)
# 4,000,048 B

y <- 1:1e6
obj_size(y)
# 680 B

# So we can see that 'y' uses ALTREP. These give, as expected, the same result:

sum(x)
# [1] 500000500000
sum(y)
# [1] 500000500000

# For 'x', we have to go through the trouble of actually summing up 1e6 integers.
# For 'y', knowing its form, we really just need to do:

1e6*(1e6+1)/2
# [1] 500000500000

# which should be a whole lot faster. And indeed, it is:

microbenchmark(sum(x), sum(y))
# Unit: nanoseconds
#    expr    min       lq      mean   median       uq    max neval cld
#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
#  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a

# Now what about mean()?

mean(x)
# [1] 500000.5
mean(y)
# [1] 500000.5

# which is the same as

(1e6+1)/2
# [1] 500000.5

# But this surprised me:

microbenchmark(mean(x), mean(y))
# Unit: microseconds
#     expr      min        lq     mean   median       uq      max neval cld
#  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
#  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b

### end example ###

So why is mean() on an ALTREP integer sequence slower than on a regular integer vector, when sum() is so much faster? And more generally, when sum() is applied to an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why does the same (apparently) not happen for mean()?

Best,
Wolfgang
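P.S. In case it is useful: below is a small sanity check, in plain base R, that the closed-form expressions above really do agree with sum() and mean() for a general sequence a:b. The helper names seq_sum_closed() and seq_mean_closed() are just made up for this illustration; they are not anything R uses internally.

# Closed-form sum and mean of the integer sequence a:b (hypothetical helpers,
# only for this sanity check):
#   sum(a:b)  = (a+b)*(b-a+1)/2
#   mean(a:b) = (a+b)/2
seq_sum_closed  <- function(a, b) (a + b) * (b - a + 1) / 2
seq_mean_closed <- function(a, b) (a + b) / 2

a <- 1000
b <- 50000  # kept small enough that sum(a:b) stays within integer range

all.equal(seq_sum_closed(a, b),  sum(a:b))   # should be TRUE
all.equal(seq_mean_closed(a, b), mean(a:b))  # should be TRUE

# One way to look at the representation directly (the output format is
# version-dependent, but for the ALTREP case it should mention a compact
# integer sequence, while the sampled vector is a plain INTSXP):
# .Internal(inspect(1:1e6))
# .Internal(inspect(sample(1e6)))

This only covers the arithmetic side, of course; whether and where R actually applies such closed forms to ALTREP sequences is exactly what I am asking above.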