I’m not sure there’s any single correct way to do benchmarks without information about what you’re trying to optimize.
If you’re trying to optimize the experience of people using your code, I think it’s important to use means rather than medians, because you want a metric that’s affected by the entire shape of the distribution of times and not entirely determined by the "center" of that distribution.

If you want a theoretically pure measurement for an algorithm, I think measuring time is kind of problematic. For algorithms, I’d prefer seeing a count of CPU instructions.

— John

On Jun 2, 2014, at 7:32 PM, Kevin Squire <[email protected]> wrote:

> I think that, for many algorithms, triggering the gc() is simply a matter of
> running a simulation for enough iterations. By calling gc() ahead of time,
> you should be able to get the same number (n > 0) of gc calls, which isn't
> ignoring gc().
>
> That said, it can take some effort to figure out the number of iterations and
> the time needed to run the experiment.
>
> Cheers, Kevin
>
> On Monday, June 2, 2014, Stefan Karpinski <[email protected]> wrote:
>
> I feel that ignoring gc can be a bit of a cheat, since it does happen and it's
> quite expensive – and other systems may be better or worse at it. Of course,
> it can still be good to separate the causes of slowness explicitly into
> execution time and overhead for things like gc.
>
> On Mon, Jun 2, 2014 at 5:21 PM, Kevin Squire <[email protected]> wrote:
>
> Thanks, John. His argument definitely makes sense: algorithms that cause
> more garbage collection won't get penalized by the median unless, of course,
> they cause gc() to occur more than 50% of the time.
>
> Most benchmarks of Julia code that I've done (or seen) have made some attempt
> to take gc() differences out of the equation, usually by explicitly calling
> gc() before the timing begins. For most algorithms, that would mean that the
> same number of gc() calls should occur for each repetition, in which case I
> would think that any measure of central tendency (including the mean and the
> median) would be useful.
>
> Is there a problem with this reasoning?
>
> Cheers,
> Kevin
>
> On Mon, Jun 2, 2014 at 1:04 PM, John Myles White <[email protected]> wrote:
>
> For some reasons why one might not want to use the median, see
> http://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
>
> -- John
>
> On Jun 2, 2014, at 11:06 AM, Kevin Squire <[email protected]> wrote:
>
>> The median is probably also useful. I like it a little better in cases where
>> the code being tested triggers gc() more than half the time.
>>
>> On Monday, June 2, 2014, Steven G. Johnson <[email protected]> wrote:
>>
>> On Monday, June 2, 2014 1:01:25 AM UTC-4, Jameson wrote:
>>
>> Therefore, for benchmarks, you should execute your code in a loop enough
>> times that the measurement error (of the hardware and OS) is not too
>> significant.
>>
>> You can also often benchmark multiple times and take the minimum (not the
>> mean!) time for reasonable results with fairly small time intervals.
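
[Editorial sketch, not from the thread.] Pulling the suggestions above together, here is a minimal timing harness in current Julia, where the thread-era `gc()` is spelled `GC.gc()`. The workload `work` and the harness name `benchtimes` are hypothetical stand-ins for the code under test; the sketch just combines the collect-ahead-of-time, loop-enough-times, and report-minimum/median/mean ideas in one place:

    using Statistics  # mean and median live here in current Julia

    # Hypothetical workload standing in for the code under test.
    work(n) = sum(sqrt(i) for i in 1:n)

    function benchtimes(f, args...; reps = 100)
        f(args...)   # warm-up call so JIT compilation isn't included in the timings
        GC.gc()      # collect ahead of time, per Kevin's suggestion (`gc()` in 2014-era Julia)
        times = [@elapsed f(args...) for _ in 1:reps]
        # The minimum approximates the cost with the least hardware/OS noise,
        # while the mean is pulled upward by gc pauses and other tail effects —
        # closer to what users of the code actually experience.
        (minimum = minimum(times), median = median(times), mean = mean(times))
    end

    benchtimes(work, 10^6)

John's preference for instruction counts over wall time has to be measured outside the language; on Linux, for instance, `perf stat` reports the number of instructions a process retires, which sidesteps timer resolution and OS scheduling noise in a way that timings cannot.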

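[Editorial sketch, not from the thread.] A small hypothetical illustration of the median-vs-mean point: with an allocation-heavy workload, gc pauses land in some repetitions but not others, so the median can stay flat while the mean moves.

    using Statistics

    # Allocation-heavy stand-in: each call creates a fresh array of n Float64s,
    # so garbage collection fires on some repetitions but not others.
    allocating(n) = sum(abs.(randn(n)))

    allocating(10^6)  # warm-up
    ts = [@elapsed allocating(10^6) for _ in 1:200]
    println("median = ", median(ts), ", mean = ", mean(ts))
    # As long as gc pauses hit fewer than half the repetitions, the median
    # barely moves; the mean reflects them, which is John's case for the mean
    # when the goal is the experience of people running the code.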