IMO, unless your distribution has a very irregular shape (more than a few modes) and/or rare events (frequency < 1e-3), you can get estimates of all the quantiles between 0.05 and 0.95 that are good enough for most practical purposes from 500-2000 samples, using either simple nonparametric or semiparametric statistical methods.
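A minimal sketch of such a nonparametric estimate, in present-day Julia syntax (the Statistics stdlib postdates this thread, and "workload" is a made-up placeholder for the code being measured):

    using Statistics

    # Placeholder for the code being measured.
    workload() = sum(rand(1_000))

    # Collect ~1000 wall-clock samples.
    samples = [@elapsed(workload()) for _ in 1:1_000]

    # Simple empirical quantile estimates over the practical range.
    quantile(samples, [0.05, 0.25, 0.50, 0.75, 0.95])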
That said, if there is large variance or multiple modes, and I had the resources (time/expertise), my most important goal would not be refining my estimate of the distribution, but figuring out what's going on (which, admittedly, can be a pain).

I have only skimmed through the profiler functions in Julia (still a newbie, learning the language), but I noticed that https://github.com/timholy/ProfileView.jl supports gathering some information about GC.

Best,

Tamas

On Thu, Jun 05 2014, John Myles White <[email protected]> wrote:

> Ignoring pipelining and modern CPU trickery, the CPU instruction count is a
> potentially low-variance lower bound on the speed of an operation. The lower
> bound property is all I meant by purity. In addition, because instruction
> counts are often lower variance than other metrics, they're potentially
> useful for comparing two proposed implementations/algorithms. Other metrics
> like mean wall clock time may have such high variance that you can't collect
> enough measurements to conclude that a change had any impact.
>
> Personally, I think you can never do better than an estimate of the full
> distribution of wall clock timings. But you almost never collect enough data
> to even come close to having a proper estimate of this distribution. In that
> regard, I totally agree that there is no perfect measurement. But you do
> need to make decisions. Lower bounds from CPU instructions are nice for
> doing that, although you want to make sure your wall clock timings aren't
> doing something nuts to create the change in CPU instructions.
>
> -- John
>
> On Jun 4, 2014, at 9:36 AM, Tamas Papp <[email protected]> wrote:
>
>> Sorry if I missed something in this discussion, but why are you assuming
>> that the CPU instructions per se are the bottleneck, instead of, say,
>> memory access, and especially how your algorithm interacts with the
>> various levels of CPU cache? Heavily optimized numerical algorithms make
>> a lot of effort to optimize the latter, since fetching something from
>> memory costs on the order of ~100 CPU instructions.
>>
>> I am not a computer scientist, but I am not sure that such a
>> theoretically pure measurement exists. If there is a lot of variation in
>> execution times _with the same input_, I guess it would be worthwhile
>> to understand the cause. GC would be a prime suspect in dynamic
>> languages, but I don't yet know Julia well enough to see whether this
>> generalizes to your case.
>>
>> Best,
>>
>> Tamas
>>
>> On Wed, Jun 04 2014, Joshua Job <[email protected]> wrote:
>>
>>> I want a theoretically pure measurement for an algorithm. I'm not sure
>>> how I would go about counting CPU instructions, however, since the
>>> algorithm is only a heuristic, and my only guide to how long it takes on
>>> a given set of instances is to run it and see how long it took. Is there
>>> some way of getting a count of CPU instructions due to the execution of
>>> a particular function in Julia?
>>>
>>> I want to estimate how long a user would have to run the algorithm in
>>> order to achieve 99% confidence that it will have succeeded at least
>>> once in finding the true solution to the problem(s). The obvious way to
>>> do that is to run the algorithm many times (say, 10^4) and take the 99th
>>> percentile of that distribution and call it T. If one ran the algorithm
>>> for time T, then we know empirically that 99 times out of 100 it will
>>> find a solution in a time less than or equal to T.
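A sketch of the empirical 99th-percentile procedure described above, in present-day Julia syntax ("solve" is a hypothetical stand-in for the heuristic, and the Statistics stdlib postdates this thread):

    using Statistics

    # Hypothetical heuristic solver; replace with the real algorithm,
    # run until it finds the true solution.
    solve() = sleep(0.001)

    # Time many independent runs (10^4, as suggested above).
    times = [@elapsed(solve()) for _ in 1:10_000]

    # T: empirically, 99 runs out of 100 finish within this time.
    T = quantile(times, 0.99)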
>>> I could use means or medians of course, but those are much less
>>> sensitive to the tails of the distribution, and I'm really interested in
>>> how long it takes to achieve confidence in the results, not just the
>>> average time to solution. Hence the desire to time the runs of the
>>> algorithm.
>>>
>>> In any case, I may have underestimated the runtime of the algorithm in
>>> question: it appears that for interesting problem sizes the C code for
>>> it takes on the order of a millisecond or longer, as opposed to tens of
>>> microseconds. I suspect then that the @time macro or tic/toc should be
>>> accurate enough for that purpose.
>>>
>>> On Tuesday, June 3, 2014 7:36:12 PM UTC-7, John Myles White wrote:
>>>>
>>>> I'm not sure there's any single correct way to do benchmarks without
>>>> information about what you're trying to optimize.
>>>>
>>>> If you're trying to optimize the experience of people using your code,
>>>> I think it's important to use means rather than medians, because you
>>>> want a metric that's affected by the entire shape of the distribution
>>>> of times and not entirely determined by the "center" of that
>>>> distribution.
>>>>
>>>> If you want a theoretically pure measurement for an algorithm, I think
>>>> measuring time is kind of problematic. For algorithms, I'd prefer
>>>> seeing a count of CPU instructions.
>>>>
>>>> -- John
>>>>
>>>> On Jun 2, 2014, at 7:32 PM, Kevin Squire <[email protected]> wrote:
>>>>
>>>> I think that, for many algorithms, triggering the gc() is simply a
>>>> matter of running a simulation for enough iterations. By calling gc()
>>>> ahead of time, you should be able to get the same number (n > 0) of gc
>>>> calls, which isn't ignoring gc().
>>>>
>>>> That said, it can take some effort to figure out the number of
>>>> iterations and the time needed to run the experiment.
>>>>
>>>> Cheers, Kevin
>>>>
>>>> On Monday, June 2, 2014, Stefan Karpinski <[email protected]> wrote:
>>>>
>>>>> I feel that ignoring gc can be a bit of a cheat, since it does happen
>>>>> and it's quite expensive, and other systems may be better or worse at
>>>>> it. Of course, it can still be good to separate the causes of slowness
>>>>> explicitly into execution time and overhead for things like gc.
>>>>>
>>>>> On Mon, Jun 2, 2014 at 5:21 PM, Kevin Squire <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks, John. His argument definitely makes sense (that algorithms
>>>>>> that cause more garbage collection won't get penalized by the median,
>>>>>> unless, of course, they cause gc() to occur more than 50% of the
>>>>>> time).
>>>>>>
>>>>>> Most benchmarks of Julia code that I've done (or seen) have made some
>>>>>> attempt to take gc() differences out of the equation, usually by
>>>>>> explicitly calling gc() before the timing begins. For most
>>>>>> algorithms, that would mean that the same number of gc() calls should
>>>>>> occur for each repetition, in which case I would think that any
>>>>>> measure of central tendency (including the mean and the median) would
>>>>>> be useful.
>>>>>>
>>>>>> Is there a problem with this reasoning?
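Kevin's gc()-before-timing idea, sketched in present-day Julia, where the gc() of this thread is now spelled GC.gc() ("f" here is a placeholder for the code under test):

    # Force a full collection before each repetition, so that every run
    # starts from a freshly collected heap and gc pauses inside the timed
    # region reflect only the garbage that run itself creates.
    function timed_runs(f, n)
        map(1:n) do _
            GC.gc()
            @elapsed f()
        end
    end

    times = timed_runs(() -> sum(rand(10_000)), 100)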
>>>>>> Cheers,
>>>>>> Kevin
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 1:04 PM, John Myles White
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> For some reasons why one might not want to use the median, see
>>>>>>> http://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>> On Jun 2, 2014, at 11:06 AM, Kevin Squire <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> The median is probably also useful. I like it a little better in
>>>>>>> cases where the code being tested triggers gc() more than half the
>>>>>>> time.
>>>>>>>
>>>>>>> On Monday, June 2, 2014, Steven G. Johnson <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Monday, June 2, 2014 1:01:25 AM UTC-4, Jameson wrote:
>>>>>>>>>
>>>>>>>>> Therefore, for benchmarks, you should execute your code in a loop
>>>>>>>>> enough times that the measurement error (of the hardware and OS)
>>>>>>>>> is not too significant.
>>>>>>>>>
>>>>>>>> You can also often benchmark multiple times and take the minimum
>>>>>>>> (not the mean!) time, for reasonable results even with fairly small
>>>>>>>> time intervals.
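Steven's minimum-of-many-runs idiom, sketched in present-day Julia ("workload" is a made-up placeholder; BenchmarkTools.jl later adopted the minimum as a robust estimator of the true run time for much the same reason):

    # OS jitter, cache misses, and gc pauses only ever add time, so the
    # fastest of many repetitions is the least contaminated estimate.
    workload() = sum(rand(1_000))
    best = minimum(@elapsed(workload()) for _ in 1:1_000)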
