On Mon, Jan 16, 2017 at 2:45 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>> -----Original Message-----
>> From: Bill Fischofer [mailto:bill.fischo...@linaro.org]
>> Sent: Sunday, January 15, 2017 5:09 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com>
>> Cc: lng-odp@lists.linaro.org
>> Subject: Re: [lng-odp] [API-NEXT PATCHv6 2/5] linux-generic: packet:
>> implement reference apis
>>
>> I've been playing around with odp_bench_packet and there are two
>> problems with using this test:
>>
>> 1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
>> times, which is good; however, it only reports the average cycle time
>> for these runs. There is considerable variability when running in a
>> non-isolated environment, so what we really want to report is not
>> just the average times but also the minimum and maximum times
>> encountered over those runs. From a microbenchmarking standpoint, the
>> minimum times are of most interest since those are the "pure"
>> measures of the functions being tested, with the least amount of
>> interference from other sources. Statistically speaking, with a large
>> TEST_SIZE_RUN_COUNT the minimums should get close to the actual
>> isolated time for each of the tests without requiring the system to
>> be fully isolated.
>
> Performance should be tested in isolation; at a minimum you need
> enough CPUs so that the OS does not use the CPU under test for
> anything else. Also, the minimum is not a good measure, since it
> reports only the (potentially rare) case where all cache and TLB
> accesses hit. Cache hit rate affects performance a lot; e.g. an L1 hit
> rate of 98% may give you tens of percent worse performance than a
> perfect 100% hit rate. So, if we print out only one figure, an average
> over many runs in isolation is the best we can (easily) get.
Ideally yes, but that's not always possible or even necessary for basic
tuning. Moreover, this is a microbenchmark, which means it's really
designed to identify small deltas in the pathlengths of individual API
implementations rather than to assess overall performance at an
application level. We're not restricted to printing only one figure,
and printing just an average is inadequate. Recording and reporting the
min, max, and avg values gives a fuller picture of what's going on. The
mins show ideal performance and can be used to measure small pathlength
differences, while the max and avg give insight into how "isolated" the
system really is. You may think you're running in an isolated
environment, but without these additional data you really don't know
that. On a truly isolated system the max values should be consistent
and not orders of magnitude larger than the mins. Only then can the
averages be compared fairly.

>
>> 2. Within each test, the function being tested is performed
>> TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The
>> reason this is bad is that longer run times give more opportunity
>> for "cut aways" and other interruptions that distort the measurement
>> being made.
>
> The repeat count could be even higher, but it would require more
> packets in the pool. We used 1000 as a compromise. It needs to be
> large enough to hide one-time initialization and CPU cycle measurement
> overheads, which may add cycles and measurement variation to these
> simple test cases. E.g. suppose it takes 100 cycles to read a CPU
> cycle counter. A 10-cycle operation must be repeated many times to
> hide that cost: 100 cycles / (1000 * 10 cycles) => 1% measurement
> overhead per operation.

That's what the bench_empty test does: it shows the fixed overhead
imposed by the benchmark framework on each test. Those values can then
be subtracted from the individual tests to give a closer approximation
of the relative impact of proposed changes.
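To make the suggestion concrete, here's a rough sketch of the sort of
per-test statistics recording I have in mind. This is not the actual
odp_bench_packet code; the struct and function names are hypothetical,
and fake_cycles() merely stands in for a real cycle-counter read such
as odp_cpu_cycles():

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a real cycle-counter read; it just
 * advances a counter so the sketch is self-contained. */
static uint64_t fake_cycles(void)
{
	static uint64_t t;

	t += 100; /* pretend each call observes 100 more "cycles" */
	return t;
}

/* Per-test statistics: record every run, not just a running sum. */
typedef struct {
	uint64_t min;
	uint64_t max;
	uint64_t sum;
	uint64_t runs;
} bench_stats_t;

static void stats_init(bench_stats_t *s)
{
	s->min = UINT64_MAX;
	s->max = 0;
	s->sum = 0;
	s->runs = 0;
}

static void stats_record(bench_stats_t *s, uint64_t cycles)
{
	if (cycles < s->min)
		s->min = cycles;
	if (cycles > s->max)
		s->max = cycles;
	s->sum += cycles;
	s->runs++;
}

static uint64_t stats_avg(const bench_stats_t *s)
{
	return s->runs ? s->sum / s->runs : 0;
}

/* Usage: wrap each timed run and report all three figures. */
static void run_one_test(bench_stats_t *s)
{
	uint64_t c1 = fake_cycles();
	/* ... invoke the API under test TEST_REPEAT_COUNT times ... */
	uint64_t c2 = fake_cycles();

	stats_record(s, c2 - c1);
}
```

With this in place, the report line becomes "min/avg/max" per test: the
min approximates the pure pathlength while a max far above the min
flags interference from the rest of the system.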
The goal here isn't exact cycle counts but rather a good and consistent
approximation that can be used to assess the impact of proposed
changes, which is what we're doing here. The problem with repeats is
that the longer the test runs, the more likely the isolation assumption
becomes invalid. When that happens, as reported here, the variability
obscures anything else you're trying to measure. Ideally I'd like to
see these be command-line arguments to the test rather than #defines,
so that different measurements can be made in different environments.

> -Petri
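P.S. A minimal sketch of the command-line idea, using POSIX getopt().
The option letters, variable names, and defaults below are my own
placeholders, not the current odp_bench_packet code:

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder defaults standing in for the current #defines. */
static int test_size_run_count = 1000; /* TEST_SIZE_RUN_COUNT */
static int test_repeat_count = 1000;   /* TEST_REPEAT_COUNT */

/* Parse -r (runs per test size) and -c (repeats per run) so the
 * counts can be tuned per environment without recompiling. */
static void parse_args(int argc, char *argv[])
{
	int opt;

	optind = 1; /* reset in case getopt() was used earlier */

	while ((opt = getopt(argc, argv, "r:c:")) != -1) {
		switch (opt) {
		case 'r':
			test_size_run_count = atoi(optarg);
			break;
		case 'c':
			test_repeat_count = atoi(optarg);
			break;
		default:
			fprintf(stderr, "usage: %s [-r runs] [-c repeats]\n",
				argv[0]);
			exit(EXIT_FAILURE);
		}
	}
}
```

That way a quick desktop run can use small counts while a dedicated,
isolated rig can crank them up.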