On Mon, Jan 16, 2017 at 2:45 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>
>> -----Original Message-----
>> From: Bill Fischofer [mailto:bill.fischo...@linaro.org]
>> Sent: Sunday, January 15, 2017 5:09 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-
>> labs.com>
>> Cc: lng-odp@lists.linaro.org
>> Subject: Re: [lng-odp] [API-NEXT PATCHv6 2/5] linux-generic: packet:
>> implement reference apis
>>
>> I've been playing around with odp_bench_packet and there are two
>> problems with using this test:
>>
>> 1. The test performs multiple runs of each test, TEST_SIZE_RUN_COUNT
>> times, which is good; however, it only reports the average cycle time
>> for these runs. There is considerable variability when running in a
>> non-isolated environment, so what we really want to report is not just
>> the average times but also the minimum and maximum times encountered
>> over those runs. From a microbenchmarking standpoint, the minimum
>> times are of the most interest, since those are the "pure"
>> measures of the functions being tested, with the least amount of
>> interference from other sources. Statistically speaking, with a large
>> TEST_SIZE_RUN_COUNT the minimums should get close to the actual
>> isolated time for each of the tests without requiring that the system
>> be fully isolated.
>
> Performance should be tested in isolation; at the very least you need enough
> CPUs so that the OS does not use the CPU we are testing for anything else.
> Also, the minimum is not a good measure, since it reports only the (potentially
> rare) case when all cache and TLB accesses hit. Cache hit rate affects
> performance a lot; e.g., an L1 hit rate of 98% may give you tens of percent
> worse performance than a perfect 100% hit rate. So, if we print out only
> one figure, an average over many runs in isolation is the best we can get
> (easily).

Ideally, yes; however, that's not always possible or even necessary for
basic tuning. Moreover, this is a microbenchmark, which means it's
really designed to identify small deltas in pathlength in individual
API implementations rather than to assess overall performance at an
application level. We're not limited to printing one figure, and just
printing an average is inadequate. Recording and reporting the min,
max, and avg values gives a fuller picture of what's going on. The
mins show ideal performance and can be used to measure small
pathlength differences, while the max and avg give insight into how
"isolated" the system really is. You may think you're running in an
isolated environment, but without these additional data you really
don't know that. On a truly isolated system the max values should be
consistent and not orders of magnitude larger than the mins. Only then
can the averages be compared fairly.
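
For what it's worth, here's a rough sketch of the kind of recording I
have in mind. This is not the actual odp_bench_packet code: run_test()
and TEST_SIZE_RUN_COUNT are placeholders for one batch of the API under
test and the existing run count, and the umbrella header name may vary
by ODP version.

    /* Sketch only: track min/max alongside the running sum so all
     * three figures can be reported. run_test() is a placeholder for
     * one TEST_REPEAT_COUNT batch of the API being measured. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <odp_api.h>

    static void run_and_report(void)
    {
            uint64_t sum = 0, min = UINT64_MAX, max = 0;
            int i;

            for (i = 0; i < TEST_SIZE_RUN_COUNT; i++) {
                    uint64_t c1 = odp_cpu_cycles();

                    run_test();  /* placeholder: repeated API calls */

                    uint64_t diff = odp_cpu_cycles_diff(odp_cpu_cycles(),
                                                        c1);

                    sum += diff;
                    if (diff < min)
                            min = diff;
                    if (diff > max)
                            max = diff;
            }

            printf("avg %" PRIu64 " min %" PRIu64 " max %" PRIu64 " cycles\n",
                   sum / TEST_SIZE_RUN_COUNT, min, max);
    }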

>
>
>>
>> 2. Within each test, the function being tested is performed
>> TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The reason
>> it is bad is that longer run times give more opportunity for "cut
>> aways" and other interruptions that distort the measurement being
>> made.
>
>
> The repeat count could be even higher, but that would require more packets in
> the pool. We used 1000 as a compromise. The count needs to be large enough to
> hide the one-time initialization and CPU cycle measurement overheads, which may
> add cycles and measurement variation for these simple test cases. For example,
> if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must
> be repeated many times to hide that: 100 cycles / (1000 * 10 cycles) => 1%
> measurement overhead per operation.

That's what the bench_empty test does: it shows the fixed overhead
imposed by the benchmark framework on each test. Those values can
be subtracted from the individual tests to give a closer approximation
of the cost of the API calls themselves. The goal here isn't exact
cycle counts but rather a good and consistent approximation that can
be used to assess the impact of proposed changes, which is what we're
doing here. The problem with a large repeat count is that the longer
the test runs, the more likely it is that the isolation assumption
becomes invalid. When that happens, as reported here, the variability
obscures anything else you're trying to measure.
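
Concretely, something like this is what I mean by subtracting the
bench_empty overhead (variable names invented for illustration, not
taken from the current code):

    /* Illustration only: approximate the per-operation cost of the API
     * under test by subtracting the framework overhead measured by
     * bench_empty, then dividing by the repeat count. */
    uint64_t empty_avg = empty_sum / TEST_SIZE_RUN_COUNT;
    uint64_t test_avg  = test_sum  / TEST_SIZE_RUN_COUNT;
    double cycles_per_op = (double)(test_avg - empty_avg) /
                           TEST_REPEAT_COUNT;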

Ideally, I'd like to see these be command-line arguments to the test
rather than #defines, so that different measurements can be made in
different environments.
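
Something along these lines (option letters invented here, not a
patch), falling back to the current #define values when the options are
not given:

    /* Sketch: let the counts be overridden on the command line. */
    #include <stdlib.h>
    #include <unistd.h>

    static int run_count    = TEST_SIZE_RUN_COUNT;
    static int repeat_count = TEST_REPEAT_COUNT;

    static void parse_args(int argc, char *argv[])
    {
            int opt;

            while ((opt = getopt(argc, argv, "r:n:")) != -1) {
                    switch (opt) {
                    case 'r':       /* number of runs per test */
                            run_count = atoi(optarg);
                            break;
                    case 'n':       /* repeats within each run */
                            repeat_count = atoi(optarg);
                            break;
                    default:
                            break;
                    }
            }
    }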

>
> -Petri
>
>
>
