Re: [Pixman] [PATCH 0/4] New fast paths and Raspberry Pi 1 benchmarking

2015-08-20 Thread Bill Spitzak
On Thu, Aug 20, 2015 at 6:58 AM, Pekka Paalanen ppaala...@gmail.com wrote:


 A thing that explains a great deal of these anomalies, but not all of it,
 has something to do with function addresses. There are hypotheses that it
 might have to do with the branch predictor and its cache. We made a test
 targeting exactly that idea: pick a fast path function that seems to be
 most susceptible to unexpected changes, pad it with x nops before the
 function start and N-x nops after the function end. We never execute those
 nops, but changing x changes the function start address while keeping
 everything else in the whole binary in the same place.

 The results were mind-boggling: depending on the function starting address,
 the src__ L1 test of lowlevel-blt-bench went either 355 Mpx/s or 470 Mpx/s.
 There does not seem to be any predictable pattern on which addresses are
 fast and which are slow. Obviously this will screw up our benchmarks,
 because a change in an unrelated function may cause another function's
 address to shift, and therefore change its performance. See [1] for the
 plot.

 [1] The plot of alignment vs. performance

 https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/plain/octave/figures/fig-src---L1.pdf
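
The padding experiment described in the quoted message can be modelled with
a few lines of arithmetic. This is only a sketch of the layout, not the
actual test harness: the region address, function size and nop count below
are hypothetical, and the 4-byte instruction size assumes ARM (not Thumb)
code.

```python
INSN_SIZE = 4  # bytes per instruction, assuming ARM-mode code (nops included)


def padded_layout(region_start, func_size, total_nops, x):
    """Return (func_start, func_end, region_end) with x nops placed before
    the function and total_nops - x nops placed after it."""
    if not 0 <= x <= total_nops:
        raise ValueError("x must be between 0 and total_nops")
    func_start = region_start + x * INSN_SIZE
    func_end = func_start + func_size
    region_end = func_end + (total_nops - x) * INSN_SIZE
    return func_start, func_end, region_end


# Hypothetical numbers: region at 0x10000, a 256-byte function, N = 16 nops.
layouts = [padded_layout(0x10000, 256, 16, x) for x in range(17)]
func_starts = [start for start, _, _ in layouts]
region_ends = {end for _, _, end in layouts}
```

Every value of x gives a different function start address, but the end of
the padded region is always the same, so nothing after it in the binary
moves.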


Could this be whether some bad instruction ends up next to or split by a
cache line boundary? That would produce a random-looking plot, though it
really is a plot of the location of the bad instructions in the measured
function.

If this really is a problem then the ideal fix is for the compiler to
insert NOP instructions in order to move the bad instructions away from the
locations that make them bad. Yikes.
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 0/4] New fast paths and Raspberry Pi 1 benchmarking

2015-08-20 Thread Ben Avison

On Thu, 20 Aug 2015 19:34:37 +0100, Bill Spitzak spit...@gmail.com wrote:


Could this be whether some bad instruction ends up next to or split
by a cache line boundary? That would produce a random-looking plot,
though it really is a plot of the location of the bad instructions in
the measured function.

If this really is a problem then the ideal fix is for the compiler to
insert NOP instructions in order to move the bad instructions away from
the locations that make them bad. Yikes.


Thought of that, tried it, and I'm still baffled by the results. Merely
keeping instructions at the same alignment relative to cache lines wasn't
enough to ensure reproducibility; that could only be achieved by ensuring
the same absolute address (which isn't an option with shared libraries in
the presence of ASLR).
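
To make the distinction concrete, here is a small sketch of "same alignment
to cache lines" versus "same absolute address". The 32-byte line size is an
assumption about the ARM11 L1 cache, and the two build addresses are
made-up values.

```python
CACHE_LINE = 32  # bytes; assumed L1 cache line size on the ARM11


def line_offset(addr, line=CACHE_LINE):
    """Offset of an address within its cache line."""
    return addr % line


def same_line_alignment(a, b, line=CACHE_LINE):
    """True if both addresses sit at the same offset within a cache line."""
    return line_offset(a, line) == line_offset(b, line)


# Hypothetical function start addresses from two builds (or two ASLR loads).
build_a = 0x12340
build_b = build_a + 3 * CACHE_LINE

aligned_alike = same_line_alignment(build_a, build_b)
same_address = build_a == build_b
```

Here the two builds share the same cache line offset but not the same
absolute address, which is the case that still failed to give reproducible
timings.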

My best theory at the moment is that the branch predictor in the ARM11
uses a hash of both the source and destination addresses of a branch to
choose which index to use in the predictor cache. Because it's a
direct-mapped cache, any collisions caused by a branch moving to a
different address can have major effects on very tight loops like src__.
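
A toy model of that theory, to show why the effect would look erratic. All
of it is assumption: the real ARM11 hash is not known here, so the index
function, the 128-entry table size, the 64-byte loop and the addresses are
hypothetical, and the two branches are simply executed in turn.

```python
ENTRIES = 128  # assumed number of entries in the direct-mapped predictor


def index(src, dst, entries=ENTRIES):
    """Hypothetical hash: sum of the two word addresses, low bits only."""
    return ((src >> 2) + (dst >> 2)) % entries


def loop_branch(base):
    """Backward branch closing a hypothetical 64-byte loop that starts at base."""
    return (base + 0x40, base)


def mispredicts(branches, iterations, entries=ENTRIES):
    """Execute the branches in turn; count lookups whose slot holds another
    branch (or nothing yet), then install the current branch in that slot."""
    table = [None] * entries
    misses = 0
    for _ in range(iterations):
        for branch in branches:
            slot = index(*branch, entries)
            if table[slot] != branch:
                misses += 1
                table[slot] = branch
    return misses


other = (0x8000, 0x8010)  # some unrelated branch that never moves
quiet = mispredicts([loop_branch(0x1000), other], 1000)
noisy = mispredicts([loop_branch(0x1000 + 58 * 4), other], 1000)
```

With these made-up numbers, moving the loop by 58 nops makes its branch
land in the same slot as the unrelated one, so the two keep evicting each
other and `noisy` is far larger than `quiet`. Most other offsets do not
collide at all.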

Ben