I meant to add that the micro-benchmark went through each loop 100,000,000 iterations for each run... for 20 runs to come up with those numbers.
--bg On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <[email protected]> wrote: > I performed a micro-benchmark on my dual epyc (zen 2) server and version 1 > wins for small values of n. > > 20 runs: > > n version min median mean max stddev noise% > ----------------------------------------------------------------------- > n=1 version1 2.440 2.440 2.450 2.550 0.024 4.5% > n=1 version2 4.260 4.280 4.277 4.290 0.007 0.7% > > n=2 version1 2.740 2.750 2.757 2.880 0.029 5.1% > n=2 version2 3.970 3.980 3.980 4.020 0.010 1.3% > > n=4 version1 4.580 4.595 4.649 4.910 0.094 7.2% > n=4 version2 5.780 5.815 5.809 5.820 0.013 0.7% > > But, micro-benchmarks always make me nervous, so I looked at the actual > instruction cost for my > platform given the version 1 and version 2 code. > > If we count cpu cycles using the AMD Zen 2 instruction latency/throughput > tables: version 1 (loop body) > has a critical path of ~5-6 cycles per iteration. version 2 (loop body) > has ~3-4 cycles per iteration. > > The problem for version 2 is that the call to memcpy is ~24-30 cycles due > to the stub + function call + return > and branch predictor pressure on first call. This probably results in > ~2.5 ns per iteration cost for version 2. > > So, no I wouldn't call it an optimization. But, it will be interesting to > hear other opinions on this. > > --bg > > > On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> wrote: > >> >> >> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]> >> escreveu: >> >>> I created an example that is a little bit closer to the actual code and >>> changed the compiler from C++ to C. >>> >>> It is interesting the optimization that the compiler has chosen for >>> version 1 versus version 2. One calls >>> memcpy and one doesn't. There is a good chance the inlining of memcpy >>> as SSE+scalar per iteration >>> will be faster for syscache scans-- which I believe are usually small >>> (1-4 keys?). >>> >> I doubt the inline version is better. >> Clang is supported too and the code generated is much better with memcpy >> one call outside of the loop. >> >> >>> >>> Probably the only reason to do this patch would be if N is normally >>> large or if this is considered an >>> improvement in code clarity without a detrimental impact on small N >>> syscache scans. >>> I realize you only said "possible small optimization". It might be >>> worthwhile to benchmark the code for >>> different values of n to determine if there is a tipping point either >>> way? >>> >> In your opinion, shouldn't this be considered an optimization, even a >> small one? >> >> best regards, >> Ranier Vilela >> >
