> On 26.11.2018 at 16:26, Ilya Leoshkevich <i...@linux.ibm.com> wrote:
> 
>> On 26.11.2018 at 16:07, Segher Boessenkool <seg...@kernel.crashing.org> wrote:
>> 
>>> # ppc64le-redhat-linux:
>>> 511.povray_r     -1.29%
>>> 482.sphinx3      -0.65%
>>> 456.hmmer        -0.53%
>>> 519.lbm_r        -0.51%
>>> # skip |dt| < 0.5%
>>> 549.fotonik3d_r  +1.13%
>>> 403.gcc          +1.76%
>>> 500.perlbench_r  +2.35%
>> 
>> 2% degradation on gcc and perlbench isn't really acceptable.  It is
>> certainly possible this is an uarch effect of indirect jumps and we are
>> just very unlucky now (and were lucky before), but this doesn't sound
>> good to me :-/
>> 
>> What did you run this on?  p8?
> 
> That was p9 (gcc135).
I've had a look at the gcc regression with perf.  I did a 5x re-run and
confirmed that the run time grew from 225.76s to 228.82s (+1.4%).

perf stat shows that the slow version consumes an additional ~11e9 cycles:

   856,588,095,385      cycles:u                  #    3.740 GHz                      (33.33%)
    36,451,588,171      stalled-cycles-frontend:u #    4.26% frontend cycles idle     (50.01%)
   438,654,175,652      stalled-cycles-backend:u  #   51.21% backend cycles idle      (16.68%)
   937,926,993,826      instructions:u            #    1.09  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.36%)
   205,289,212,856      branches:u                #  896.253 M/sec                    (50.02%)
     9,019,757,337      branch-misses:u           #    4.39% of all branches          (16.65%)

vs

   867,688,505,674      cycles:u                  #    3.731 GHz                      (33.34%)  Δ=11100410289 (+1.29%)
    36,672,094,462      stalled-cycles-frontend:u #    4.23% frontend cycles idle     (50.02%)  Δ=  220506291 (+0.60%)
   438,837,922,096      stalled-cycles-backend:u  #   50.58% backend cycles idle      (16.68%)  Δ=  183746444 (+0.04%)
   937,918,212,318      instructions:u            #    1.08  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.37%)
   205,201,306,341      branches:u                #  882.403 M/sec                    (50.02%)
     9,072,857,028      branch-misses:u           #    4.42% of all branches          (16.65%)  Δ=   53099691 (+0.58%)

It also shows that the slowdown cannot be explained by pipeline stalls,
additional instructions or branch misses: the deltas there are tiny
compared to the extra cycles.  (Branch misses could still be the culprit
if a single miss somehow translated to ~200 cycles on p9: 11.1e9 extra
cycles / 53.1e6 extra misses ≈ 210.)

perf diff -c wdiff:1,1 shows that there is just one function,
htab_traverse, that is significantly slower now (the second column is
the per-symbol cycle delta):

     2.98%  11768891764  exe  [.] htab_traverse
     1.91%    563949986  exe  [.] compute_dominance_frontiers_1

The extra cycles attributed to this function match the overall number of
additionally consumed cycles, and the contribution of the runner-up
(compute_dominance_frontiers_1) is 20 times smaller, so I think it really
is just this one function.

However, the generated assembly is completely identical in both cases!
I have seen similar situations in the past, so I tried adding a nop to
htab_traverse:

--- hashtab.c
+++ hashtab.c
@@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
      htab_trav callback;
      PTR info;
 {
+  __asm__ volatile("nop\n");
+
   PTR *slot = htab->entries;
   PTR *limit = slot + htab->size;

and did another 5x re-run.  The new measurements are 227.01s and 227.44s
(+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
looks like noise.

Can this be explained by some microarchitectural quirk after all?
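
For reference, here is roughly what the function in question looks like.
This is a minimal sketch reconstructed from the diff context above and
from my recollection of libiberty's hashtab.c of that era; I'm assuming
the SPEC copy matches the upstream macro and field names, the typedefs
are cut down to just what the loop needs, and the K&R declarators are
rewritten as an ANSI prototype:

  /* Reduced typedefs; the real hashtab.h has more fields and uses
     PARAMS-style declarations.  */
  typedef void *PTR;
  typedef int (*htab_trav) (void **slot, void *info);

  typedef struct htab {
    PTR *entries;        /* the slot array      */
    unsigned long size;  /* number of slots     */
  } *htab_t;

  #define EMPTY_ENTRY    ((PTR) 0)
  #define DELETED_ENTRY  ((PTR) 1)

  /* Call CALLBACK on every live slot; stop early if it returns 0.  */
  void
  htab_traverse (htab_t htab, htab_trav callback, PTR info)
  {
    PTR *slot = htab->entries;
    PTR *limit = slot + htab->size;

    /* The hot loop: per slot, one load, two compares and possibly an
       indirect call, plus the increment and bounds check.  The nop in
       the patch above shifts this loop by one instruction (4 bytes)
       in the text section without changing the loop body at all.  */
    do
      {
        PTR x = *slot;

        if (x != EMPTY_ENTRY && x != DELETED_ENTRY)
          if (!(*callback) (slot, info))
            break;
      }
    while (++slot < limit);
  }

Since the loop is this tight, my working theory is that the only thing
the nop can change is where the loop lands relative to some fetch or
branch-prediction boundary, which would fit the alignment-quirk
explanation rather than anything in the generated code itself.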