> On 26.11.2018 at 16:26, Ilya Leoshkevich <i...@linux.ibm.com> wrote:
> 
>> On 26.11.2018 at 16:07, Segher Boessenkool <seg...@kernel.crashing.org> wrote:
>> 
>>> # ppc64le-redhat-linux:
>>> 511.povray_r     -1.29%
>>> 482.sphinx3      -0.65%
>>> 456.hmmer        -0.53%
>>> 519.lbm_r        -0.51%
>>> # skip |dt| < 0.5%
>>> 549.fotonik3d_r  +1.13%
>>> 403.gcc          +1.76%
>>> 500.perlbench_r  +2.35%
>> 
>> 2% degradation on gcc and perlbench isn't really acceptable.  It is
>> certainly possible this is an uarch effect of indirect jumps and we are
>> just very unlucky now (and were lucky before), but this doesn't sound
>> good to me :-/
>> 
>> What did you run this on?  p8?
> 
> That was p9 (gcc135).
I've had a look at the gcc regression with perf.  I did a 5x re-run and
confirmed that the run time grew from 225.76s to 228.82s (+1.4%).

perf stat shows that the slow version consumes an additional ~11e9 cycles:

   856,588,095,385      cycles:u                  #    3.740 GHz                      (33.33%)
    36,451,588,171      stalled-cycles-frontend:u #    4.26% frontend cycles idle     (50.01%)
   438,654,175,652      stalled-cycles-backend:u  #   51.21% backend cycles idle      (16.68%)
   937,926,993,826      instructions:u            #    1.09  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.36%)
   205,289,212,856      branches:u                #  896.253 M/sec                    (50.02%)
     9,019,757,337      branch-misses:u           #    4.39% of all branches          (16.65%)

vs

   867,688,505,674      cycles:u                  #    3.731 GHz                      (33.34%)  Δ=11100410289 (+1.29%)
    36,672,094,462      stalled-cycles-frontend:u #    4.23% frontend cycles idle     (50.02%)  Δ=  220506291 (+0.60%)
   438,837,922,096      stalled-cycles-backend:u  #   50.58% backend cycles idle      (16.68%)  Δ=  183746444 (+0.04%)
   937,918,212,318      instructions:u            #    1.08  insn per cycle
                                                  #    0.47  stalled cycles per insn  (33.37%)
   205,201,306,341      branches:u                #  882.403 M/sec                    (50.02%)
     9,072,857,028      branch-misses:u           #    4.42% of all branches          (16.65%)  Δ=   53099691 (+0.58%)

It also shows that the slowdown cannot be explained by pipeline stalls,
additional instructions or branch misses: the deltas there are tiny
compared to the extra cycles.  (Branch misses could still be the culprit
if a single miss somehow translated to ~200 cycles on p9: 11.1e9 extra
cycles / 53.1e6 extra misses ≈ 210.)

perf diff -c wdiff:1,1 shows that there is just one function,
htab_traverse, that is significantly slower now (the second column is
the per-symbol cycle delta):

     2.98%  11768891764  exe  [.] htab_traverse
     1.91%    563949986  exe  [.] compute_dominance_frontiers_1

The extra cycles attributed to this function match the overall number of
additionally consumed cycles, and the contribution of the runner-up
(compute_dominance_frontiers_1) is 20 times smaller, so I think it really
is just this one function.

However, the generated assembly is completely identical in both cases!
I have seen similar situations in the past, so I tried adding a nop to
htab_traverse:

--- hashtab.c
+++ hashtab.c
@@ -529,6 +529,8 @@ htab_traverse (htab, callback, info)
      htab_trav callback;
      PTR info;
 {
+  __asm__ volatile("nop\n");
+
   PTR *slot = htab->entries;
   PTR *limit = slot + htab->size;

and did another 5x re-run.  The new measurements are 227.01s and 227.44s
(+0.19%).  With two nops I get 227.25s and 227.29s (+0.02%), which also
looks like noise.

Can this be explained by some microarchitectural quirk after all?
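
For reference, here is roughly what the function in question looks like.
This is a minimal sketch reconstructed from the diff context above and
from my recollection of libiberty's hashtab.c of that era; I'm assuming
the SPEC copy matches the upstream macro and field names, the typedefs
are cut down to just what the loop needs, and the K&R declarators are
rewritten as an ANSI prototype:

  /* Reduced typedefs; the real hashtab.h has more fields and uses
     PARAMS-style declarations.  */
  typedef void *PTR;
  typedef int (*htab_trav) (void **slot, void *info);

  typedef struct htab {
    PTR *entries;        /* the slot array      */
    unsigned long size;  /* number of slots     */
  } *htab_t;

  #define EMPTY_ENTRY    ((PTR) 0)
  #define DELETED_ENTRY  ((PTR) 1)

  /* Call CALLBACK on every live slot; stop early if it returns 0.  */
  void
  htab_traverse (htab_t htab, htab_trav callback, PTR info)
  {
    PTR *slot = htab->entries;
    PTR *limit = slot + htab->size;

    /* The hot loop: per slot, one load, two compares and possibly an
       indirect call, plus the increment and bounds check.  The nop in
       the patch above shifts this loop by one instruction (4 bytes)
       in the text section without changing the loop body at all.  */
    do
      {
        PTR x = *slot;

        if (x != EMPTY_ENTRY && x != DELETED_ENTRY)
          if (!(*callback) (slot, info))
            break;
      }
    while (++slot < limit);
  }

Since the loop is this tight, my working theory is that the only thing
the nop can change is where the loop lands relative to some fetch or
branch-prediction boundary, which would fit the alignment-quirk
explanation rather than anything in the generated code itself.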