[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

--- Comment #3 from Martin Jambor  ---
This bug has therefore been replaced with more specific bugs for newer hardware: PR94373 and PR94375.

[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

Martin Jambor  changed:

           What       |Removed        |Added
 ---------------------------------------------------
           Status     |UNCONFIRMED    |RESOLVED
           Resolution |---            |MOVED

--- Comment #2 from Martin Jambor  ---
(In reply to Martin Jambor from comment #0)
> As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
> INTrate suite suffered a number of smaller regressions on AMD Zen
> CPUs:
> 
>   - At -O2, it is 4.5% slower than when compiled with GCC 7

I am about to file a specific bug about exchange2 at -O2.

>   - At -Ofast, it is 4.7% slower than when compiled with GCC 8

This is no longer true.

>   - At -Ofast -march=native -mtune=native, this difference is 6.9%

Again, I will file a more specific bug about -Ofast -march=native in a
little while.

>   - At -Ofast and native tuning, it is 6% slower with PGO than
> without it.

I can still see this in my measurements on a Zen1-based CPU but not in
those done on AMD Zen2 or Intel Cascade Lake, so I am not sure whether
we care.  I'll be happy to file a specific bug if we do.

[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

2019-04-15 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

--- Comment #1 from Martin Liška  ---
Created attachment 46169
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46169&action=edit
perf annotate - Ofast native vs. Ofast native PGO

I'm attaching the HTML and plain-text perf annotate output for the -Ofast
native and -Ofast native PGO builds. As seen, it's still the same story:
there is high register pressure that leads to spilling of some of the
induction variables.
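
To make the pressure concrete, here is a minimal sketch, not the actual
benchmark source: the names, array shape, and bounds are assumptions
(though a default-integer block(9,9,9) is consistent with the 0x144-byte
strides in the dumps below). Each level of the digits_2 nest keeps its
own induction variable plus strided address arithmetic live, and the
real routine repeats this pattern one level per digit, so the allocator
eventually runs out of registers:

! Hypothetical reduction of the digits_2 loop structure; the real
! routine nests this pattern much deeper, one level per digit.
program pressure_sketch
  implicit none
  integer :: block(9, 9, 9)  ! plane stride 9*9*4 = 324 = 0x144 bytes
  integer :: l(9), u(9)
  integer :: row, i3, i4

  block = 1
  l = 1
  u = 9
  row = 1

  do i3 = l(3), u(3)
     do i4 = l(4), u(4)
        ! each level keeps an induction variable (i3, i4, ...) live,
        ! plus a base pointer walking block in 0x144-byte steps
        if (block(row, 4, i4) <= 0) cycle
        block(row, 4:9, i3) = block(row, 4:9, i3) + 10
     end do
  end do

  print *, block(row, 4, 1)
end program pressure_sketch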

For these builds, the most significant difference is:

GOOD:

 :              if(block(row, 4, i4) <= 0) cycle
0.00 :   41c660:  mov    (%r9),%r12d
1.99 :   41c663:  mov    %r11d,0x80(%rsp)
0.11 :   41c66b:  mov    %r11d,%edx
0.02 :   41c66e:  test   %r12d,%r12d
0.15 :   41c671:  jg     41c7b0 <__brute_force_MOD_digits_2+0xe00>
0.01 :   41c677:  inc    %r11
0.64 :   41c67a:  add    $0x144,%r9
0.13 :   41c681:  add    $0x144,%r8
0.05 :   41c688:  add    $0x144,%r10
 :              do i4 = l(4), u(4)
0.15 :   41c68f:  cmp    %r11d,0x6c(%rsp)
2.39 :   41c694:  jge    41c660 <__brute_force_MOD_digits_2+0xcb0>
0.00 :   41c696:  mov    0x168(%rsp),%r10
0.55 :   41c69e:  mov    0x170(%rsp),%r9
0.08 :   41c6a6:  mov    0x178(%rsp),%r11
0.05 :   41c6ae:  mov    0x180(%rsp),%r8
 :              block(row, 4:9, i3) = block(row, 4:9, i3) + 10

BAD:

 :              if(block(row, 4, i4) <= 0) cycle
0.05 :   41a8b0:  mov    (%r11),%edi
0.78 :   41a8b3:  mov    %r10d,0x84(%rsp)
0.04 :   41a8bb:  mov    %r10d,%r13d
0.01 :   41a8be:  test   %edi,%edi
0.26 :   41a8c0:  jg     41aa10 <__brute_force_MOD_digits_2+0x1210>
0.44 :   41a8c6:  addq   $0x144,0x48(%rsp)
4.04 :   41a8cf:  addq   $0x144,0x58(%rsp)
1.31 :   41a8d8:  inc    %r10
0.02 :   41a8db:  add    $0x144,%r11
 :              do i4 = l(4), u(4)
0.01 :   41a8e2:  cmp    %r10d,0x88(%rsp)
0.25 :   41a8ea:  jge    41a8b0 <__brute_force_MOD_digits_2+0x10b0>
 :              block(row, 4:9, i3) = block(row, 4:9, i3) + 10
0.03 :   41a8ec:  mov    0xd0(%rsp),%r15
0.27 :   41a8f4:  addl   $0xa,-0xdc(%r15)
0.20 :   41a8fc:  addl   $0xa,-0xb8(%r15)
0.01 :   41a904:  addl   $0xa,-0x94(%r15)
0.07 :   41a90c:  addl   $0xa,-0x70(%r15)
0.05 :   41a911:  addl   $0xa,-0x4c(%r15)
0.06 :   41a916:  addl   $0xa,-0x28(%r15)
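
Comparing the two: in the GOOD build the strided base pointers stay in
registers and are bumped with plain add $0x144,%r8/%r9/%r10, while in
the BAD build two of them have been spilled to stack slots and are
updated with read-modify-write addq $0x144,...(%rsp) instructions; the
4.04% sample on one of those addq instructions shows the cost. The six
addl $0xa instructions are the scalarized array-section statement: the
displacements step by 0x24 (36) bytes, one element in the second
dimension of a default-integer block(9,9,9). A hedged sketch of that
expansion (names and shapes are assumptions consistent with the strides
above):

! Hypothetical scalarization of the section assignment; the compiler
! unrolls it into six scalar adds, 36 bytes (0x24) apart, like the
! addl sequence above.
program section_sketch
  implicit none
  integer :: block(9, 9, 9)
  integer :: row, i3, j
  block = 0
  row = 1
  i3 = 1
  ! block(row, 4:9, i3) = block(row, 4:9, i3) + 10 becomes:
  do j = 4, 9
     block(row, j, i3) = block(row, j, i3) + 10
  end do
  print *, block(row, 4:9, i3)
end program section_sketch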

The benchmark is quite unpredictable, so I'm leaving it at that for now.