[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |92716

--- Comment #7 from Andrew Pinski  ---
I am 99% sure this is basically PR 92716.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716
[Bug 92716] -Os doesn't inline byteswap function even though it's a single
instruction

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread dave.rodgman at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #6 from Dave Rodgman  ---
Under clang, whether or not mbedtls_xor is inlined causes an equivalent perf
difference. Note that mbedtls_xor is inlined in the gcc -O2 build and not in
the gcc -Os build.

Not inline mbedtls_xor, -Os clang:
  AES-XTS-128  : 834549 KiB/s,  0 cycles/byte
  AES-XTS-256  : 674383 KiB/s,  0 cycles/byte

Inline mbedtls_xor, -Os clang:
  AES-XTS-128  : 2664799 KiB/s,  0 cycles/byte
  AES-XTS-256  : 2278008 KiB/s,  0 cycles/byte


However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.

So it looks like gcc is either overly averse to inlining this function, or is
getting the cost/benefit of inlining wrong here.

For 3/5 cases, we know at compile time that n == 16, so the function will
compile to four instructions:

139c:   3dc00021    ldr q1, [x1]
13a0:   3dc00040    ldr q0, [x2]
13a4:   6e211c00    eor v0.16b, v0.16b, v1.16b
13a8:   3d80        str q0, [x0]

so it does seem surprising that gcc doesn't want to inline this.

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread dave.rodgman at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #5 from Dave Rodgman  ---
(In reply to Richard Biener from comment #3)
> Note you shouldn't use -Os if you care about performance.  GCC is quite
> reasonable with code size increases at -O2 (as compared to other compilers).
> Instead I suggest you use -flto with -O2 to decrease the size of the final
> executable/library and give GCC better knowledge on unit growth.

Understood, but I think it depends on the magnitude of the perf difference. I'd
expect a smallish perf drop, say 10%, from -Os to be reasonable, but I'd
consider a 3x perf difference to be a compiler issue.

(In reply to Alexander Monakov from comment #2)
> So basically missed inlining at -Os, even memcpy wrappers are not inlined.
> 
> Can you provide a reproducible testcase?
> 
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)
> 
> Does making them 'static inline' rectify the problem?

The easiest way to reproduce is to use the benchmark tool:

make programs/test/benchmark CC=gcc CFLAGS="-Os"
programs/test/benchmark aes_xts

I don't have a compact reproducer, sorry.

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread dave.rodgman at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Dave Rodgman  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|missed-optimization         |
          Component|ipa                         |other
             Target|aarch64                     |

--- Comment #4 from Dave Rodgman  ---
From a quick test, it doesn't look like the unaligned access inlining is the
issue:

Not static inline, -Os:
  AES-XTS-128  : 853799 KiB/s,  0 cycles/byte
  AES-XTS-256  : 749919 KiB/s,  0 cycles/byte

Static inline, -Os:
  AES-XTS-128  : 885380 KiB/s,  0 cycles/byte
  AES-XTS-256  : 752995 KiB/s,  0 cycles/byte

Not static inline, -O2:
  AES-XTS-128  : 2822656 KiB/s,  0 cycles/byte
  AES-XTS-256  : 2425721 KiB/s,  0 cycles/byte

Static inline, -O2:
  AES-XTS-128  : 2692321 KiB/s,  0 cycles/byte
  AES-XTS-256  : 2446391 KiB/s,  0 cycles/byte

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
So basically missed inlining at -Os, even memcpy wrappers are not inlined.

Can you provide a reproducible testcase?

Note that inline functions in mbedtls/library/alignment.h all miss the 'static'
qualifier, which affects inlining decisions, and looks like a mistake anyway
(if they are really meant to be non-static inlines, shouldn't there be a
comment?)

Does making them 'static inline' rectify the problem?

[Bug other/110946] 3x perf regression with -Os on M1 Pro

2023-08-08 Thread dave.rodgman at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #1 from Dave Rodgman  ---
Disassembly under -Os:

139c :
139c:   a9b67bfd    stp x29, x30, [sp, #-160]!
13a0:   910003fd    mov x29, sp
13a4:   a9046bf9    stp x25, x26, [sp, #64]
13a8:   aa0003f9    mov x25, x0
13ac:   9000        adrp x0, 0 <__stack_chk_guard>
13b0:   a90153f3    stp x19, x20, [sp, #16]
13b4:   f940        ldr x0, [x0]
13b8:   a9025bf5    stp x21, x22, [sp, #32]
13bc:   2a0103f6    mov w22, w1
13c0:   a90363f7    stp x23, x24, [sp, #48]
13c4:   a90573fb    stp x27, x28, [sp, #80]
13c8:   f941        ldr x1, [x0]
13cc:   f9004fe1    str x1, [sp, #152]
13d0:   d281        mov x1, #0x0  // #0
13d4:   710006df    cmp w22, #0x1
13d8:   54000c28    b.hi 155c  // b.pmore
13dc:   d1004041    sub x1, x2, #0x10
13e0:   aa0203f3    mov x19, x2
13e4:   b27c4fe0    mov x0, #0xf0  // #16777200
13e8:   eb3f        cmp x1, x0
13ec:   54000bc8    b.hi 1564  // b.pmore
13f0:   9101a3f5    add x21, sp, #0x68
13f4:   aa0303e2    mov x2, x3
13f8:   aa0403f8    mov x24, x4
13fc:   aa0503f7    mov x23, x5
1400:   aa1503e3    mov x3, x21
1404:   91048320    add x0, x25, #0x120
1408:   52800021    mov w1, #0x1  // #1
140c:   9400        bl 1210
1410:   2a0003f4    mov w20, w0
1414:   35000540    cbnz w0, 14bc
1418:   520002db    eor w27, w22, #0x1
141c:   d344fe7a    lsr x26, x19, #4
1420:   1200037b    and w27, w27, #0x1
1424:   92400e73    and x19, x19, #0xf
1428:   910223fc    add x28, sp, #0x88
142c:   d100075a    sub x26, x26, #0x1
1430:   b100075f    cmn x26, #0x1
1434:   54000541    b.ne 14dc  // b.any
1438:   b4000433    cbz x19, 14bc
143c:   710002df    cmp w22, #0x0
1440:   d10042fb    sub x27, x23, #0x10
1444:   9101e3fa    add x26, sp, #0x78
1448:   aa1303e2    mov x2, x19
144c:   9a95035a    csel x26, x26, x21, eq  // eq = none
1450:   aa1b03e1    mov x1, x27
1454:   910223f5    add x21, sp, #0x88
1458:   aa1703e0    mov x0, x23
145c:   9400        bl 0
1460:   d2800217    mov x23, #0x10  // #16
1464:   aa1303e3    mov x3, x19
1468:   aa1a03e2    mov x2, x26
146c:   aa1803e1    mov x1, x24
1470:   aa1503e0    mov x0, x21
1474:   9400        bl 0
1478:   cb1302e3    sub x3, x23, x19
147c:   8b130342    add x2, x26, x19
1480:   8b130361    add x1, x27, x19
1484:   8b1302a0    add x0, x21, x19
1488:   9400        bl 0
148c:   aa1503e3    mov x3, x21
1490:   aa1503e2    mov x2, x21
1494:   2a1603e1    mov w1, w22
1498:   aa1903e0    mov x0, x25
149c:   9400        bl 1210
14a0:   2a0003f4    mov w20, w0
14a4:   35c0        cbnz w0, 14bc
14a8:   aa1703e3    mov x3, x23
14ac:   aa1a03e2    mov x2, x26
14b0:   aa1503e1    mov x1, x21
14b4:   aa1b03e0    mov x0, x27
14b8:   9400        bl 0
14bc:   9000        adrp x0, 0 <__stack_chk_guard>
14c0:   f940        ldr x0, [x0]
14c4:   f9404fe2    ldr x2, [sp, #152]
14c8:   f941        ldr x1, [x0]
14cc:   eb010042    subs x2, x2, x1
14d0:   d281        mov x1, #0x0  // #0
14d4:   54000500    b.eq 1574  // b.none
14d8:   9400        bl 0 <__stack_chk_fail>
14dc:   f100027f    cmp x19, #0x0
14e0:   1a9f07e0    cset w0, ne  // ne = any
14e4:   6a1b001f    tst w0, w27
14e8:   54e0        b.eq 1504  // b.none
14ec:   b5da        cbnz x26, 1504
14f0:   a94687e0    ldp x0, x1, [sp, #104]
14f4:   a90787e0    stp x0, x1, [sp, #120]
14f8:   aa1503e1    mov x1, x21
14fc:   aa1503e0    mov x0, x21
1500:   97fffb63    bl 28c
1504:   aa1503e2    mov x2,