[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Andrew Pinski changed:

           What    |Removed                     |Added
         Depends on|                            |92716

--- Comment #7 from Andrew Pinski ---
I am 99% sure this is basically PR 92716.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92716
[Bug 92716] -Os doesn't inline byteswap function even though it's a single
instruction
[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #6 from Dave Rodgman ---
Under clang, we see that mbedtls_xor being inlined, or not, causes an
equivalent perf difference. Note that mbedtls_xor is inline in the gcc O2
version and not in the gcc Os version.

Not inline mbedtls_xor, -Os clang:
  AES-XTS-128 :  834549 KiB/s, 0 cycles/byte
  AES-XTS-256 :  674383 KiB/s, 0 cycles/byte

Inline mbedtls_xor, -Os clang:
  AES-XTS-128 : 2664799 KiB/s, 0 cycles/byte
  AES-XTS-256 : 2278008 KiB/s, 0 cycles/byte

However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why.

If I use "__attribute__((always_inline))" gcc will inline it.

So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inline-ing wrong here? For 3/5 cases, we know at compile
time that n == 16, so the function will compile to four instructions:

    139c: 3dc00021  ldr q1, [x1]
    13a0: 3dc00040  ldr q0, [x2]
    13a4: 6e211c00  eor v0.16b, v0.16b, v1.16b
    13a8: 3d80      str q0, [x0]

so it does seem surprising that gcc doesn't want to inline this.
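For readers following the thread, the situation in comment #6 can be sketched roughly as below. This is a minimal illustration only: xor_bytes and xor_block16 are hypothetical names, and the body is not the actual mbedtls_xor source. It shows a small XOR helper marked with the GCC/clang always_inline attribute, which the comment reports is what finally made gcc inline the function at -Os, and a caller where n is the compile-time constant 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a helper like mbedtls_xor: r = a ^ b over n bytes.
 * always_inline asks GCC/clang to inline even when the -Os heuristics would
 * decline to. This is an illustration, not the library's implementation. */
static inline __attribute__((always_inline))
void xor_bytes(unsigned char *r, const unsigned char *a,
               const unsigned char *b, size_t n)
{
    size_t i = 0;

    /* Process 8 bytes at a time; memcpy keeps unaligned access well defined. */
    for (; i + 8 <= n; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, sizeof(x));
        memcpy(&y, b + i, sizeof(y));
        x ^= y;
        memcpy(r + i, &x, sizeof(x));
    }

    /* Handle any remaining tail bytes one at a time. */
    for (; i < n; i++) {
        r[i] = (unsigned char)(a[i] ^ b[i]);
    }
}

/* Caller with n known at compile time, as in the "n == 16" cases above.
 * Once inlined, the compiler can see the constant length and simplify. */
void xor_block16(unsigned char r[16], const unsigned char a[16],
                 const unsigned char b[16])
{
    xor_bytes(r, a, b, 16);
}
```

Whether a given compiler reduces xor_block16 to a handful of vector instructions depends on the target and flags; comparing the output of objdump -d with and without the attribute is one way to check.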
[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #5 from Dave Rodgman ---
(In reply to Richard Biener from comment #3)
> Note you shouldn't use -Os if you care about performance. GCC is quite
> reasonable with code size increases at -O2 (as compared to other compilers).
> Instead I suggest you use -flto with -O2 to decrease the size of the final
> executable/library and give GCC better knowledge on unit growth.

Understood, but I think it depends on the magnitude of the perf difference. I'd
expect a smallish perf drop, say 10%, from -Os to be reasonable, but I'd
consider a 3x perf difference to be a compiler issue.

(In reply to Alexander Monakov from comment #2)
> So basically missed inlining at -Os, even memcpy wrappers are not inlined.
>
> Can you provide a reproducible testcase?
>
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake anyway (if they are really meant to be non-static inlines, shouldn't
> there be a comment?)
>
> Does making them 'static inline' rectify the problem?

The easiest way to reproduce is to use the benchmark tool:

    make programs/test/benchmark CC=gcc CFLAGS="-Os"
    programs/test/benchmark aes_xts

I don't have a compact reproducer, sorry.
[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Dave Rodgman changed:

           What    |Removed                     |Added
           Keywords|missed-optimization         |
          Component|ipa                         |other
             Target|aarch64                     |

--- Comment #4 from Dave Rodgman ---
From a quick test, it doesn't look like the unaligned access inlining is the
issue:

Not static inline, -Os:
  AES-XTS-128 :  853799 KiB/s, 0 cycles/byte
  AES-XTS-256 :  749919 KiB/s, 0 cycles/byte

Static inline, -Os:
  AES-XTS-128 :  885380 KiB/s, 0 cycles/byte
  AES-XTS-256 :  752995 KiB/s, 0 cycles/byte

Not static inline, -O2:
  AES-XTS-128 : 2822656 KiB/s, 0 cycles/byte
  AES-XTS-256 : 2425721 KiB/s, 0 cycles/byte

Static inline, -O2:
  AES-XTS-128 : 2692321 KiB/s, 0 cycles/byte
  AES-XTS-256 : 2446391 KiB/s, 0 cycles/byte
[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

Alexander Monakov changed:

           What    |Removed                     |Added
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov ---
So basically missed inlining at -Os, even memcpy wrappers are not inlined.

Can you provide a reproducible testcase?

Note that inline functions in mbedtls/library/alignment.h all miss the
'static' qualifier, which affects inlining decisions, and looks like a
mistake anyway (if they are really meant to be non-static inlines, shouldn't
there be a comment?)

Does making them 'static inline' rectify the problem?
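As an illustration of the 'static inline' point in comment #2, a memcpy-based unaligned-access wrapper might look like the sketch below. The names get_unaligned_u32 and put_unaligned_u32 are hypothetical and are not taken from mbedtls/library/alignment.h. With 'static inline', each translation unit gets its own internal-linkage copy, so the compiler never needs an out-of-line external definition; a plain C99 'inline' without 'static' only provides an inline definition and relies on an external definition existing elsewhere.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical unaligned read wrapper. memcpy is the portable way to load
 * from a possibly misaligned address; compilers typically lower it to a
 * single load on targets that permit unaligned access. */
static inline uint32_t get_unaligned_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical unaligned write wrapper, the mirror image of the read. */
static inline void put_unaligned_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}
```

A value written with put_unaligned_u32 at an odd offset into a byte buffer reads back unchanged through get_unaligned_u32, regardless of the host byte order.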
[Bug other/110946] 3x perf regression with -Os on M1 Pro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946

--- Comment #1 from Dave Rodgman ---
Disassembly under -Os:

139c :
    139c: a9b67bfd  stp x29, x30, [sp, #-160]!
    13a0: 910003fd  mov x29, sp
    13a4: a9046bf9  stp x25, x26, [sp, #64]
    13a8: aa0003f9  mov x25, x0
    13ac: 9000      adrp x0, 0 <__stack_chk_guard>
    13b0: a90153f3  stp x19, x20, [sp, #16]
    13b4: f940      ldr x0, [x0]
    13b8: a9025bf5  stp x21, x22, [sp, #32]
    13bc: 2a0103f6  mov w22, w1
    13c0: a90363f7  stp x23, x24, [sp, #48]
    13c4: a90573fb  stp x27, x28, [sp, #80]
    13c8: f941      ldr x1, [x0]
    13cc: f9004fe1  str x1, [sp, #152]
    13d0: d281      mov x1, #0x0 // #0
    13d4: 710006df  cmp w22, #0x1
    13d8: 54000c28  b.hi 155c // b.pmore
    13dc: d1004041  sub x1, x2, #0x10
    13e0: aa0203f3  mov x19, x2
    13e4: b27c4fe0  mov x0, #0xf0 // #16777200
    13e8: eb3f      cmp x1, x0
    13ec: 54000bc8  b.hi 1564 // b.pmore
    13f0: 9101a3f5  add x21, sp, #0x68
    13f4: aa0303e2  mov x2, x3
    13f8: aa0403f8  mov x24, x4
    13fc: aa0503f7  mov x23, x5
    1400: aa1503e3  mov x3, x21
    1404: 91048320  add x0, x25, #0x120
    1408: 52800021  mov w1, #0x1 // #1
    140c: 9400      bl 1210
    1410: 2a0003f4  mov w20, w0
    1414: 35000540  cbnz w0, 14bc
    1418: 520002db  eor w27, w22, #0x1
    141c: d344fe7a  lsr x26, x19, #4
    1420: 1200037b  and w27, w27, #0x1
    1424: 92400e73  and x19, x19, #0xf
    1428: 910223fc  add x28, sp, #0x88
    142c: d100075a  sub x26, x26, #0x1
    1430: b100075f  cmn x26, #0x1
    1434: 54000541  b.ne 14dc // b.any
    1438: b4000433  cbz x19, 14bc
    143c: 710002df  cmp w22, #0x0
    1440: d10042fb  sub x27, x23, #0x10
    1444: 9101e3fa  add x26, sp, #0x78
    1448: aa1303e2  mov x2, x19
    144c: 9a95035a  csel x26, x26, x21, eq // eq = none
    1450: aa1b03e1  mov x1, x27
    1454: 910223f5  add x21, sp, #0x88
    1458: aa1703e0  mov x0, x23
    145c: 9400      bl 0
    1460: d2800217  mov x23, #0x10 // #16
    1464: aa1303e3  mov x3, x19
    1468: aa1a03e2  mov x2, x26
    146c: aa1803e1  mov x1, x24
    1470: aa1503e0  mov x0, x21
    1474: 9400      bl 0
    1478: cb1302e3  sub x3, x23, x19
    147c: 8b130342  add x2, x26, x19
    1480: 8b130361  add x1, x27, x19
    1484: 8b1302a0  add x0, x21, x19
    1488: 9400      bl 0
    148c: aa1503e3  mov x3, x21
    1490: aa1503e2  mov x2, x21
    1494: 2a1603e1  mov w1, w22
    1498: aa1903e0  mov x0, x25
    149c: 9400      bl 1210
    14a0: 2a0003f4  mov w20, w0
    14a4: 35c0      cbnz w0, 14bc
    14a8: aa1703e3  mov x3, x23
    14ac: aa1a03e2  mov x2, x26
    14b0: aa1503e1  mov x1, x21
    14b4: aa1b03e0  mov x0, x27
    14b8: 9400      bl 0
    14bc: 9000      adrp x0, 0 <__stack_chk_guard>
    14c0: f940      ldr x0, [x0]
    14c4: f9404fe2  ldr x2, [sp, #152]
    14c8: f941      ldr x1, [x0]
    14cc: eb010042  subs x2, x2, x1
    14d0: d281      mov x1, #0x0 // #0
    14d4: 54000500  b.eq 1574 // b.none
    14d8: 9400      bl 0 <__stack_chk_fail>
    14dc: f100027f  cmp x19, #0x0
    14e0: 1a9f07e0  cset w0, ne // ne = any
    14e4: 6a1b001f  tst w0, w27
    14e8: 54e0      b.eq 1504 // b.none
    14ec: b5da      cbnz x26, 1504
    14f0: a94687e0  ldp x0, x1, [sp, #104]
    14f4: a90787e0  stp x0, x1, [sp, #120]
    14f8: aa1503e1  mov x1, x21
    14fc: aa1503e0  mov x0, x21
    1500: 97fffb63  bl 28c
    1504: aa1503e2  mov x2,