[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #22 from Diego Russo --- Another reason to have this implemented is the CPython JIT. It is a template (stencil) JIT in which every micro-op is precompiled as a stencil. At run time these stencils are stitched together and patched with the next micro-op instruction. This heavily uses preserve_none (https://github.com/python/cpython/blob/main/Tools/jit/template.c#L86) and so far we can only use clang to build these stencils. It would be really great if gcc reached feature parity with llvm so that we can start building the JIT with GCC as well.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #21 from Ken Jin ---

I sincerely apologize for my previous performance figures. The baseline was worse due to a Clang-19 bug https://github.com/llvm/llvm-project/issues/106846. So the numbers were inaccurate.

On Clang-20, on the pystones (Dhrystone variant) benchmark, I get a roughly 3% speedup with the tail-calling interpreter versus computed goto.

I have some numbers to report for CPython compilation time as well. These are with dynamic frequency scaling off:

CC=clang-20 ./configure --with-lto=thin && make clean && time make -j18
+ Tail call: real 1m8.183s
- Tail call: real 1m11.004s

CC=clang-20 ./configure --with-lto=full && make clean && time make -j18
+ Tail call: real 3m49.285s
- Tail call: real 3m59.679s

CC=/home/ken/GCC-15.0-trunk/bin/gcc ./configure --with-lto=full && make clean && time make -j18
+ Tail call: real 10m5.521s
- Tail call: real 10m14.972s

So we save roughly 4-5% compilation time by switching the interpreter from an over-1-line switch case of computed gotos to smaller per-bytecode tail-call handlers on Clang 20. The savings on GCC 15 are lower (around 1%). I have no clue how this 4-5% translates to GCC 15, as the comparison between clang and gcc here is not apples-to-apples. The clang-20 on my system is a release distribution, while my GCC 15 is built from source just with configure and make.

Anyways, I don't mean to push for anything here. Just updating the record and providing new numbers. Thanks again GCC devs for all your work on GCC!
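The compile-time savings quoted in comment #21 can be re-derived from the `time` readings. The following is only a minimal sketch of that arithmetic; the helper names `to_seconds` and `percent_saved` are made up for illustration and appear nowhere in the thread. With the figures above it gives roughly 4% for both Clang builds and about 1.5% for the GCC build.

```c
#include <assert.h>

/* Convert a `time` reading of the form <minutes>m<seconds>s to seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Percentage of wall-clock build time saved by the tail-call build,
 * relative to the build without tail calls (the "- Tail call" rows). */
double percent_saved(double with_tail_call, double without_tail_call)
{
    assert(without_tail_call > 0.0);
    return 100.0 * (without_tail_call - with_tail_call) / without_tail_call;
}
```

For example, `percent_saved(to_seconds(3, 49.285), to_seconds(3, 59.679))` evaluates the Clang full-LTO pair. These are single runs on one machine, so the percentages are indicative only.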
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #20 from Ken Jin ---

(In reply to Andrew Pinski from comment #17)
> I am not sure if I understand this correctly.
> Can you make a simple table:
>
> w/o tail-call - 1
> with tail-call but not preserve_none - XYZ
> with tail-call and preserve_none - PQR

I talked to Diego and this is roughly the table from my understanding

w/o tail-call - 1
with tail-call but not preserve_none - 0.94
with tail-call and preserve_none - 1

The fact that without `preserve_none` is a huge regression is pretty clear. Whether `tail-call and preserve_none` gains a speedup over traditional computed goto/labels-as-values (w/o tail call) is inconclusive. CPython needs PGO[1] and the register pinning (mentioned in Diego's LLVM PR) to produce reliable benchmarking results. However, PGO with musttail is still broken as of right now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118442. And the preserve_none patch is not pinning registers.

We already introduced tailcall+preserve_none for perf reasons in CPython on Clang. However, even if not for perf reasons, I am also motivated to adopt the tailcall interpreter for significantly better debugging experience. Each interpreter instruction is now its own function, and can be measured properly by perf and other tools (previous computed gotos interpreter could not).

As a side note, GCC 15 is extremely impressive here. GCC 15 w/o tail calls performs roughly same as tailcall+preserve none on the pystones benchmark **without PGO**. However, once PGO is enabled on both, clang 19 performs roughly 20% better on pystones than GCC 15 w/o tail calls. So PGO benefits the tail call+preserve_none stuff more than non-tailcall. Hence why we can't make any perf uplift conclusions on CPython yet.

For simplicity, on pystones (different benchmark than Diego's):

Clang-19 w/o tail call no PGO no LTO: much worse than GCC 15
GCC 15 w/o tail call no PGO no LTO: <1
GCC 15 w/o tail call PGO+LTO: 1
Clang-19 with tailcall+preserve_none PGO+LTO: 1.25

[1] Note: this is mostly due to code placement issues in CPython's over 6000 line computed goto interpreter loop.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #19 from Diego Russo ---

> Can you make a simple table:

w/o tail-call - 1
with tail-call but not preserve_none - 0.94
with tail-call and preserve_none - 1

You understood correctly. I think there is still value in having it on AArch64. The debug experience will be much more pleasant :)

> Is there real documentation on this attribute or is it just ad hoc on what it
> does on the LLVM side about the ABI implications?

I'll ping Brandt and let you know.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Sam James changed:

What    |Removed |Added
CC      |        |fw at gcc dot gnu.org,
        |        |hjl.tools at gmail dot com,
        |        |matz at gcc dot gnu.org

--- Comment #18 from Sam James ---

(In reply to Andrew Pinski from comment #17)
> >Can we have the same implementation/interface of LLVM?
>
> Is there real documentation on this attribute or is it just ad hoc on what
> it does on the LLVM side about the ABI implications? It seems to me there
> should be 2 separate attributes, one to change the argument passing and one
> for preserve_none part.
>

See the discussion in PR110899. It is actually a bit worrying as they don't guarantee stability: https://clang.llvm.org/docs/AttributeReference.html#preserve-none.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #17 from Andrew Pinski ---

>Can we have the same implementation/interface of LLVM?

Is there real documentation on this attribute or is it just ad hoc on what it does on the LLVM side about the ABI implications? It seems to me there should be 2 separate attributes, one to change the argument passing and one for the preserve_none part.

>Anyway I re-ran the benchmarks and the binary without preserve_none is
>actually 6% slower than the build without tail-calling interpreter.

I am not sure if I understand this correctly. Can you make a simple table:

w/o tail-call - 1
with tail-call but not preserve_none - XYZ
with tail-call and preserve_none - PQR

From my read, with tail-call but not preserve_none is 0.94, but with both it is some increase or close to 1. Maybe this is an argument that for aarch64, using the tail-calling interpreter is not useful rather than an argument to add preserve_none.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #16 from Diego Russo ---

Right, I had a couple of problems with running the benchmarks: a few failures, and the wrong environment variable to select the compiler binary. Anyway, I re-ran the benchmarks and the binary without preserve_none is actually 6% slower than the build without the tail-calling interpreter. If we introduce the preserve_none attribute, the 6% is regained and it is 0% faster. Hence preserve_none is needed, otherwise we will have a regression.

BTW Brandt (a CPython core developer) pointed me at this GitHub PR: https://github.com/llvm/llvm-project/pull/88333 which tries to use non-volatile registers for preserve_none parameters. With that change we notice a significant speed-up whilst executing benchmarks. LLVM uses normally-non-volatile (x19-x28) first, then normally-volatile registers (x0-x15). I tried compiling that small example and what I have is:

$ objdump -d boring

boring: file format elf64-littleaarch64

Disassembly of section .text:

   0: a9bf7bfd stp x29, x30, [sp, #-16]!
   4: 910003fd mov x29, sp
   8: aa0003f3 mov x19, x0
   c: aa0103f4 mov x20, x1
  10: aa0203f5 mov x21, x2
  14: aa0303f6 mov x22, x3
  18: 9400 bl 0
  1c: aa1603e3 mov x3, x22
  20: aa1503e2 mov x2, x21
  24: aa1403e1 mov x1, x20
  28: aa1303e0 mov x0, x19
  2c: a8c17bfd ldp x29, x30, [sp], #16
  30: 1400 b 0

which differs from the second block of text on that PR. Can we have the same implementation/interface as LLVM?

Thanks
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Sam James changed:

What             |Removed     |Added
Last reconfirmed |            |2025-02-07
Status           |UNCONFIRMED |NEW
Ever confirmed   |0           |1
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #15 from Diego Russo --- Folks, I think I've botched the performance measurement. Need to retake the measurement. Give me some time and I'll come back with the right results.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #14 from Richard Sandiford ---

(In reply to Sam James from comment #13)
> The request here notwithstanding, bug report(s) with testcases for missed
> opportunities in ipa-ra would be welcome too.

Agreed, if we find any. But just in case it seemed otherwise, the effect that Diego described in comment 12 isn't a missed ipa-ra opportunity, but a direct benefit of having preserve_none functions calling normal functions (see also comment 2). ipa-ra would not be able to do that, since it is bound by the ABI of the function that it's compiling.
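To make the point in comment #14 concrete, here is a small hedged sketch of a preserve_none function calling a normal-ABI function. The names `PRESERVE_NONE`, `add_slow`, `handler` and `call_handler` are invented for this illustration. The attribute is only applied where the compiler reports support through `__has_attribute`; elsewhere the code falls back to the default ABI, so the result is the same but no code-generation benefit should be expected.

```c
#include <assert.h>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif

/* Hypothetical wrapper macro: expands to nothing on compilers or
 * targets that do not know the attribute. */
#if __has_attribute(preserve_none)
#  define PRESERVE_NONE __attribute__((preserve_none))
#else
#  define PRESERVE_NONE
#endif

/* Ordinary-ABI helper, kept out of line so that a real call remains. */
__attribute__((noinline)) long add_slow(long a, long b)
{
    return a + b;
}

/* c and d are live across the call to add_slow(). Under preserve_none
 * this function owes its own caller no register preservation, while the
 * normal-ABI callee still preserves the usual callee-saved registers,
 * so in principle c and d can sit in those registers with no
 * prologue/epilogue save and restore. */
PRESERVE_NONE long handler(long a, long b, long c, long d)
{
    long sum = add_slow(a, b);
    return sum * c + d;
}

/* Normal-ABI entry point that calls into the preserve_none function. */
long call_handler(long a, long b, long c, long d)
{
    assert(c != 0);
    return handler(a, b, c, d);
}
```

Comparing the disassembly of `handler` with and without the attribute on a compiler that supports it is one way to look for the effect; the exact registers chosen depend on the implementation.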
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #13 from Sam James --- The request here notwithstanding, bug report(s) with testcases for missed opportunities in ipa-ra would be welcome too. (btw, x86 has no_callee_saved_registers / no_caller_saved_registers too.)
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Diego Russo changed:

What    |Removed |Added
CC      |        |Diego.Russo at arm dot com

--- Comment #12 from Diego Russo ---

Hello, I was able to test Richard's patch and I'm glad to confirm that it brings the benefit expected. I built gcc with the patch and with it I compiled https://github.com/Fidget-Spinner/cpython/tree/tail-call-gcc-2 branch that implements the tail-calling interpreter. I've also compiled a modified version of that branch which doesn't use the preserve_none attribute. We noticed improvements in the code generation.

This is the version without preserve_none

0060600c <_TAIL_CALL_BINARY_OP_ADD_INT>:
  60600c: f85f0026 ldur x6, [x1, #-16]
  606010: aa0103e5 mov x5, x1
  606014: 900018e1 adrp x1, 922000
  606018: 91114021 add x1, x1, #0x450
  60601c: aa0003e9 mov x9, x0
  606020: f94004c7 ldr x7, [x6, #8]
  606024: f9001c03 str x3, [x0, #56]
  606028: eb0100ff cmp x7, x1
  60602c: 540000a1 b.ne 606040 <_TAIL_CALL_BINARY_OP_ADD_INT+0x34> // b.any
  606030: f85f80a8 ldur x8, [x5, #-8]
  606034: f9400500 ldr x0, [x8, #8]
  606038: eb07001f cmp x0, x7
  60603c: 54000080 b.eq 60604c <_TAIL_CALL_BINARY_OP_ADD_INT+0x40> // b.none
  606040: aa0503e1 mov x1, x5
  606044: aa0903e0 mov x0, x9
  606048: 17fffd9a b 6056b0 <_TAIL_CALL_BINARY_OP>
  60604c: a9bb7bfd stp x29, x30, [sp, #-80]!
  606050: aa0803e1 mov x1, x8
  606054: aa0603e0 mov x0, x6
  606058: 910003fd mov x29, sp
  60605c: a90153f3 stp x19, x20, [sp, #16]
  606060: 91003073 add x19, x3, #0xc
  606064: aa0203f4 mov x20, x2
  606068: a90223e6 stp x6, x8, [sp, #32]
  60606c: a90327e3 stp x3, x9, [sp, #48]
  606070: f90023e5 str x5, [sp, #64]
  606074: 97fb2ca4 bl 4d1304 <_PyLong_Add>
  606078: a94223e6 ldp x6, x8, [sp, #32]
  60607c: aa0003e4 mov x4, x0
  606080: f94023e5 ldr x5, [sp, #64]
  606084: a94327e3 ldp x3, x9, [sp, #48]
  606088: b9400100 ldr w0, [x8]
  60608c: 37f80340 tbnz w0, #31, 6060f4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
  606090: 51000400 sub w0, w0, #0x1
  606094: b9000100 str w0, [x8]
  606098: 350002e0 cbnz w0, 6060f4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
  60609c: 90001a20 adrp x0, 94a000
  6060a0: 91176000 add x0, x0, #0x5d8
  6060a4: f9544807 ldr x7, [x0, #10384]
  6060a8: b4000167 cbz x7, 6060d4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xc8>
  6060ac: f9544c02 ldr x2, [x0, #10392]
  6060b0: a9021be8 stp x8, x6, [sp, #32]
  6060b4: aa0803e0 mov x0, x8
  6060b8: 52800021 mov w1, #0x1 // #1
  6060bc: f9001be4 str x4, [sp, #48]
  6060c0: f90027e3 str x3, [sp, #72]
  6060c4: d63f00e0 blr x7
  6060c8: a9421be8 ldp x8, x6, [sp, #32]
  6060cc: a94327e4 ldp x4, x9, [sp, #48]
  6060d0: a9440fe5 ldp x5, x3, [sp, #64]
  6060d4: aa0803e0 mov x0, x8
  6060d8: a90213e6 stp x6, x4, [sp, #32]
  6060dc: a90317e9 stp x9, x5, [sp, #48]
  6060e0: f90023e3 str x3, [sp, #64]
  6060e4: 97fb2c71 bl 4d12a8 <_PyLong_ExactDealloc>
  6060e8: a94213e6 ldp x6, x4, [sp, #32]
  6060ec: a94317e9 ldp x9, x5, [sp, #48]
  6060f0: f94023e3 ldr x3, [sp, #64]
  6060f4: b94000c0 ldr w0, [x6]
  6060f8: 37f80300 tbnz w0, #31, 606158 <_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
  6060fc: 51000400 sub w0, w0, #0x1
  606100: b90000c0 str w0, [x6]
  606104: 350002a0 cbnz w0, 606158 <_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
  606108: 90001a20 adrp x0, 94a000
  60610c: 91176000 add x0, x0, #0x5d8
  606110: f9544807 ldr x7, [x0, #10384]
  606114: b4000167 cbz x7, 606140 <_TAIL_CALL_BINARY_OP_ADD_INT+0x134>
  606118: f9544c02 ldr x2, [x0, #10392]
  60611c: a90213e6 stp x6, x4, [sp, #32]
  606120: aa0603e0 mov x0, x6
  606124: a90317e9 stp x9, x5, [sp, #48]
  606128: 52800021
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #11 from Richard Sandiford --- Created attachment 60175 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60175&action=edit Proof-of-concept patch Here's a lightly-tested proof-of-concept patch for preserve_none on AArch64. In practice, I don't think there's much scope for sharing implementation code between targets.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #10 from Andrew Pinski ---

(In reply to Ken Jin from comment #7)
> The files are too big to upload here, so I've uploaded them to
> https://github.com/Fidget-Spinner/debugging-dump. They correspond to the
> main interpreter loop of CPython
> https://github.com/python/cpython/blob/e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c

Filed that issue as PR 118465. Since I work on aarch64, I am not going to do the extraction of the testcase in the end.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #9 from Andrew Pinski ---

(In reply to Ken Jin from comment #7)
> Specifically, zoom in on the function _TAIL_CALL_YIELD_VALUE, it produces on
> GCC 15 (note the assembly here might be slightly different than the one in
> .s file, because it's from a different build but same flags passed):

That is about aligning the stack. And that is an x86_64-specific issue, I think. Let me try to get a reduced testcase for that and file it separately.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #8 from Andrew Pinski ---

(In reply to Ken Jin from comment #7)
> The files are too big to upload here, so I've uploaded them to
> https://github.com/Fidget-Spinner/debugging-dump. They correspond to the
> main interpreter loop of CPython
> https://github.com/python/cpython/blob/e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c

Since this bug is about adding preserve_none for aarch64, do you have the preprocessed source for aarch64?
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #7 from Ken Jin ---

The files are too big to upload here, so I've uploaded them to https://github.com/Fidget-Spinner/debugging-dump. They correspond to the main interpreter loop of CPython https://github.com/python/cpython/blob/e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c .

Compiled with

/home/ken/GCC-15.0-trunk/bin/gcc -c -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3 -Wall -std=c11 -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden -I./Include/internal -I./Include/internal/mimalloc -I. -I./Include -DPy_BUILD_CORE --save-temps -o Python/ceval.o Python/ceval.c

Specifically, zoom in on the function _TAIL_CALL_YIELD_VALUE, it produces on GCC 15 (note the assembly here might be slightly different than the one in .s file, because it's from a different build but same flags passed):

    pushq %rbx
    movq -24(%rdi), %rax
    addq $2, %rcx
    subl $2, %r9d
    movq -8(%rsi), %r8
    subq $8, %rsi
    movb %r9b, -5(%rdi)
    movq %rcx, 56(%rdi)
    movq %rsi, 64(%rdi)
    movq %rax, 120(%rdx)
    movq 8(%rdi), %rax
    movq $0, -24(%rdi)
    movq 56(%rax), %rcx
    movq 64(%rax), %rsi
    movq %rax, 72(%rdx)
    addl $1, 44(%rdx)
    movzwl 4(%rcx), %r9d
    movq $0, 8(%rdi)
    addq $4, %rcx
    addq $8, %rsi
    movq $0, 64(%rax)
    movl %r9d, %ebx
    movzbl %r9b, %edi
    movq %r8, -8(%rsi)
    movzbl %bh, %ebx
    movq INSTRUCTION_TABLE(,%rdi,8), %r10
    movq %rdi, %r8
    movq %rax, %rdi
    movl %ebx, %r9d
    popq %rbx
    jmp *%r10
    .string "ENTER_EXECUTOR is not supported in this build"

On Clang-19.1, it produces:

    movq %r15, 56(%r12)
    movq -8(%r13), %rcx
    addq $-8, %r13
    addq $2, %r15
    movq %r15, 56(%r12)
    addb $-2, %sil
    movb %sil, -5(%r12)
    movq %r13, 64(%r12)
    movq -24(%r12), %rax
    movq %rax, 120(%r14)
    movq $0, -24(%r12)
    incl 44(%r14)
    movq 8(%r12), %rax
    movq %rax, 72(%r14)
    movq $0, 8(%r12)
    movq 56(%rax), %r15
    movq 64(%rax), %r13
    movq $0, 64(%rax)
    movq %rcx, (%r13)
    addq $8, %r13
    movzwl 4(%r15), %esi
    addq $4, %r15
    movzbl %sil, %edi
    shrl $8, %esi
    leaq INSTRUCTION_TABLE(%rip), %rcx
    movq %rax, %r12
    jmpq *(%rcx,%rdi,8)
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #6 from Andrew Pinski ---

(In reply to Ken Jin from comment #5)
> However, it seems to me that there's still extraneous push and pops for
> function prologue/epilogue that could be removed with preserve_none. GCC's
> regalloc is definitely a lot better than Clang when both don't have
> preserve_none, but with preserve_none it seems that Clang does better
> regalloc. So I think this might still be worth looking at.

Can you provide the preprocessed source where you think the extraneous push and pops happen? It might be a different issue and preserve_none might not solve it.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #5 from Ken Jin --- However, it seems to me that there's still extraneous push and pops for function prologue/epilogue that could be removed with preserve_none. GCC's regalloc is definitely a lot better than Clang when both don't have preserve_none, but with preserve_none it seems that Clang does better regalloc. So I think this might still be worth looking at.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #4 from Ken Jin --- I can confirm that in the case of tail calls, GCC does produce better/equivalent register spilling code than clang 19.1.0, by manual inspection of call sites.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Ken Jin changed:

What    |Removed |Added
CC      |        |kenjin4096 at gmail dot com

--- Comment #3 from Ken Jin ---

Hi, I'm the OP in the CPython issue. I updated the PR to say that it is pure speculation on my part that GCC produces not-good-enough code without preserve_none. Sorry for the confusion. I don't have GCC trunk to test with musttail, but I'm happy to do so after I land that PR in CPython.
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #2 from Richard Sandiford ---

(In reply to Andrew Pinski from comment #1)
> Note most of the use cases in my view for these attributes. These attributes
> are there specifically to work around the fact that llvm does not do ipa ra
> and the compiler does not record which registers are already preserved.

I think the use case for preserve_none is a bit different from IPA RA, at least in the CPython case. IPA RA is about optimising callers based on information about callees, but preserve_none is instead about optimising the callees themselves (regardless of who the caller might be).

If a function consists of a long chain of musttail calls, then it's relatively unlikely that saving and restoring registers “for the caller” will be beneficial. Each call in the musttail chain would need to save and restore the same call-preserved registers (if the function uses the registers internally). E.g. if you have f1 tail calling to f2, tail calling to f3, ... tail calling to f100, and all 100 functions use X19, you'll get 100 saves and restores of X19, all for one unknown caller. It's more efficient to tell the caller that it must preserve X19 itself.

> I suspect gcc code generation is already decent.

My impression from the CPython issue was that the GCC code quality wasn't acceptable without the attribute, but I agree that that's implied rather than explicit.
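The musttail chain described in comment #2 can be sketched as a toy bytecode interpreter. This is an illustration under stated assumptions, not CPython's code: the opcodes, the `handlers` table, `run_program` and the `PRESERVE_NONE` / `MUSTTAIL` wrapper macros are all hypothetical names. Both attributes are applied only when `__has_attribute` reports them; without them the handlers are plain calls and returns, which gives the same result but may grow the stack.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif

#if __has_attribute(preserve_none)
#  define PRESERVE_NONE __attribute__((preserve_none))
#else
#  define PRESERVE_NONE            /* fall back to the default ABI */
#endif

#if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
#else
#  define MUSTTAIL                 /* fall back to an ordinary return */
#endif

/* Hypothetical opcodes for the toy interpreter. */
enum { OP_HALT = 0, OP_INC = 1, OP_DOUBLE = 2 };

/* Every handler shares one signature and calling convention, which is
 * what makes a guaranteed tail call between them possible. */
typedef long (PRESERVE_NONE *handler_fn)(const unsigned char *pc, long acc);

PRESERVE_NONE static long op_halt(const unsigned char *pc, long acc);
PRESERVE_NONE static long op_inc(const unsigned char *pc, long acc);
PRESERVE_NONE static long op_double(const unsigned char *pc, long acc);

static const handler_fn handlers[] = { op_halt, op_inc, op_double };

/* End of the chain: the only handler that returns to the original caller. */
PRESERVE_NONE static long op_halt(const unsigned char *pc, long acc)
{
    (void)pc;
    return acc;
}

/* Each remaining handler does its work, then tail calls the next one. */
PRESERVE_NONE static long op_inc(const unsigned char *pc, long acc)
{
    MUSTTAIL return handlers[pc[1]](pc + 1, acc + 1);
}

PRESERVE_NONE static long op_double(const unsigned char *pc, long acc)
{
    MUSTTAIL return handlers[pc[1]](pc + 1, acc * 2);
}

/* Normal-ABI entry point: the single "unknown caller" of the chain.
 * The program must end with OP_HALT. */
long run_program(const unsigned char *code, long start)
{
    assert(code != NULL);
    return handlers[code[0]](code, start);
}
```

Running the program `{ OP_INC, OP_DOUBLE, OP_INC, OP_HALT }` with a start value of 3 walks the chain inc, double, inc, halt. With preserve_none, none of the handlers has to save call-preserved registers on behalf of `run_program`; instead `run_program` preserves what it needs once around the call into the chain.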
[Bug target/118328] Implement preserve_none for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328 --- Comment #1 from Andrew Pinski --- Note most of the use cases in my view for these attributes. These attributes are there specifically to work around the fact that llvm does not do ipa ra and the compiler does not record which registers are already preserved. I suspect gcc code generation is already decent.