[Bug target/118328] Implement preserve_none for AArch64

2025-04-07 Thread Diego.Russo at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #22 from Diego Russo  ---
Another reason to have this implemented is the CPython JIT. It is a template
(stencil) JIT where every micro-op is precompiled as a stencil. At run time these
stencils are stitched together and patched with the next micro-op
instruction. This heavily uses preserve_none
(https://github.com/python/cpython/blob/main/Tools/jit/template.c#L86), and so
far we can only use clang to build these stencils.
It would be really great if gcc reached feature parity with llvm so that we can
start building the JIT with GCC as well.
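
To make the shape of those stencils concrete, here is a minimal sketch (a
sketch only, with assumed names rather than the real template.c contents; it
needs a compiler that supports preserve_none and musttail, i.e. Clang today):

/* One micro-op handler compiled as a stand-alone stencil.  _JIT_CONTINUE is a
   placeholder whose address the JIT patches at run time with the next stencil;
   preserve_none keeps registers free for interpreter state, and musttail turns
   dispatch into a plain jump. */
typedef struct VMState VMState;

__attribute__((preserve_none)) void _JIT_CONTINUE(VMState *state);

__attribute__((preserve_none)) void _JIT_ENTRY(VMState *state)
{
    /* ... body of this micro-op ... */
    __attribute__((musttail)) return _JIT_CONTINUE(state);
}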

[Bug target/118328] Implement preserve_none for AArch64

2025-04-05 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #21 from Ken Jin  ---
I sincerely apologize for my previous performance figures. The baseline was
worse due to a Clang-19 bug https://github.com/llvm/llvm-project/issues/106846.
So the numbers were inaccurate.

On Clang-20, on the pystones (Dhrystone variant) benchmark, I get a roughly 3%
speedup with the tail-calling interpreter versus computed goto.

I have some numbers to report for CPython compilation time as well. These are
with dynamic frequency scaling off:

CC=clang-20 ./configure --with-lto=thin && make clean && time make -j18

+ Tail call:
real    1m8.183s

- Tail call:
real    1m11.004s

CC=clang-20 ./configure --with-lto=full && make clean && time make -j18

+ Tail call:
real    3m49.285s

- Tail call:
real    3m59.679s

CC=/home/ken/GCC-15.0-trunk/bin/gcc ./configure --with-lto=full && make clean
&& time make -j18

+ Tail call:
real    10m5.521s

- Tail call:
real    10m14.972s

So we save roughly 4-5% compilation time on Clang 20 by switching the
interpreter from one huge computed-goto switch to smaller per-bytecode
tail-call handlers. The savings on GCC 15 are lower (around 1%).

I have no clue how this 4-5% translates to GCC 15, as the comparison between
clang and gcc here is not apples-to-apples. The clang-20 on my system is a
release distribution, while my GCC 15 is built from source just with configure
and make.

Anyways, I don't mean to push for anything here. Just updating the record and
providing new numbers. Thanks again GCC devs for all your work on GCC!

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #20 from Ken Jin  ---
(In reply to Andrew Pinski from comment #17)
> I am not sure if I understand this correctly.
> Can you make a simple table:
> 
> w/o tail-call - 1
> with tail-call but not preserve_none  - XYZ
> with tail-call and preserve_none  - PQR

I talked to Diego, and this is roughly the table as I understand it:

w/o tail-call - 1
with tail-call but not preserve_none  - 0.94
with tail-call and preserve_none  - 1

The fact that without `preserve_none` there is a huge regression is pretty clear.
Whether `tail-call and preserve_none` gains a speedup over traditional computed
goto/labels-as-values (w/o tail call) is inconclusive. CPython needs PGO[1] and
the register pinning (mentioned in Diego's LLVM PR) to produce reliable
benchmarking results.
However, PGO with musttail is still broken as of right now
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118442. And the preserve_none
patch is not pinning registers.

We already introduced tailcall+preserve_none for perf reasons in CPython on
Clang. However, even if not for perf reasons, I am also motivated to adopt the
tailcall interpreter for a significantly better debugging experience. Each
interpreter instruction is now its own function, and can be measured properly
by perf and other tools (the previous computed-goto interpreter could not).
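
For readers unfamiliar with the two styles being compared, here is a compressed,
hypothetical sketch (not CPython's real code; musttail needs Clang or GCC 15):

/* Style 1: one big function using computed gotos (labels as values).
   Every opcode is a label inside this single function, so profilers see
   one giant symbol. */
long run_goto(const unsigned char *ip, long acc)
{
    static void *table[] = { &&op_inc, &&op_halt };
    goto *table[*ip++];
op_inc:
    acc += 1;
    goto *table[*ip++];
op_halt:
    return acc;
}

/* Style 2: each opcode is its own small function and dispatch is a tail
   call, so perf and friends see one symbol per instruction.  In CPython
   the handlers additionally carry preserve_none when built with Clang. */
typedef long handler_fn(const unsigned char *ip, long acc);
extern handler_fn *dispatch[];

long op_inc_fn(const unsigned char *ip, long acc)
{
    acc += 1;
    __attribute__((musttail)) return dispatch[*ip](ip + 1, acc);
}

long op_halt_fn(const unsigned char *ip, long acc)
{
    (void)ip;
    return acc;
}

handler_fn *dispatch[] = { op_inc_fn, op_halt_fn };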

As a side note, GCC 15 is extremely impressive here. GCC 15 w/o tail calls
performs roughly the same as tailcall+preserve_none on the pystones benchmark
**without PGO**. However, once PGO is enabled on both, clang 19 performs
roughly 20% better on pystones than GCC 15 w/o tail calls. So PGO benefits the
tail call+preserve_none configuration more than the non-tailcall one. Hence we
can't draw any perf uplift conclusions for CPython yet.

For simplicity, on pystones (different benchmark than Diego's):

Clang-19 w/o tail call, no PGO, no LTO:        much worse than GCC 15
GCC 15 w/o tail call, no PGO, no LTO:          <1
GCC 15 w/o tail call, PGO+LTO:                 1
Clang-19 with tailcall+preserve_none, PGO+LTO: 1.25

[1] Note: this is mostly due to code placement issues in CPython's over 6000
line computed goto interpreter loop.

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread Diego.Russo at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #19 from Diego Russo  ---
> Can you make a simple table:

w/o tail-call - 1
with tail-call but not preserve_none  - 0.94
with tail-call and preserve_none  - 1

You understood correctly.

I think there is still value in having it on AArch64. The debug experience will
be much more pleasant :)

> Is there real documentation on this attribute or is it just ad hoc on what it 
> does on the LLVM side about the ABI implications? 

I'll ping Brandt and let you know.

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Sam James  changed:

   What|Removed |Added

 CC||fw at gcc dot gnu.org,
   ||hjl.tools at gmail dot com,
   ||matz at gcc dot gnu.org

--- Comment #18 from Sam James  ---
(In reply to Andrew Pinski from comment #17)
> >Can we have the same implementation/interface as LLVM?
> 
> Is there real documentation on this attribute, or is the ABI just ad hoc,
> based on what it does on the LLVM side? It seems to me there should be 2
> separate attributes: one to change the argument passing and one for the
> preserve_none part.
> 

See the discussion in PR110899. It is actually a bit worrying as they don't
guarantee stability:
https://clang.llvm.org/docs/AttributeReference.html#preserve-none.

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #17 from Andrew Pinski  ---
>Can we have the same implementation/interface as LLVM?

Is there real documentation on this attribute, or is the ABI just ad hoc,
based on what it does on the LLVM side? It seems to me there should be 2
separate attributes: one to change the argument passing and one for the
preserve_none part.

>Anyway I re-ran the benchmarks and the binary without preserve_none is 
>actually 6% slower than the build without tail-calling interpreter.

I am not sure if I understand this correctly.
Can you make a simple table:

w/o tail-call - 1
with tail-call but not preserve_none  - XYZ
with tail-call and preserve_none  - PQR

My read is that with tail-call but not preserve_none it is 0.94, but with both
it is some increase or close to 1.

Maybe this is an argument that, for aarch64, using the tail-calling interpreter
is not useful, rather than an argument to add preserve_none.

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread Diego.Russo at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #16 from Diego Russo  ---
Right, I had a couple of problems with running the benchmarks: a few failures
and the wrong environment variable to select the compiler binary.

Anyway, I re-ran the benchmarks, and the binary without preserve_none is
actually 6% slower than the build without the tail-calling interpreter. If we
introduce the preserve_none attribute, the 6% is regained and it is 0% faster.
Hence preserve_none is needed; otherwise we will have a regression.

BTW Brandt (a CPython core developer) pointed me at this GitHub pull request:
https://github.com/llvm/llvm-project/pull/88333, which tries to use
non-volatile registers for preserve_none parameters. With that change we
noticed a significant speed-up whilst executing benchmarks.

LLVM uses the normally non-volatile registers (x19-x28) first, then the
normally volatile registers (x0-x15).
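
A minimal function with the shape being discussed might look like the following
(a sketch of what such a small example could be, not the exact test case from
the PR; preserve_none needs Clang or a GCC with the proof-of-concept patch):

void keep_alive(void);

__attribute__((preserve_none)) void next_handler(long a, long b, long c, long d);

__attribute__((preserve_none)) void boring(long a, long b, long c, long d)
{
    keep_alive();              /* ordinary AAPCS call in the middle        */
    next_handler(a, b, c, d);  /* tail call; a-d must survive keep_alive() */
}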

I tried compiling that small example and what I have is:

$ objdump -d boring

boring: file format elf64-littleaarch64


Disassembly of section .text:

 :
   0:   a9bf7bfd    stp     x29, x30, [sp, #-16]!
   4:   910003fd    mov     x29, sp
   8:   aa0003f3    mov     x19, x0
   c:   aa0103f4    mov     x20, x1
  10:   aa0203f5    mov     x21, x2
  14:   aa0303f6    mov     x22, x3
  18:   9400        bl      0
  1c:   aa1603e3    mov     x3, x22
  20:   aa1503e2    mov     x2, x21
  24:   aa1403e1    mov     x1, x20
  28:   aa1303e0    mov     x0, x19
  2c:   a8c17bfd    ldp     x29, x30, [sp], #16
  30:   1400        b       0

which differs from the second block of assembly on that PR.

Can we have the same implementation/interface as LLVM?

Thanks

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Sam James  changed:

   What|Removed |Added

   Last reconfirmed||2025-02-07
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread Diego.Russo at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #15 from Diego Russo  ---
Folks, I think I've botched the performance measurements and need to retake
them. Give me some time and I'll come back with the right results.

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #14 from Richard Sandiford  ---
(In reply to Sam James from comment #13)
> The request here notwithstanding, bug report(s) with testcases for missed
> opportunities in ipa-ra would be welcome too.
Agreed, if we find any.  But just in case it seemed otherwise, the effect that
Diego described in comment 12 isn't a missed ipa-ra opportunity, but a direct
benefit of having preserve_none functions calling normal functions (see also
comment 2).  ipa-ra would not be able to do that, since it is bound by the ABI
of the function that it's compiling.
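
A tiny sketch of that direct benefit (illustrative names only; preserve_none
needs Clang or a GCC with the proof-of-concept patch):

/* handler() follows preserve_none, so it may use x19-x28 freely without
   saving them in its own prologue.  helper() follows the normal AAPCS and
   therefore preserves x19-x28 across the call, so 'b', which is live across
   the first call, can sit in a callee-saved register at no save/restore
   cost to handler(). */
long helper(long);   /* ordinary AAPCS callee */

__attribute__((preserve_none)) long handler(long a, long b)
{
    long t = helper(a);    /* 'b' stays live across this call */
    return helper(t + b);
}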

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #13 from Sam James  ---
The request here notwithstanding, bug report(s) with testcases for missed
opportunities in ipa-ra would be welcome too.

(btw, x86 has no_callee_saved_registers / no_caller_saved_registers too.)
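
For reference, a minimal sketch of how those x86 attributes are spelled (x86
only; treat the exact version support as something to check in the GCC docs):

#if defined(__x86_64__) || defined(__i386__)
/* The function preserves nothing; callers must assume every register is
   clobbered (the closest analogue to preserve_none). */
__attribute__((no_callee_saved_registers))
void hot_path(void *state);

/* The function preserves every register it touches; callers need not spill
   anything around the call (useful for rarely taken helpers). */
__attribute__((no_caller_saved_registers))
void cold_path(void *state);
#endif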

[Bug target/118328] Implement preserve_none for AArch64

2025-02-07 Thread Diego.Russo at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Diego Russo  changed:

   What|Removed |Added

 CC||Diego.Russo at arm dot com

--- Comment #12 from Diego Russo  ---
Hello,

I was able to test Richard's patch and I'm glad to confirm that it brings the
expected benefit.
I built gcc with the patch and used it to compile the
https://github.com/Fidget-Spinner/cpython/tree/tail-call-gcc-2 branch, which
implements the tail-calling interpreter.
I've also compiled a modified version of that branch which doesn't use the
preserve_none attribute.

We noticed improvements in the code generation. This is the version without
preserve_none:


0060600c <_TAIL_CALL_BINARY_OP_ADD_INT>:
  60600c:   f85f0026    ldur    x6, [x1, #-16]
  606010:   aa0103e5    mov     x5, x1
  606014:   900018e1    adrp    x1, 922000
  606018:   91114021    add     x1, x1, #0x450
  60601c:   aa0003e9    mov     x9, x0
  606020:   f94004c7    ldr     x7, [x6, #8]
  606024:   f9001c03    str     x3, [x0, #56]
  606028:   eb0100ff    cmp     x7, x1
  60602c:   54a1        b.ne    606040 <_TAIL_CALL_BINARY_OP_ADD_INT+0x34>  // b.any
  606030:   f85f80a8    ldur    x8, [x5, #-8]
  606034:   f9400500    ldr     x0, [x8, #8]
  606038:   eb07001f    cmp     x0, x7
  60603c:   5480        b.eq    60604c <_TAIL_CALL_BINARY_OP_ADD_INT+0x40>  // b.none
  606040:   aa0503e1    mov     x1, x5
  606044:   aa0903e0    mov     x0, x9
  606048:   17fffd9a    b       6056b0 <_TAIL_CALL_BINARY_OP>
  60604c:   a9bb7bfd    stp     x29, x30, [sp, #-80]!
  606050:   aa0803e1    mov     x1, x8
  606054:   aa0603e0    mov     x0, x6
  606058:   910003fd    mov     x29, sp
  60605c:   a90153f3    stp     x19, x20, [sp, #16]
  606060:   91003073    add     x19, x3, #0xc
  606064:   aa0203f4    mov     x20, x2
  606068:   a90223e6    stp     x6, x8, [sp, #32]
  60606c:   a90327e3    stp     x3, x9, [sp, #48]
  606070:   f90023e5    str     x5, [sp, #64]
  606074:   97fb2ca4    bl      4d1304 <_PyLong_Add>
  606078:   a94223e6    ldp     x6, x8, [sp, #32]
  60607c:   aa0003e4    mov     x4, x0
  606080:   f94023e5    ldr     x5, [sp, #64]
  606084:   a94327e3    ldp     x3, x9, [sp, #48]
  606088:   b9400100    ldr     w0, [x8]
  60608c:   37f80340    tbnz    w0, #31, 6060f4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
  606090:   51000400    sub     w0, w0, #0x1
  606094:   b9000100    str     w0, [x8]
  606098:   350002e0    cbnz    w0, 6060f4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xe8>
  60609c:   90001a20    adrp    x0, 94a000
  6060a0:   91176000    add     x0, x0, #0x5d8
  6060a4:   f9544807    ldr     x7, [x0, #10384]
  6060a8:   b4000167    cbz     x7, 6060d4 <_TAIL_CALL_BINARY_OP_ADD_INT+0xc8>
  6060ac:   f9544c02    ldr     x2, [x0, #10392]
  6060b0:   a9021be8    stp     x8, x6, [sp, #32]
  6060b4:   aa0803e0    mov     x0, x8
  6060b8:   52800021    mov     w1, #0x1    // #1
  6060bc:   f9001be4    str     x4, [sp, #48]
  6060c0:   f90027e3    str     x3, [sp, #72]
  6060c4:   d63f00e0    blr     x7
  6060c8:   a9421be8    ldp     x8, x6, [sp, #32]
  6060cc:   a94327e4    ldp     x4, x9, [sp, #48]
  6060d0:   a9440fe5    ldp     x5, x3, [sp, #64]
  6060d4:   aa0803e0    mov     x0, x8
  6060d8:   a90213e6    stp     x6, x4, [sp, #32]
  6060dc:   a90317e9    stp     x9, x5, [sp, #48]
  6060e0:   f90023e3    str     x3, [sp, #64]
  6060e4:   97fb2c71    bl      4d12a8 <_PyLong_ExactDealloc>
  6060e8:   a94213e6    ldp     x6, x4, [sp, #32]
  6060ec:   a94317e9    ldp     x9, x5, [sp, #48]
  6060f0:   f94023e3    ldr     x3, [sp, #64]
  6060f4:   b94000c0    ldr     w0, [x6]
  6060f8:   37f80300    tbnz    w0, #31, 606158 <_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
  6060fc:   51000400    sub     w0, w0, #0x1
  606100:   b9c0        str     w0, [x6]
  606104:   350002a0    cbnz    w0, 606158 <_TAIL_CALL_BINARY_OP_ADD_INT+0x14c>
  606108:   90001a20    adrp    x0, 94a000
  60610c:   91176000    add     x0, x0, #0x5d8
  606110:   f9544807    ldr     x7, [x0, #10384]
  606114:   b4000167    cbz     x7, 606140 <_TAIL_CALL_BINARY_OP_ADD_INT+0x134>
  606118:   f9544c02    ldr     x2, [x0, #10392]
  60611c:   a90213e6    stp     x6, x4, [sp, #32]
  606120:   aa0603e0    mov     x0, x6
  606124:   a90317e9    stp     x9, x5, [sp, #48]
  606128:   52800021

[Bug target/118328] Implement preserve_none for AArch64

2025-01-16 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #11 from Richard Sandiford  ---
Created attachment 60175
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60175&action=edit
Proof-of-concept patch

Here's a lightly-tested proof-of-concept patch for preserve_none on AArch64. 
In practice, I don't think there's much scope for sharing implementation code
between targets.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #10 from Andrew Pinski  ---
(In reply to Ken Jin from comment #7)
> The files are too big to upload here, so I've uploaded them to
> https://github.com/Fidget-Spinner/debugging-dump. They correspond to the
> main interpreter loop of CPython
> https://github.com/python/cpython/blob/
> e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c

Filed that issue as PR 118465. Since I work on aarch64, I am not going to do
the extraction of the testcase in the end.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #9 from Andrew Pinski  ---
(In reply to Ken Jin from comment #7)
> Specifically, zoom in on the function _TAIL_CALL_YIELD_VALUE, it produces on
> GCC 15 (note the assembly here might be slightly different than the one in
> .s file, because it's from a different build but same flags passed):

That is about aligning the stack, and that is an x86_64-specific issue, I think.
Let me try to get a reduced testcase for that and file it separately.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #8 from Andrew Pinski  ---
(In reply to Ken Jin from comment #7)
> The files are too big to upload here, so I've uploaded them to
> https://github.com/Fidget-Spinner/debugging-dump. They correspond to the
> main interpreter loop of CPython
> https://github.com/python/cpython/blob/
> e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c

Since this bug is about adding preserve_none for aarch64, do you have the
preprocessed source for aarch64?

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #7 from Ken Jin  ---
The files are too big to upload here, so I've uploaded them to
https://github.com/Fidget-Spinner/debugging-dump. They correspond to the main
interpreter loop of CPython
https://github.com/python/cpython/blob/e1988942ca26440a0df6f3949e93ddc0dbd1e57e/Python/ceval.c
.

Compiled with

/home/ken/GCC-15.0-trunk/bin/gcc -c -fno-strict-overflow -Wsign-compare
-DNDEBUG -g -O3 -Wall -std=c11 -Wextra -Wno-unused-parameter
-Wno-missing-field-initializers -Wstrict-prototypes
-Werror=implicit-function-declaration -fvisibility=hidden -I./Include/internal
-I./Include/internal/mimalloc -I. -I./Include -DPy_BUILD_CORE --save-temps
-o Python/ceval.o Python/ceval.c

Specifically, zoom in on the function _TAIL_CALL_YIELD_VALUE; on GCC 15 it
produces (note the assembly here might be slightly different from the one in
the .s file, because it's from a different build with the same flags passed):
pushq   %rbx
movq    -24(%rdi), %rax
addq    $2, %rcx
subl    $2, %r9d
movq    -8(%rsi), %r8
subq    $8, %rsi
movb    %r9b, -5(%rdi)
movq    %rcx, 56(%rdi)
movq    %rsi, 64(%rdi)
movq    %rax, 120(%rdx)
movq    8(%rdi), %rax
movq    $0, -24(%rdi)
movq    56(%rax), %rcx
movq    64(%rax), %rsi
movq    %rax, 72(%rdx)
addl    $1, 44(%rdx)
movzwl  4(%rcx), %r9d
movq    $0, 8(%rdi)
addq    $4, %rcx
addq    $8, %rsi
movq    $0, 64(%rax)
movl    %r9d, %ebx
movzbl  %r9b, %edi
movq    %r8, -8(%rsi)
movzbl  %bh, %ebx
movq    INSTRUCTION_TABLE(,%rdi,8), %r10
movq    %rdi, %r8
movq    %rax, %rdi
movl    %ebx, %r9d
popq    %rbx
jmp     *%r10
.string "ENTER_EXECUTOR is not supported in this build"

On Clang-19.1, it produces:

movq    %r15, 56(%r12)
movq    -8(%r13), %rcx
addq    $-8, %r13
addq    $2, %r15
movq    %r15, 56(%r12)
addb    $-2, %sil
movb    %sil, -5(%r12)
movq    %r13, 64(%r12)
movq    -24(%r12), %rax
movq    %rax, 120(%r14)
movq    $0, -24(%r12)
incl    44(%r14)
movq    8(%r12), %rax
movq    %rax, 72(%r14)
movq    $0, 8(%r12)
movq    56(%rax), %r15
movq    64(%rax), %r13
movq    $0, 64(%rax)
movq    %rcx, (%r13)
addq    $8, %r13
movzwl  4(%r15), %esi
addq    $4, %r15
movzbl  %sil, %edi
shrl    $8, %esi
leaq    INSTRUCTION_TABLE(%rip), %rcx
movq    %rax, %r12
jmpq    *(%rcx,%rdi,8)

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #6 from Andrew Pinski  ---
(In reply to Ken Jin from comment #5)
> However, it seems to me that there are still extraneous pushes and pops for
> the function prologue/epilogue that could be removed with preserve_none.
> GCC's regalloc is definitely a lot better than Clang's when both don't have
> preserve_none, but with preserve_none it seems that Clang does better
> regalloc. So I think this might still be worth looking at.

Can you provide the preprocessed source where you think the extraneous pushes
and pops happen? It might be a different issue and preserve_none might not
solve it.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-13 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #5 from Ken Jin  ---
However, it seems to me that there are still extraneous pushes and pops for the
function prologue/epilogue that could be removed with preserve_none. GCC's
regalloc is definitely a lot better than Clang's when both don't have
preserve_none, but with preserve_none it seems that Clang does better regalloc.
So I think this might still be worth looking at.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-12 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #4 from Ken Jin  ---
I can confirm that, in the case of tail calls, GCC produces register-spilling
code better than or equivalent to clang 19.1.0's, based on manual inspection of
call sites.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-08 Thread kenjin4096 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

Ken Jin  changed:

   What|Removed |Added

 CC||kenjin4096 at gmail dot com

--- Comment #3 from Ken Jin  ---
Hi, I'm the OP in the CPython issue. I updated the PR to say that it is pure
speculation on my part that GCC produces not-good-enough code without
preserve_none. Sorry for the confusion. I don't have GCC trunk to test with
musttail, but I'm happy to do so after I land that PR in CPython.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-07 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #2 from Richard Sandiford  ---
(In reply to Andrew Pinski from comment #1)
> Note that most of the use cases for these attributes, in my view, are there
> specifically to work around the fact that llvm does not do ipa ra and the
> compiler does not record which registers are already preserved.
I think the use case for preserve_none is a bit different from IPA RA, at least
in the CPython case.  IPA RA is about optimising callers based on information
about callees, but preserve_none is instead about optimising the callees
themselves (regardless of who the caller might be).

If a function consists of a long chain of musttail calls, then it's relatively
unlikely that saving and restoring registers “for the caller” will be
beneficial.  Each call in the musttail chain would need to save and restore the
same call-preserved registers (if the function uses the registers internally).

E.g. if you have f1 tail calling to f2, tail calling to f3, ... tail calling to
f100, and all 100 functions use X19, you'll get 100 saves and restore of X19,
all for one unknown caller.  It's more efficient to tell the caller that it
must preserve X19 itself.
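
A minimal sketch of one link in such a chain (musttail needs Clang or GCC 15;
with preserve_none, the save/restore of x19 below would disappear because the
caller is the one told to preserve it):

/* 'live' must survive the second call, so without preserve_none each of
   f1, f2, ..., f100 saves and restores a call-preserved register (e.g. x19)
   in its own prologue/epilogue, all for the one unknown caller. */
long external_work(long);
long f2(long x);

long f1(long x)
{
    long live = external_work(x);
    live += external_work(live);                /* 'live' spans this call */
    __attribute__((musttail)) return f2(live);  /* next link in the chain */
}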

> I suspect gcc code generation is already decent.
My impression from the CPython issue was that the GCC code quality wasn't
acceptable without the attribute, but I agree that that's implied rather than
explicit.

[Bug target/118328] Implement preserve_none for AArch64

2025-01-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #1 from Andrew Pinski  ---
Note that most of the use cases for these attributes, in my view, are there
specifically to work around the fact that llvm does not do ipa ra and the
compiler does not record which registers are already preserved.

I suspect gcc code generation is already decent.