https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

--- Comment #20 from Ken Jin <kenjin4096 at gmail dot com> ---
(In reply to Andrew Pinski from comment #17)
> I am not sure if I understand this correctly.
> Can you make a simple table:
> 
> w/o tail-call                         - 1
> with tail-call but not preserve_none  - XYZ
> with tail-call and preserve_none      - PQR

I talked to Diego, and to my understanding the table is roughly:

w/o tail-call                         - 1
with tail-call but not preserve_none  - 0.94
with tail-call and preserve_none      - 1

It is pretty clear that tail calls without `preserve_none` are a huge
regression. Whether tail calls with `preserve_none` gain a speedup over
traditional computed goto/labels-as-values (w/o tail call) is inconclusive.
CPython needs PGO[1] and the register pinning (mentioned in Diego's LLVM PR)
to produce reliable benchmarking results.
However, PGO with musttail is still broken as of right now
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118442), and the preserve_none
patch does not pin registers.

We already introduced tailcall+preserve_none for perf reasons in CPython on
Clang. However, even setting perf aside, I am also motivated to adopt the
tailcall interpreter for a significantly better debugging experience. Each
interpreter instruction is now its own function and can be measured properly
by perf and other tools (which was not possible with the previous computed
goto interpreter).
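
To illustrate the "each instruction is its own function" point, here is a
minimal sketch of tail-call dispatch. It is not CPython's actual code: the
opcodes, the `TAIL_RETURN` macro, and `run_program` are all hypothetical names
made up for this example, and `preserve_none` is left out to keep the sketch
portable. Where the compiler does not report support for the `musttail`
statement attribute, the macro falls back to a plain `return`, so the tail
call is then only an optimization and not a guarantee.

```c
#include <stddef.h>

/* Hypothetical helper: ask for a guaranteed tail call when the compiler
   reports support for the musttail statement attribute; otherwise use a
   plain return, which the optimizer may or may not turn into a jump. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define TAIL_RETURN __attribute__((musttail)) return
#  endif
#endif
#ifndef TAIL_RETURN
#  define TAIL_RETURN return
#endif

/* Toy bytecode, invented for this sketch. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Every handler shares one signature so each can tail-call the next. */
typedef long (*handler_fn)(const unsigned char *pc, long acc);

static long op_inc(const unsigned char *pc, long acc);
static long op_double(const unsigned char *pc, long acc);
static long op_halt(const unsigned char *pc, long acc);

/* Indexed by opcode value. */
static const handler_fn dispatch_table[] = { op_inc, op_double, op_halt };

/* Each instruction is a separate function, so a profiler can attribute
   samples to it by symbol name. */
static long op_inc(const unsigned char *pc, long acc)
{
    acc += 1;
    pc += 1;                                   /* advance to next opcode */
    TAIL_RETURN dispatch_table[*pc](pc, acc);  /* dispatch without growing the stack */
}

static long op_double(const unsigned char *pc, long acc)
{
    acc *= 2;
    pc += 1;
    TAIL_RETURN dispatch_table[*pc](pc, acc);
}

static long op_halt(const unsigned char *pc, long acc)
{
    (void)pc;    /* unused: execution stops here */
    return acc;
}

/* Entry point: start dispatching at the first opcode with acc = 0.
   The program must end with OP_HALT. */
static long run_program(const unsigned char *code)
{
    return dispatch_table[code[0]](code, 0);
}
```

Calling `run_program` on the sequence `OP_INC, OP_INC, OP_DOUBLE, OP_INC,
OP_HALT` walks the handlers one by one and returns the final accumulator.
In a computed goto interpreter the same three instructions would be labels
inside one large function, which is what makes per-instruction measurement
hard.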

As a side note, GCC 15 is extremely impressive here. GCC 15 w/o tail calls
performs roughly the same as tailcall+preserve_none on the pystones benchmark
**without PGO**. However, once PGO is enabled on both, Clang 19 performs
roughly 20% better on pystones than GCC 15 w/o tail calls. So PGO benefits the
tailcall+preserve_none configuration more than the non-tailcall one, which is
why we can't draw any conclusions about perf uplift on CPython yet.

For simplicity, on pystones (different benchmark than Diego's):

Clang-19 w/o tail call no PGO no LTO:              much worse than GCC 15
GCC 15 w/o tail call no PGO no LTO:                <1
GCC 15 w/o tail call PGO+LTO:                      1
Clang-19 with tailcall+preserve_none PGO+LTO:      1.25

[1] Note: this is mostly due to code placement issues in CPython's computed
goto interpreter loop, which is over 6000 lines long.
