https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328
--- Comment #20 from Ken Jin <kenjin4096 at gmail dot com> ---
(In reply to Andrew Pinski from comment #17)
> I am not sure if I understand this correctly.
> Can you make a simple table:
>
> w/o tail-call - 1
> with tail-call but not preserve_none - XYZ
> with tail-call and preserve_none - PQR

I talked to Diego and this is roughly the table from my understanding:

w/o tail-call - 1
with tail-call but not preserve_none - 0.94
with tail-call and preserve_none - 1

It is pretty clear that tail calls without `preserve_none` are a huge regression. Whether `tail-call and preserve_none` gains a speedup over traditional computed goto/labels-as-values (w/o tail call) is inconclusive. CPython needs PGO[1] and the register pinning (mentioned in Diego's LLVM PR) to produce reliable benchmarking results. However, PGO with musttail is still broken as of right now:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118442
And the preserve_none patch is not pinning registers.

We already introduced tailcall+preserve_none for perf reasons in CPython on Clang. However, even if not for perf reasons, I am also motivated to adopt the tailcall interpreter for a significantly better debugging experience. Each interpreter instruction is now its own function, and can be measured properly by perf and other tools (the previous computed-goto interpreter could not be).

As a side note, GCC 15 is extremely impressive here. GCC 15 w/o tail calls performs roughly the same as tailcall+preserve_none on the pystones benchmark **without PGO**. However, once PGO is enabled on both, clang 19 performs roughly 20% better on pystones than GCC 15 w/o tail calls. So PGO benefits the tail call+preserve_none stuff more than non-tailcall. Hence we can't draw any perf uplift conclusions on CPython yet.

For simplicity, on pystones (different benchmark than Diego's):

Clang-19 w/o tail call no PGO no LTO: much worse than GCC 15
GCC 15 w/o tail call no PGO no LTO: <1
GCC 15 w/o tail call PGO+LTO: 1
Clang-19 with tailcall+preserve_none PGO+LTO: 1.25

[1] Note: this is mostly due to code placement issues in CPython's over 6000 line computed goto interpreter loop.
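
For readers unfamiliar with the shape being discussed, here is a minimal sketch of a tail-call interpreter where each instruction is its own function. This is not CPython's code: the opcodes, handler names, the `run` entry point and the `MUSTTAIL` / `PRESERVE_NONE` macros are all made up for illustration. It assumes a compiler that accepts `__attribute__((musttail))` on a return statement and, where available, `__attribute__((preserve_none))`. When `__has_attribute` reports either as missing, the macro expands to nothing and the dispatch becomes an ordinary call.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif

/* Hypothetical wrapper macros; empty when the compiler lacks the attribute. */
#if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
#else
#  define MUSTTAIL
#endif

#if __has_attribute(preserve_none)
#  define PRESERVE_NONE __attribute__((preserve_none))
#else
#  define PRESERVE_NONE
#endif

/* Toy instruction set, invented for this sketch. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interpreter state travels as arguments so it can stay in registers. */
#define PARAMS const unsigned char *pc, long *sp
#define ARGS   pc, sp

typedef long (PRESERVE_NONE *handler_fn)(PARAMS);

PRESERVE_NONE static long op_push(PARAMS);
PRESERVE_NONE static long op_add(PARAMS);
PRESERVE_NONE static long op_mul(PARAMS);
PRESERVE_NONE static long op_halt(PARAMS);

static const handler_fn dispatch_table[] = {
    [OP_PUSH] = op_push,
    [OP_ADD]  = op_add,
    [OP_MUL]  = op_mul,
    [OP_HALT] = op_halt,
};

/* Jump to the handler of the next opcode instead of returning to a loop. */
#define DISPATCH() MUSTTAIL return dispatch_table[*pc](ARGS)

PRESERVE_NONE static long op_push(PARAMS) {
    *sp++ = (long)pc[1];          /* push the one-byte immediate operand */
    pc += 2;
    DISPATCH();
}

PRESERVE_NONE static long op_add(PARAMS) {
    sp[-2] = sp[-2] + sp[-1];     /* pop two values, push their sum */
    sp--;
    pc += 1;
    DISPATCH();
}

PRESERVE_NONE static long op_mul(PARAMS) {
    sp[-2] = sp[-2] * sp[-1];     /* pop two values, push their product */
    sp--;
    pc += 1;
    DISPATCH();
}

PRESERVE_NONE static long op_halt(PARAMS) {
    (void)pc;
    return sp[-1];                /* result is the top of the stack */
}

/* Entry point: an ordinary call into the first handler. */
long run(const unsigned char *code) {
    long stack[16];               /* no overflow checking in this sketch */
    assert(code != NULL);
    return dispatch_table[code[0]](code, stack);
}

/* (2 + 3) * 4 */
const unsigned char demo_program[] = {
    OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT,
};
```

Calling `run(demo_program)` evaluates (2 + 3) * 4. Because every handler is a real function with its own symbol, a profiler can attribute samples to `op_add`, `op_mul` and so on, which is the debugging benefit mentioned above. The sketch says nothing about performance.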