https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119628
--- Comment #23 from Ken Jin <kenjin4096 at gmail dot com> ---
> Hi Ken, my patch has been merged into GCC master branch. Can you give it a
> try?

I ran a benchmark. Note that this is not 100% what we use in CPython release builds, as I had to pass `-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer` to all my configurations to get the main branch of GCC to not miscompile the current code. LTO+PGO was enabled for all configurations; I disabled PGO only around the tail call bytecode handlers, as it regressed performance for those. Intel Turbo Boost was off.

NO preserve_none:
Pystone(1.1) time for 1000000 passes = 1.98081
This machine benchmarks at 504844 pystones/second

preserve_none:
Pystone(1.1) time for 1000000 passes = 1.7661
This machine benchmarks at 566219 pystones/second

I also took some benchmarks from the pyperformance benchmark suite that are Python-heavy. Specifically, nbody, spectral_norm, and deltablue.

Mean +- std dev: [NO_preserve_none_nbody] 108 ms +- 2 ms -> [preserve_none_nbody] 95.3 ms +- 2.0 ms: 1.13x faster
Mean +- std dev: [NO_preserve_none_spectralnorm] 95.7 ms +- 0.4 ms -> [preserve_none_spectralnorm] 83.8 ms +- 0.3 ms: 1.14x faster
Mean +- std dev: [NO_preserve_none_deltablue] 3.59 ms +- 0.03 ms -> [preserve_none_deltablue] 3.24 ms +- 0.02 ms: 1.11x faster

So it seems the actual speedup is in the ~10% range for preserve_none vs no_preserve_none (1.11x to 1.14x on the runs above).

On my system, labels-as-values (indirect goto) performs roughly the same as preserve_none + tail calls. However, note that PGO is disabled for the tail call handlers, and CPython has been optimizing for the indirect goto style for over 10 years! So the fact that the performance matches is actually incredibly good.
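
For readers unfamiliar with the "preserve_none + tail calls" style being benchmarked, here is a minimal sketch of the dispatch pattern. It is not CPython's code: the three-opcode bytecode, the handler names, and `sum_to` are all hypothetical and exist only for illustration. The macros assume the compiler reports the `preserve_none` and `musttail` attributes through `__has_attribute`; where it does not, they fall back to ordinary calls, so the sketch should still build but without the guaranteed tail call.

```c
#include <assert.h>
#include <stdint.h>

/* Fall back to ordinary calls when the compiler lacks either attribute. */
#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif
#if __has_attribute(preserve_none)
#  define PRESERVE_NONE __attribute__((preserve_none))
#else
#  define PRESERVE_NONE
#endif
#if __has_attribute(musttail)
#  define TAIL_RETURN __attribute__((musttail)) return
#else
#  define TAIL_RETURN return
#endif

/* Hypothetical toy bytecode, not CPython's instruction set. */
enum { OP_ADD_N, OP_DEC_JNZ, OP_HALT };
typedef struct { uint8_t op; int8_t jump; } insn;

/* Every handler shares one signature so that each can tail-call any other. */
PRESERVE_NONE typedef int64_t (*handler_fn)(const insn *pc, int64_t acc, int64_t n);

PRESERVE_NONE static int64_t op_add_n(const insn *pc, int64_t acc, int64_t n);
PRESERVE_NONE static int64_t op_dec_jnz(const insn *pc, int64_t acc, int64_t n);
PRESERVE_NONE static int64_t op_halt(const insn *pc, int64_t acc, int64_t n);

static const handler_fn dispatch_table[] = {
    [OP_ADD_N] = op_add_n,
    [OP_DEC_JNZ] = op_dec_jnz,
    [OP_HALT] = op_halt,
};

/* Transfer control to the handler for the instruction at `next`. */
#define DISPATCH(next, acc, n) \
    TAIL_RETURN dispatch_table[(next)->op]((next), (acc), (n))

/* acc += n, then fall through to the next instruction. */
PRESERVE_NONE static int64_t op_add_n(const insn *pc, int64_t acc, int64_t n) {
    acc += n;
    DISPATCH(pc + 1, acc, n);
}

/* Decrement n; take the relative jump while n is still nonzero. */
PRESERVE_NONE static int64_t op_dec_jnz(const insn *pc, int64_t acc, int64_t n) {
    n -= 1;
    const insn *next = (n != 0) ? pc + pc->jump : pc + 1;
    DISPATCH(next, acc, n);
}

/* Stop interpreting and hand the accumulator back to the caller. */
PRESERVE_NONE static int64_t op_halt(const insn *pc, int64_t acc, int64_t n) {
    (void)pc;
    (void)n;
    return acc;
}

/* Computes n + (n-1) + ... + 1 by interpreting a three-instruction loop. */
int64_t sum_to(int64_t n) {
    static const insn program[] = {
        {OP_ADD_N, 0},
        {OP_DEC_JNZ, -1},
        {OP_HALT, 0},
    };
    assert(n >= 1); /* the loop only terminates for positive n */
    return dispatch_table[program[0].op](program, 0, n);
}
```

Calling `sum_to(10)` runs the loop ten times and returns 55. The interpreter state (`pc`, `acc`, `n`) travels as arguments from handler to handler, which is why a calling convention with no callee-saved registers can help: handlers never return to each other, so saving and restoring registers around each one is wasted work. The labels-as-values alternative keeps all handlers in one function and jumps between them with `goto *table[op]` instead.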