https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649
--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Chasing profile update bugs out of the hottest two functions did not solve the
regression. Moreover, the weekly testers confirm it was not noise on Zens
either.

Before the change we get:

  34.58%  sphinx_livepret  [.] mgau_eval
  26.61%  sphinx_livepret  [.] vector_gautbl_eval_logs3
   8.94%  sphinx_livepret  [.] subvq_mgau_shortlist
   7.36%  sphinx_livepret  [.] logs3_add
   5.66%  sphinx_livepret  [.] approx_cont_mgau_frame_eval
   4.68%  sphinx_livepret  [.] mdef_sseq2sen_active
   3.38%  sphinx_livepret  [.] dict2pid_comsenscr
   1.66%  sphinx_livepret  [.] hmm_vit_eval_3st
   0.90%  sphinx_livepret  [.] lextree_hmm_eval
   0.73%  sphinx_livepret  [.] lextree_hmm_propagate
   0.71%  sphinx_livepret  [.] lextree_enter
   0.68%  sphinx_livepret  [.] fe_fft
   0.49%  sphinx_livepret  [.] dict2pid_comsseq2sen_active
   0.35%  sphinx_livepret  [.] lextree_ssid_active
   0.20%  sphinx_livepret  [.] vithist_rescore

So the difference seems to be in mgau_eval. Both versions of mgau_eval have
almost the same code layout. The main difference is register allocation.
In the old version we do more spilling around the call:

  0.01 │ and     $0xffffffffffffffe0,%rsp
  0.14 │ mov     %rcx,%rbx
  0.00 │ sub     $0xa0,%rsp
  0.04 │ mov     0x10(%rdi),%rax
  0.13 │ mov     0x8(%rdi),%r15d
  0.01 │ vmovaps %xmm3,0x80(%rsp)
  0.22 │ vmovaps %xmm2,0x90(%rsp)
  0.03 │ mov     %rdi,0x70(%rsp)
  0.05 │ lea     (%rax,%rdx,8),%r14
  0.01 │ call    log_to_logs3_factor
  1.00 │ test    %r13,%r13
  0.00 │ vxorps  %xmm4,%xmm4,%xmm4
  0.02 │ vmovsd  %xmm0,0x78(%rsp)
  0.00 │ je      433
  0.01 │ movslq  0x0(%r13),%rax
  0.02 │ mov     $0xc8000000,%edi
  0.01 │ vmovaps 0x90(%rsp),%xmm2
  0.23 │ vmovaps 0x80(%rsp),%xmm3
  0.09 │ test    %eax,%eax
  0.00 │ js      3f9

The new version is missing the spill of xmm2/3:

  0.02 │ and     $0xffffffffffffffe0,%rsp
  0.03 │ mov     %rcx,%rbx
  0.01 │ add     $0xffffffffffffff80,%rsp
  0.03 │ mov     0x10(%rdi),%rax
  0.16 │ mov     0x8(%rdi),%r15d
  0.06 │ mov     %rdi,0x50(%rsp)
  0.12 │ lea     (%rax,%rdx,8),%r14
  0.01 │ call    log_to_logs3_factor
  0.75 │ test    %r12,%r12
  0.00 │ vxorps  %xmm3,%xmm3,%xmm3
  0.01 │ vmovsd  %xmm0,0x58(%rsp)
  0.01 │ je      3f2
  0.01 │ movslq  (%r12),%rcx
  0.00 │ mov     $0xc8000000,%edi
       │ test    %ecx,%ecx
  0.14 │ js      3b8

Which looks better. log_to_logs3_factor just returns a constant:

  Percent│ vmovsd  invlogB,%xmm0
         │ ret

I wonder why we no longer need to spill. log_to_logs3_factor is from another
translation unit and this is a non-LTO build. Maybe there are undefined
variables. The new version does:

  0.29 │ vmovhps %xmm4,0x70(%rsp)
  0.11 │ vmovaps 0x70(%rsp),%xmm7

and this looks odd.
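
For reference, here is a minimal sketch of the pattern under discussion: floating-point values that stay live across a call to a function that only returns a constant. It is not taken from sphinx3. The names factor_demo, invlogB_demo and scale_sum_demo are hypothetical stand-ins. It assumes the x86-64 System V ABI, where every xmm register is call-clobbered, so a value needed after an opaque call has to be kept on the stack or recomputed.

```c
/* Hypothetical stand-in for the constant that log_to_logs3_factor loads. */
double invlogB_demo = 0.5;

/* Hypothetical stand-in for log_to_logs3_factor: it only returns a constant.
   noinline keeps it a real call; in the bug the callee lives in another
   translation unit, which this single-file sketch does not reproduce. */
__attribute__((noinline))
double factor_demo(void)
{
    return invlogB_demo;
}

/* a and b arrive in %xmm0 and %xmm1. Whatever the compiler keeps of them
   (the two inputs or their sum) is live across the call below. Under the
   x86-64 System V ABI no xmm register is preserved by a callee, so unless
   the compiler can see what the callee clobbers, it must spill. */
double scale_sum_demo(double a, double b)
{
    double f = factor_demo();   /* result comes back in %xmm0 */
    return (a + b) * f;
}
```

Note that inside one translation unit GCC's -fipa-ra (enabled at -O2) may see that the callee leaves the other registers alone and drop the spill, so comparing against the listings above would need the callee moved to a separate file.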