https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Chasing profile update bugs out of the two hottest functions did not solve the
regression. Moreover, the weekly testers confirm it was not noise on Zens
either.

Before the change we get:

  34.58%  sphinx_livepret  [.] mgau_eval                              ◆
  26.61%  sphinx_livepret  [.] vector_gautbl_eval_logs3               ▒
   8.94%  sphinx_livepret  [.] subvq_mgau_shortlist                   ▒
   7.36%  sphinx_livepret  [.] logs3_add                              ▒
   5.66%  sphinx_livepret  [.] approx_cont_mgau_frame_eval            ▒
   4.68%  sphinx_livepret  [.] mdef_sseq2sen_active                   ▒
   3.38%  sphinx_livepret  [.] dict2pid_comsenscr                     ▒
   1.66%  sphinx_livepret  [.] hmm_vit_eval_3st                       ▒
   0.90%  sphinx_livepret  [.] lextree_hmm_eval                       ▒
   0.73%  sphinx_livepret  [.] lextree_hmm_propagate                  ▒
   0.71%  sphinx_livepret  [.] lextree_enter                          ▒
   0.68%  sphinx_livepret  [.] fe_fft                                 ▒
   0.49%  sphinx_livepret  [.] dict2pid_comsseq2sen_active            ▒
   0.35%  sphinx_livepret  [.] lextree_ssid_active                    ▒
   0.20%  sphinx_livepret  [.] vithist_rescore                        ▒

So the difference seems to be in mgau_eval.
Both versions of mgau_eval have almost the same code layout. The main
difference is register allocation. In the old version we do more spilling
around the call:

  0.01 │       and          $0xffffffffffffffe0,%rsp                  ▒
  0.14 │       mov          %rcx,%rbx                                 ▒
  0.00 │       sub          $0xa0,%rsp                                ▒
  0.04 │       mov          0x10(%rdi),%rax                           ▒
  0.13 │       mov          0x8(%rdi),%r15d                           ▒
  0.01 │       vmovaps      %xmm3,0x80(%rsp)                          ▒
  0.22 │       vmovaps      %xmm2,0x90(%rsp)                          ▒
  0.03 │       mov          %rdi,0x70(%rsp)                           ▒
  0.05 │       lea          (%rax,%rdx,8),%r14                        ▒
  0.01 │       call         log_to_logs3_factor                       ▒
  1.00 │       test         %r13,%r13                                 ▒
  0.00 │       vxorps       %xmm4,%xmm4,%xmm4                         ▒
  0.02 │       vmovsd       %xmm0,0x78(%rsp)                          ▒
  0.00 │       je           433                                       ▒
  0.01 │       movslq       0x0(%r13),%rax                            ▒
  0.02 │       mov          $0xc8000000,%edi                          ▒
  0.01 │       vmovaps      0x90(%rsp),%xmm2                          ▒
  0.23 │       vmovaps      0x80(%rsp),%xmm3                          ▒
  0.09 │       test         %eax,%eax                                 ▒
  0.00 │       js           3f9                                       ▒

The new version is missing the spill of xmm2/3:

  0.02 │       and          $0xffffffffffffffe0,%rsp                  ▒
  0.03 │       mov          %rcx,%rbx                                 ▒
  0.01 │       add          $0xffffffffffffff80,%rsp                  ▒
  0.03 │       mov          0x10(%rdi),%rax                           ▒
  0.16 │       mov          0x8(%rdi),%r15d                           ▒
  0.06 │       mov          %rdi,0x50(%rsp)                           ▒
  0.12 │       lea          (%rax,%rdx,8),%r14                        ▒
  0.01 │       call         log_to_logs3_factor                       ▒
  0.75 │       test         %r12,%r12                                 ▒
  0.00 │       vxorps       %xmm3,%xmm3,%xmm3                         ▒
  0.01 │       vmovsd       %xmm0,0x58(%rsp)                          ▒
  0.01 │       je           3f2                                       ▒
  0.01 │       movslq       (%r12),%rcx                               ▒
  0.00 │       mov          $0xc8000000,%edi                          ▒
       │       test         %ecx,%ecx                                 ▒
  0.14 │       js           3b8                                       ▒

This looks better. log_to_logs3_factor just returns a constant:

Percent│     vmovsd invlogB,%xmm0                                      
       │     ret                                                       

I wonder why we no longer need to spill. log_to_logs3_factor is from another
translation unit and this is a non-LTO build. Maybe there are undefined
variables.
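
For context on why the old code spills at all: under the x86-64 System V ABI
every xmm register is caller-saved, so a floating-point value that is live
across a call to a function the compiler cannot see into has to be kept on the
stack. Below is a minimal sketch of that situation, not the actual sphinx3
code. The names scale_factor and scaled_sum are hypothetical, and the noipa
attribute is only an assumption used here to imitate a callee from another
translation unit in a non-LTO build.

```c
/* noipa is GCC-specific (GCC 8 and later); fall back to noinline elsewhere.
   It keeps the caller from learning which registers the callee clobbers. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define OPAQUE __attribute__((noipa))
#else
#define OPAQUE __attribute__((noinline))
#endif

/* Hypothetical stand-in for log_to_logs3_factor: it only returns a constant,
   but the caller is not allowed to know that. */
OPAQUE double scale_factor(void)
{
    return 0.5;
}

/* a and b (or their sum) must survive the call. Since no xmm register is
   preserved across calls in this ABI, the compiler is expected to store the
   live value to the stack before the call and reload it afterwards. */
double scaled_sum(double a, double b)
{
    double f = scale_factor();
    return (a + b) * f;
}
```

Compiling this with gcc -O2 -S and reading the code around the call to
scale_factor should show whether a store and reload of the live value is
emitted, which is the pattern seen for xmm2/xmm3 in the old mgau_eval.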

New version does:
  0.29 │       vmovhps      %xmm4,0x70(%rsp)                          ▒
  0.11 │       vmovaps      0x70(%rsp),%xmm7                          ▒
and this looks odd.
