https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

            Bug ID: 105504
           Summary: Fails to break dependency for vcvtss2sd xmm, xmm, mem
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 52933
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52933&action=edit
testcase

Hit by the core-math team at
https://gcc.gnu.org/pipermail/gcc-help/2022-May/141480.html

Compile the attached testcase with -O2 -march=haswell (other AVX-capable Intel
families except Alderlake are affected too) and observe that the big basic
block begins with

.L6:
        vcvtss2sd       xmm1, xmm1, DWORD PTR [rsp-4]

This creates a false dependency on the previous assignment into xmm1, resulting
in wildly varying (and suboptimal) throughput figures depending on how long the
CPU stalls waiting for the previous assignment to complete.

GCC has code to emit such instructions in a manner that avoids false
dependencies (see e.g. PR89071), but here it doesn't seem to work.
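
For illustration, the dependency-breaking idiom can be spelled out at the
source level with SSE2 intrinsics. This is a minimal sketch, not the attached
testcase and not GCC's implementation: the name widen_zeroed is hypothetical,
and whether a given compiler actually emits a zeroing vxorps/vxorpd ahead of
the conversion is an assumption to verify in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: widen a float loaded from memory to double.
 * On SSE2 targets the upper half of the result comes from an explicitly
 * zeroed register, so the conversion's pass-through operand does not
 * depend on whatever last wrote the destination register. */
static double widen_zeroed(const float *p)
{
    assert(p != NULL);
#if defined(__SSE2__)
    __m128d zero = _mm_setzero_pd();   /* fresh value, no prior producer */
    __m128  src  = _mm_load_ss(p);     /* scalar float load */
    return _mm_cvtsd_f64(_mm_cvtss_sd(zero, src));
#else
    /* Portable fallback for non-x86 targets (assumption: plain cast). */
    return (double)*p;
#endif
}
```

Calling widen_zeroed(&f) with f = 1.5f returns 1.5; the interesting part is
not the value but whether the instruction preceding vcvtss2sd in the output
clears the destination register.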


Also there's a potentially related issue: GCC copies the initial xmm0 value
to eax via the stack at the beginning of the function:

cr_exp10f:
        vmovss  DWORD PTR [rsp-4], xmm0
        mov     eax, DWORD PTR [rsp-4]

This seems wrong, since direct moves from an xmm register to a general-purpose
register (vmovd) are 1 cycle on Haswell afaict.
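
A copy from xmm0 to eax of this kind usually originates from reinterpreting
the bits of a float argument as an integer. The sketch below is an assumption
about the general source pattern, not an excerpt from the attached testcase;
the name float_bits is hypothetical. It can be compiled with the options above
to check whether the result goes through the stack or through a direct vmovd.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the IEEE-754 bit pattern of a float.
 * memcpy is the well-defined way to reinterpret the bytes; compilers
 * normally reduce it to a single register move or a store/load pair. */
static uint32_t float_bits(float x)
{
    uint32_t u;
    static_assert(sizeof u == sizeof x, "float must be 32 bits wide");
    memcpy(&u, &x, sizeof u);
    return u;
}
```

On an IEEE-754 target, float_bits(1.0f) yields 0x3f800000.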
