https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504
            Bug ID: 105504
           Summary: Fails to break dependency for vcvtss2sd xmm, xmm, mem
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 52933
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52933&action=edit
testcase

Hit by core-math team at
https://gcc.gnu.org/pipermail/gcc-help/2022-May/141480.html

Compile the attached testcase with -O2 -march=haswell (other AVX-capable Intel
families except Alderlake are affected too) and observe that the big basic
block begins with

.L6:
        vcvtss2sd       xmm1, xmm1, DWORD PTR [rsp-4]

This creates a false dependency on the previous assignment into xmm1, resulting
in wildly varying (and suboptimal) throughput figures depending on how long the
CPU stalls waiting for the previous assignment to complete.

GCC has code to emit such instructions in a manner that avoids false
dependencies (see e.g. PR89071), but here it doesn't seem to work.

Also there's a potentially related issue that GCC copies the initial xmm0 value
to eax via stack in the beginning of the function:

cr_exp10f:
        vmovss  DWORD PTR [rsp-4], xmm0
        mov     eax, DWORD PTR [rsp-4]

This seems wrong since xmm-reg moves on Haswell are 1 cycle afaict.
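
For readers without the attachment, here is a minimal sketch of the kind of
source that involves both patterns mentioned above: a float-to-double
conversion (the cvtss2sd family) and a bit-level copy of a float into an
integer register. It is a hypothetical reduction, not the attached testcase.
The names float_bits and widen_after_bits are made up for illustration, and it
has not been checked whether GCC emits the same vcvtss2sd xmm, xmm, mem form or
the same store/reload through the stack for this code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the attachment): reinterpret the bits of a
   float as a 32-bit integer. memcpy is the well-defined way to do this in C;
   ideally it compiles to a direct xmm-to-GPR move rather than a round trip
   through the stack. */
static inline uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Hypothetical function (not from the attachment): uses the same float both
   as raw bits and as a double. The (double) cast is the float-to-double
   conversion that maps to cvtss2sd / vcvtss2sd on x86-64. */
double widen_after_bits(float x, uint32_t *bits_out)
{
    *bits_out = float_bits(x);
    return (double)x;
}
```

To inspect the generated code, something like gcc -O2 -march=haswell -S
-masm=intel on this file should be enough. As far as I understand it, a
dependency-breaking sequence would either zero the destination register first
(for example with vxorps) or load the float with vmovss and then convert
register-to-register, instead of merging directly from memory into a register
whose old value is still in flight.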