https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70434
Bug ID: 70434
Summary: adding an extraneous cast to vector type results in different code
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: zsojka at seznam dot cz
Target Milestone: ---

Created attachment 38119
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38119&action=edit
testcases

Originally observed in PR70421.

When the attached code is compiled, barN() results in different code from fooN(), even though the only difference is a useless cast of 'vNsi v' to 'vNsi'.

For example, v4si on x86_64 with -O3 -mavx512f -masm=intel:

foo4:
        vpextrd edx, xmm0, 1
        vmovd   eax, xmm0
        movsx   rdi, edi
        xor     eax, edx
        vpinsrd xmm1, xmm0, eax, 0
        vmovaps XMMWORD PTR [rsp-24], xmm1
        mov     eax, DWORD PTR [rsp-24+rdi*4]
        ret

bar4:
        vmovaps XMMWORD PTR [rsp-24], xmm0
        movsx   rdi, edi
        mov     eax, DWORD PTR [rsp-20]
        xor     DWORD PTR [rsp-24], eax
        mov     eax, DWORD PTR [rsp-24+rdi*4]
        ret

I haven't benchmarked which one is faster, but why is the code different at all?

For the foo32/bar32 case, bar32 is certainly faster, because foo32 creates an extra copy of the variable on the stack.
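The attached testcase is not reproduced in this message. As a rough illustration only, the following sketch shows the kind of function pair that would be consistent with the v4si assembly above. The function bodies, the parameter order, and the exact placement of the cast are assumptions inferred from the listing, not the contents of attachment 38119.

```c
/* Hypothetical reconstruction; NOT the actual attachment 38119.
   Uses the GCC vector extension (vector_size attribute). */
typedef int v4si __attribute__((vector_size(16)));

/* No cast: update element 0, then read element i. */
int foo4(int i, v4si v)
{
    v[0] ^= v[1];
    return v[i];
}

/* Same operation, but v is redundantly cast to its own type
   before being subscripted (assumed location of the cast). */
int bar4(int i, v4si v)
{
    v[0] ^= v[1];
    return ((v4si)v)[i];
}
```

Both functions compute the same value for any in-range i, so any difference in the generated code comes from how the compiler treats the no-op cast. Compiling a file like this with gcc -O3 -mavx512f -masm=intel -S would let the two outputs be compared side by side.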