https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63756
            Bug ID: 63756
           Summary: _mm_cvtepi16_epi32 with a memory operand produces
                    either broken or slow asm
           Product: gcc
           Version: 4.9.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tterribe at xiph dot org

Created attachment 33900
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33900&action=edit
Reduced testcase

With optimizations enabled, the call

  _mm_cvtepi16_epi32(*(__m128i *)x)

for some pointer x produces the asm

  pmovsxwd  (%rax), %xmm0

which is all well and good, and what was intended. However, with
optimizations disabled, the same code produces

  movdqa    (%rax), %xmm0
  movaps    %xmm0, -48(%rbp)
  movdqa    -48(%rbp), %xmm0
  pmovsxwd  %xmm0, %xmm0

The problem here is that the initial movdqa has added a 16-byte alignment
requirement and reads 8 bytes past where the original pmovsxwd would have
stopped (pmovsxwd with a memory operand loads only 8 bytes and imposes no
alignment requirement). This is very much not equivalent, and it causes
crashes in code that runs just fine in the optimized version.

_mm_cvtepi16_epi32() takes an __m128i argument, and the dereference happens
before the function call. Even though the asm instruction it stands in for
can do the load and the conversion together, we don't have a single
intrinsic that specifies exactly that. None of these semantics are very
well documented anywhere, and I can understand why the compiler might think
it has the right to do what it did. So I tried the following code instead:

  _mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)x))

With optimizations disabled, this produces the long-winded, but at least
correct, asm:

  movq      (%rax), %rax
  movl      $0, %edx
  movq      %rdx, -128(%rbp)
  movq      %rax, -120(%rbp)
  movq      -120(%rbp), %rax
  movq      -128(%rbp), %rdx
  movq      %rdx, -112(%rbp)
  movq      %rax, -104(%rbp)
  movq      -112(%rbp), %rax
  movq      -104(%rbp), %xmm0
  pinsrq    $1, %rax, %xmm0
  movaps    %xmm0, -64(%rbp)
  movdqa    -64(%rbp), %xmm0
  pmovsxwd  %xmm0, %xmm0

So that's all good: the initial movq load has the same access size and
(lack of) alignment requirement as pmovsxwd's memory operand, so we haven't
added any extra alignment requirements or read any extra data. Turning
optimizations back on, one might reasonably expect the optimizer to
collapse the two intrinsics into the same single instruction it emitted
before, since they should, in fact, be equivalent to what that instruction
did. However, the asm one gets instead is

  pxor      %xmm0, %xmm0
  pinsrq    $0, (%rax), %xmm0
  pmovsxwd  %xmm0, %xmm0

That is 3 instructions, 4 uops, and at least a 4-cycle latency for what
should have been 1 instruction, 1 fused-domain uop, and at least a 2-cycle
latency. It makes a noticeable difference.

My current workaround is to wrap these operations in a macro that, #ifdef
__OPTIMIZE__, leaves out the _mm_loadl_epi64(), but otherwise includes it.
However, that seems moderately terrible, and like something the compiler
could choose to break at any time.
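
For reference, here is a minimal sketch of that wrapper macro; the name
LOADL_CVTEPI16_EPI32 is illustrative only, not from the attached testcase:

  /* Sketch of the workaround described above. Assumes x points to 8 bytes
     of int16_t data with no 16-byte alignment guarantee. */
  #include <smmintrin.h>

  #ifdef __OPTIMIZE__
  /* Optimized builds fold the dereference into pmovsxwd's memory operand,
     so no extra load intrinsic is needed. */
  # define LOADL_CVTEPI16_EPI32(x) \
      (_mm_cvtepi16_epi32(*(__m128i *)(x)))
  #else
  /* Unoptimized builds: use an explicit 8-byte load so the compiler never
     emits a 16-byte, aligned movdqa. */
  # define LOADL_CVTEPI16_EPI32(x) \
      (_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)(x))))
  #endif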
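
And to make the crash described above concrete, here is a minimal sketch of
the failure mode (my own illustration, not the attached testcase; it assumes
4096-byte pages). The data is only 8-byte aligned and sits at the very end
of a mapped page, which the 8-byte pmovsxwd load permits, but the movdqa
emitted at -O0 both requires 16-byte alignment and reads 16 bytes, so it
faults:

  /* Build with: gcc -msse4.1 crash.c
     Runs fine at -O2, crashes at -O0. Illustrative only. */
  #include <smmintrin.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
    size_t page = 4096;
    /* Map two pages, then unmap the second so any access past the first
       page faults. */
    char *buf = mmap(NULL, 2*page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    munmap(buf + page, page);
    /* 8 bytes of int16_t data at the very end of the mapped page: 8-byte
       aligned, so a 16-byte aligned load is illegal, and a 16-byte read
       would run off the end of the page. */
    int16_t *x = (int16_t *)(buf + page - 8);
    memset(x, 0, 8);
    __m128i v = _mm_cvtepi16_epi32(*(__m128i *)x);
    /* Keep v live so the conversion is not optimized away. */
    printf("%d\n", _mm_extract_epi32(v, 0));
    return 0;
  }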