https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77438
Bug ID: 77438
Summary: MMX intrinsic on x86_64 generates bloated code
Product: gcc
Version: 4.8.4
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: acahalan at gmail dot com
Target Milestone: ---

__m64 __attribute__((noinline)) mmx(__m64 x, __m64 y){return _mm_add_pi8(x,y);}

That gives 6 lines of assembly (movq, movdq2q, paddb, movq, movq, ret). Stuff even gets moved to the stack.

Good code would just do the operation in an xmm register instead of moving it to a mm register. Failing that, gcc could at least avoid using the stack.
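
For comparison, here is a rough sketch of the same byte-wise add written by hand with SSE2 intrinsics, so that the work is expressed on xmm registers rather than mm registers. The function name add_bytes_sse2 and the use of uint64_t to carry the eight packed bytes are assumptions made for illustration; they are not part of the original test case, and this is not a claim about what gcc emits for it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_add_epi8 and friends */
#include <stdint.h>

/* Hypothetical helper (not from the report): adds eight packed bytes,
 * wrapping per byte like paddb, using only SSE2 intrinsics.
 * Requires x86_64, since _mm_cvtsi64_si128 and _mm_cvtsi128_si64
 * are 64-bit-only intrinsics. */
uint64_t add_bytes_sse2(uint64_t x, uint64_t y)
{
    /* Place each 64-bit value in the low half of a 128-bit vector. */
    __m128i vx = _mm_cvtsi64_si128((long long)x);
    __m128i vy = _mm_cvtsi64_si128((long long)y);

    /* Byte-wise add; the upper eight bytes are zero and are ignored. */
    __m128i sum = _mm_add_epi8(vx, vy);

    /* Extract the low 64 bits as the result. */
    return (uint64_t)_mm_cvtsi128_si64(sum);
}
```

Comparing the output of "gcc -O2 -S" for this function against the mmx() test case above may help show how much of the extra code comes from the move into an mm register.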