https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the remaining piece may be the init-regs issue.  We have

  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;

which leaves the upper elements undefined, but init-regs forces them to zero.

Another issue is that in

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  vf_24 = BIT_INSERT_EXPR <vf_23(D), _26, 0 (32 bits)>;
  _25 = __builtin_ia32_shufps (vf_24, vf_24, 0);

the shufps is not exposed to gimple optimizations and thus we can't simplify it in any way.  Only the backend knows that it could be simplified to

  _25 = __builtin_ia32_shufps (v_13(D), v_13(D), 85);

so the backend might want to "expand" __builtin_ia32_shufps to a VEC_PERM_EXPR in its target-specific builtin folding hook (making sure the reverse works well enough, obviously).
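
To illustrate why the immediate becomes 85, here is a minimal scalar model in plain C. It is a sketch only, not GCC code: the helper names (shufps_imm_to_sel, vec_perm4, original_sequence, simplified_sequence) are made up for this example, and it assumes both vectors are 4 x float. The selector layout follows the VEC_PERM_EXPR convention, where indices 0..3 pick from the first operand and 4..7 from the second.

```c
#include <assert.h>
#include <stddef.h>

/* Decode an 8-bit shufps immediate into a 4-element selector in the
   style of VEC_PERM_EXPR: 0..3 select from the first operand,
   4..7 select from the second operand.  */
static void shufps_imm_to_sel(unsigned imm, unsigned sel[4])
{
    assert(imm <= 0xff);
    sel[0] = imm & 3;                /* low two result lanes come from operand 1 */
    sel[1] = (imm >> 2) & 3;
    sel[2] = 4 + ((imm >> 4) & 3);   /* high two result lanes come from operand 2 */
    sel[3] = 4 + ((imm >> 6) & 3);
}

/* Apply such a selector to two 4-element float vectors.  */
static void vec_perm4(const float a[4], const float b[4],
                      const unsigned sel[4], float out[4])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = sel[i] < 4 ? a[sel[i]] : b[sel[i] - 4];
}

/* Model of the sequence in the dump: take element 1 of v, insert it
   as element 0 of an otherwise undefined vf, then shufps (vf, vf, 0).
   The "undefined" lanes are passed in explicitly as vf_undef.  */
static void original_sequence(const float v[4], const float vf_undef[4],
                              float out[4])
{
    float vf[4] = { vf_undef[0], vf_undef[1], vf_undef[2], vf_undef[3] };
    unsigned sel[4];

    vf[0] = v[1];                    /* BIT_FIELD_REF <v, 32, 32> into bit 0 */
    shufps_imm_to_sel(0, sel);
    vec_perm4(vf, vf, sel, out);
}

/* Model of the simplified form: shufps (v, v, 85).  */
static void simplified_sequence(const float v[4], float out[4])
{
    unsigned sel[4];

    shufps_imm_to_sel(85, sel);
    vec_perm4(v, v, sel, out);
}
```

In this model both sequences broadcast element 1 of v to all four lanes, whatever the undefined lanes of vf hold, and immediate 85 (0x55) decodes to the selector { 1, 1, 5, 5 }.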