https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68483
--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Ah, no, the problem is not on the backend side, but in the veclower2 pass.
Before that pass, after the replacement of the v >> 64 and v >> 32 shifts, we
have:

  vect_sum_15.12_58 = VEC_PERM_EXPR <vect_sum_15.10_57, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
  vect_sum_15.12_59 = vect_sum_15.12_58 + vect_sum_15.10_57;
  vect_sum_15.12_60 = VEC_PERM_EXPR <vect_sum_15.12_59, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
  vect_sum_15.12_61 = vect_sum_15.12_60 + vect_sum_15.12_59;

but veclower2 for some reason decides to lower the latter VEC_PERM_EXPR into:

  _32 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 32>;
  _17 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 64>;
  _23 = BIT_FIELD_REF <vect_sum_15.12_59, 32, 96>;
  vect_sum_15.12_60 = {_32, _17, _23, 0};

The first VEC_PERM_EXPR is kept and generates efficient code.  If I manually
disable the lowering in the debugger, the code regression is gone.