https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103905
--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
@Jakub: It looks like the problem is in expand_vec_perm_pshufb, where the
permutation vector is recalculated for partial vectors:

      if (vmode == V4QImode
          || vmode == V8QImode)
        {
          rtx m128 = GEN_INT (-128);

          /* Remap elements from the second operand, as we have to
             account for inactive top elements from the first operand.  */
          if (!d->one_operand_p)
            {
              int sz = GET_MODE_SIZE (vmode);

              for (i = 0; i < nelt; ++i)
                {
                  int ival = INTVAL (rperm[i]);
                  if (ival >= sz)
                    ival += 16-sz;
                  rperm[i] = GEN_INT (ival);
                }
            }

          /* V4QI/V8QI is emulated with V16QI instruction, fill inactive
             elements in the top positions with zeros.  */
          for (i = nelt; i < 16; ++i)
            rperm[i] = m128;

          vpmode = V16QImode;
        }

I must admit I only eyeballed the generated code, so perhaps there lies the
dragon.
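To make the intent of the remapping easier to check by hand, here is a minimal standalone sketch in plain C. It is not GCC code. It assumes a simplified model of a two-source byte shuffle in which selectors 0..15 pick from the first 16-byte register, 16..31 pick from the second, and a selector with the high bit set (such as -128) yields zero. The names shuffle2, remap_partial and demo_interleave are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED model of a two-source 16-byte shuffle (not a GCC or ISA API):
   selector 0..15 reads a[], 16..31 reads b[], high bit set gives zero.  */
static void
shuffle2 (const uint8_t a[16], const uint8_t b[16],
          const int8_t sel[16], uint8_t out[16])
{
  for (int i = 0; i < 16; ++i)
    {
      if (sel[i] < 0)
        out[i] = 0;
      else
        {
          int s = sel[i] & 31;
          out[i] = s < 16 ? a[s] : b[s - 16];
        }
    }
}

/* Mirror of the quoted remapping: a partial vector of SZ bytes lives in
   the low part of a 16-byte register, so an index that refers to the
   second operand (>= SZ) must skip the 16 - SZ inactive top bytes of the
   first operand.  Unused selector slots are set to -128 (zero fill).  */
static void
remap_partial (const int *perm, int nelt, int sz, int one_operand_p,
               int8_t sel[16])
{
  assert (nelt <= 16 && sz <= 16);
  for (int i = 0; i < nelt; ++i)
    {
      int ival = perm[i];
      if (!one_operand_p && ival >= sz)
        ival += 16 - sz;
      sel[i] = (int8_t) ival;
    }
  for (int i = nelt; i < 16; ++i)
    sel[i] = -128;
}

/* Hypothetical V8QI example: interleave the low four bytes of two
   8-byte operands, each stored in the low half of a 16-byte buffer.  */
void
demo_interleave (uint8_t out[16])
{
  uint8_t a[16] = { 0 }, b[16] = { 0 };
  int8_t sel[16];
  static const int perm[8] = { 0, 8, 1, 9, 2, 10, 3, 11 };

  for (int i = 0; i < 8; ++i)
    {
      a[i] = (uint8_t) (0xA0 + i);
      b[i] = (uint8_t) (0xB0 + i);
    }
  remap_partial (perm, 8, 8, 0, sel);
  shuffle2 (a, b, sel, out);
}
```

Under this model, index 8 (the first byte of the second operand) becomes selector 16, so demo_interleave fills out with A0 B0 A1 B1 A2 B2 A3 B3 followed by eight zero bytes. Without the remap, selector 8 would read an inactive top byte of the first operand instead.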