https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107563

--- Comment #15 from Levy Hsu <admin at levyhsu dot com> ---
Tree (lower/tree) dump:
https://godbolt.org/z/o7GrvjMqq
slow_rotate still contains a single wide
VEC_PERM_EXPR <_1, _1, {3,0,1,2,7,4,5,6}>
while fast_rotate is already expressed as element extracts + vector
constructor.

RTL (expand) dump:
https://godbolt.org/z/WT9cqbx7h
fast_rotate expands to two 128-bit vec_select:V4SI shuffles (one per 16B half),
which is the expected shape to select pshufd on an SSE2 baseline. In contrast,
slow_rotate expands to scalar loads/stores (no vector perm/select remains), so
the backend never sees a permute it can map to pshufd.

So this looks like a generic vector-lowering / tree -> RTL expansion gap for
32B VEC_PERM_EXPR on SSE2 targets, where no native 32B vector mode exists:
masks whose indices stay within each 128-bit half could be decomposed into
two 16B perms, but currently fall back to scalarization.
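
To make the proposed decomposition concrete, here is a minimal C sketch using
GCC's generic vector extensions. It is not the testcase from the Godbolt
links; rotate_wide, rotate_split and mask_stays_in_halves are hypothetical
names chosen for illustration, and only the mask {3,0,1,2,7,4,5,6} is taken
from the dump above. The sketch shows that a 32B permute whose indices never
leave their 16B half gives the same result as two independent 16B permutes.
It makes no claim about the code GCC emits for either form.

```c
#include <string.h>

/* GCC generic vector types: 4 x int (16 bytes) and 8 x int (32 bytes). */
typedef int v4si __attribute__((vector_size(16)));
typedef int v8si __attribute__((vector_size(32)));

/* Hypothetical helper: returns 1 if every index of an 8-element mask
   selects from the same 4-element (128-bit) half it is written to. */
int mask_stays_in_halves(const int mask[8])
{
    for (int i = 0; i < 8; i++)
        if ((mask[i] & 7) / 4 != i / 4)
            return 0;
    return 1;
}

/* One wide 32B permute with the mask from the tree dump. */
void rotate_wide(const int *in, int *out)
{
    const v8si mask = {3, 0, 1, 2, 7, 4, 5, 6};
    v8si x, r;
    memcpy(&x, in, sizeof x);
    r = __builtin_shuffle(x, mask);   /* r[i] = x[mask[i]] */
    memcpy(out, &r, sizeof r);
}

/* The same permutation written as two independent 16B permutes,
   one per half; the per-half mask is the wide mask modulo 4. */
void rotate_split(const int *in, int *out)
{
    const v4si mask = {3, 0, 1, 2};
    v4si lo, hi;
    memcpy(&lo, in, sizeof lo);
    memcpy(&hi, in + 4, sizeof hi);
    lo = __builtin_shuffle(lo, mask);
    hi = __builtin_shuffle(hi, mask);
    memcpy(out, &lo, sizeof lo);
    memcpy(out + 4, &hi, sizeof hi);
}
```

The pointer-based interface is only there to avoid passing 32B vectors by
value on a target without AVX. __builtin_shuffle is GCC-specific.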
