https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107563
--- Comment #15 from Levy Hsu <admin at levyhsu dot com> ---
Tree (lower/tree) dump: https://godbolt.org/z/o7GrvjMqq

slow_rotate still contains a single wide
VEC_PERM_EXPR <_1, _1, {3,0,1,2,7,4,5,6}>
while fast_rotate is already expressed as element extracts + vector
constructor.

RTL (expand) dump: https://godbolt.org/z/WT9cqbx7h

fast_rotate expands to two 128-bit vec_select:V4SI shuffles (one per 16B
half), which is the expected shape to select pshufd on an SSE2 baseline.
In contrast, slow_rotate expands to scalar loads/stores (no vector
perm/select remains), so the backend never sees a permute it can map to
pshufd.

So this looks like a generic vector-lowering / tree -> RTL expansion gap for
non-native (32B) VEC_PERM_EXPR on SSE2 targets: masks that do not cross the
128-bit boundary should be decomposed into two 16B perms, but currently fall
back to scalarization.
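
To make the "does not cross the 128-bit boundary" condition concrete, here is a
minimal standalone sketch in plain C. It is not GCC code: the names
split_lane_local_mask and pshufd_imm8 are hypothetical, and it assumes a
single-input permute of 8 x 32-bit elements (both VEC_PERM_EXPR operands are
the same vector, as in the dump above). It checks whether a mask is lane-local
and, if so, produces the two 4-element masks that a per-half shuffle would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed shape: 8 x 32-bit elements, 4 elements per 128-bit lane. */
enum { NELTS = 8, LANE = 4 };

/* Hypothetical helper, not a GCC internal.
   Returns true if every selected element comes from the same 128-bit lane
   as its destination position. On success, lo[] and hi[] hold lane-local
   selectors (0..3) for the low and high half. On failure their contents
   are unspecified. */
static bool split_lane_local_mask(const unsigned mask[NELTS],
                                  unsigned lo[LANE], unsigned hi[LANE])
{
    for (size_t i = 0; i < NELTS; i++) {
        /* Both inputs are the same vector, so indices wrap modulo NELTS. */
        unsigned sel = mask[i] % NELTS;
        if (sel / LANE != i / LANE)
            return false;               /* crosses the 128-bit boundary */
        if (i < LANE)
            lo[i] = sel % LANE;
        else
            hi[i - LANE] = sel % LANE;
    }
    return true;
}

/* Packs four 2-bit selectors into the imm8 layout used by pshufd:
   bits [1:0] select element 0, [3:2] element 1, and so on. */
static unsigned pshufd_imm8(const unsigned sel[LANE])
{
    unsigned imm = 0;
    for (size_t i = 0; i < LANE; i++) {
        assert(sel[i] < LANE);
        imm |= sel[i] << (2 * i);
    }
    return imm;
}
```

For the mask {3,0,1,2,7,4,5,6} both halves reduce to the lane-local mask
{3,0,1,2}, so under these assumptions each 16B half could use the same pshufd
immediate. A mask such as {4,0,1,2,3,5,6,7} is rejected because element 0
would have to come from the other half.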
