[Bug target/107563] __builtin_shufflevector fails to pshufd instructions under default x86_64 compilation toggle which is the sse2 one

admin at levyhsu dot com via Gcc-bugs Tue, 13 Jan 2026 14:54:32 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107563


--- Comment #15 from Levy Hsu <admin at levyhsu dot com> ---
Tree (lower/tree) dump:
https://godbolt.org/z/o7GrvjMqq
slow_rotate still contains a single wide
VEC_PERM_EXPR <_1, _1, {3,0,1,2,7,4,5,6}>
while fast_rotate is already expressed as element extracts + vector
constructor.

RTL (expand) dump:
https://godbolt.org/z/WT9cqbx7h
fast_rotate expands to two 128-bit vec_select:V4SI shuffles (one per 16B half),
which is the expected shape to select pshufd on an SSE2 baseline. In contrast,
slow_rotate expands to scalar loads/stores (no vector perm/select remains), so
the backend never sees a permute it can map to pshufd.

So this looks like a gap in generic vector lowering / tree -> RTL expansion for
non-native (32B) VEC_PERM_EXPR on SSE2 targets: a mask that does not cross the
128-bit boundary should be decomposed into two 16B perms, but it currently
falls back to scalarization.
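
To make the proposed decomposition concrete, here is a minimal standalone
sketch of the lane check and split. It is not GCC code: split_perm_by_lane and
pshufd_imm8 are hypothetical helper names made up for illustration, and the
sketch assumes a one-input permute (both operands are _1, so indices are
reduced modulo the element count).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: try to split a one-input
   permutation mask over NELT elements into independent per-lane masks
   of LANE elements each.  Returns false if any output element would
   read from a lane other than the one it is written to.  */
static bool
split_perm_by_lane (const unsigned *mask, size_t nelt, size_t lane,
                    unsigned *lane_mask)
{
  for (size_t i = 0; i < nelt; i++)
    {
      /* Both VEC_PERM_EXPR operands are the same vector, so an index
         into the second operand aliases the first.  */
      unsigned src = mask[i] % nelt;
      if (src / lane != i / lane)
        return false;              /* crosses the lane boundary */
      lane_mask[i] = src % lane;   /* index relative to its own lane */
    }
  return true;
}

/* Hypothetical helper: encode a 4-element lane mask as a pshufd imm8.
   Bits [2*i+1 : 2*i] select the source dword for destination dword i.  */
static unsigned
pshufd_imm8 (const unsigned *lane_mask)
{
  unsigned imm = 0;
  for (size_t i = 0; i < 4; i++)
    imm |= (lane_mask[i] & 3u) << (2 * i);
  return imm;
}
```

For the mask {3,0,1,2,7,4,5,6} with nelt = 8 and lane = 4, the split succeeds
and both halves come out as {3,0,1,2}, which encodes as imm8 0x93 for each of
the two pshufd instructions.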
