https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
This delays some checks to eventually support part of the BB vectorization
which is what succeeds here.  I suspect that w/o vectorization we manage
to elide the tmp[] array but with the part vectorization that occurs we
fail to do that.

On the cost side there would be a lot needed to make the vectorization
not profitable:

  Vector inside of basic block cost: 8
  Vector prologue cost: 36
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

the thing to double-check is

0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue

that is the cost of the V16QI construct

 _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
_576, _125, _143, _161, _179};

maybe you can extract a testcase for the function as well?
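
To make the cost arithmetic above concrete, here is a minimal sketch in C. It assumes the profitability decision reduces to comparing the sum of the three vector costs against the scalar cost. The struct and function names are hypothetical and are not GCC internals; only the four numbers come from the dump.

```c
#include <stdbool.h>

/* Hypothetical container for the four numbers printed in the dump. */
struct bb_cost {
    int vec_inside;    /* "Vector inside of basic block cost" */
    int vec_prologue;  /* "Vector prologue cost" */
    int vec_epilogue;  /* "Vector epilogue cost" */
    int scalar;        /* "Scalar cost of basic block" */
};

/* Sum of all vector-side costs. */
static int bb_vector_total(const struct bb_cost *c)
{
    return c->vec_inside + c->vec_prologue + c->vec_epilogue;
}

/* Assumed rule: vectorize only when the vector total is strictly
   below the scalar cost. */
static bool bb_vectorization_profitable(const struct bb_cost *c)
{
    return bb_vector_total(c) < c->scalar;
}

/* How much more the vector side would have to cost before the
   assumed rule rejects the block. */
static int bb_cost_headroom(const struct bb_cost *c)
{
    return c->scalar - bb_vector_total(c);
}
```

With the values from the dump, { 8, 36, 0, 64 }, the vector total is 44 against a scalar cost of 64, so the headroom is 20. Under this assumed rule, the vec_construct cost of 17 alone would have to more than double (to at least 37) before the block stopped looking profitable.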