https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
This delays some checks so that part of a basic block can eventually be
vectorized (partial BB vectorization), which is what succeeds here.  I suspect
that without vectorization we manage to elide the tmp[] array, but with the
partial vectorization that occurs here we fail to do that.

On the cost side, a lot would need to change to make the vectorization
unprofitable:

  Vector inside of basic block cost: 8
  Vector prologue cost: 36
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

the thing to double-check is

0x123b1ff0 <unknown> 1 times vec_construct costs 17 in prologue

that is the cost of the V16QI construct

  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565, _576, _125, _143, _161, _179};

maybe you can extract a testcase for the function as well?
