https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we vectorized the bswap part after the loop before but now we connect it
to the reduction performed in the loop and catching the bswap at the top.  But
the inner loop remains unvectorized apart from the XOR with the reduction value
so we have to decompose/compose the reduction value.

What's special here is that the "unprofitable" part of the vectorization is
in a deeper loop than the "profitable" part but the profitable part outweighs
the unprofitable one.  Thus we miss weighting the costs like we for example
do (very crudely) in outer loop vectorization.

For the scalar inner loop we have 48

0x25e59f0 _39 ^ _42 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _48 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _54 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _60 1 times scalar_stmt costs 4 in body
0x25e59f0 _43 + R0_204 1 times scalar_stmt costs 4 in body
0x25e59f0 _49 + R1_206 1 times scalar_stmt costs 4 in body
0x25e59f0 _55 + R2_208 1 times scalar_stmt costs 4 in body
0x25e59f0 _61 + R3_210 1 times scalar_stmt costs 4 in body
0x25e59f0 R0_204 = PHI <_44(18), _117(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R1_206 = PHI <_50(18), _121(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R2_208 = PHI <_56(18), _125(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R3_210 = PHI <_62(18), _129(6)> 1 times scalar_stmt costs 4 in body

and the vector part 116

0x2728470 _7 ^ _10 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vector_stmt costs 4 in body
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 L0_203 = PHI <_12(18), _115(6)> 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _20 + L1_205 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _28 + L2_207 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _34 + L3_209 1 times vec_to_scalar costs 4 in epilogue

for outer loop vect we scale the inner loop cost by 50.

We could also have done better in vectorizing.  We've detected

build/include/botan/mem_ops.h:148:15: note: node 0x297eff8 (max_nunits=8, refcnt=2)
build/include/botan/mem_ops.h:148:15: note: op template: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 2 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 3 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 4 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 5 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 6 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 7 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     load permutation { 0 1 0 1 0 1 0 1 }

but decided

build/include/botan/mem_ops.h:148:15: note: ==> examining statement: _10 = *_9;
build/include/botan/mem_ops.h:148:15: missed: BB vectorization with gaps at the end of a load is not supported
src/lib/block/xtea/xtea.cpp:32:55: missed: not vectorized: relevant stmt not supported: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note: Building vector operands of 0x297eff8 from scalars instead

I think we have a duplicate for this somewhere.  The desired vectorization
would be a HImode load and splat.

We also miss to represent the bswap nodes as VEC_PERM ones and to optimize
them away entirely.  And we fail to elide the bswap even though we vectorize
it!?
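
To make the cost comparison from the comment easier to follow, here is a small sketch that re-adds the dump entries for the scalar inner loop and the vector part. It only reproduces the arithmetic of the dump; it is not how GCC computes or stores these costs. The names (SCALAR_INNER_LOOP, VECTOR_PART, total_cost, cost_by_location, weighted_penalty) are made up for illustration. The factor 50 is the one the comment mentions for outer loop vectorization; applying it uniformly to this whole part is an assumption made purely to show how loop-depth weighting would amplify the difference.

```python
# Entries copied from the cost dump: (count, kind, cost, where).
# 12 scalar statements (4 XORs, 4 adds, 4 PHIs), each costing 4 in the body.
SCALAR_INNER_LOOP = [(1, "scalar_stmt", 4, "body")] * 12

# Vector part: 3 vector statements in the body, 2 vec_construct in the
# prologue and 4 vec_to_scalar in the epilogue.
VECTOR_PART = (
    [(1, "vector_stmt", 4, "body")] * 3
    + [(1, "vec_construct", 44, "prologue")] * 2
    + [(1, "vec_to_scalar", 4, "epilogue")] * 4
)


def total_cost(entries):
    """Sum count * cost over all dump entries."""
    return sum(count * cost for count, _kind, cost, _where in entries)


def cost_by_location(entries):
    """Group the summed cost by where it is charged (body/prologue/epilogue)."""
    result = {}
    for count, _kind, cost, where in entries:
        result[where] = result.get(where, 0) + count * cost
    return result


def weighted_penalty(scalar, vector, weight):
    """Extra cost of the vector version if the whole part is scaled by `weight`.

    Assumption for illustration only: the weight is applied uniformly to
    every entry, which is cruder than what a real cost model would do.
    """
    return weight * (total_cost(vector) - total_cost(scalar))


if __name__ == "__main__":
    print(total_cost(SCALAR_INNER_LOOP))      # 48
    print(total_cost(VECTOR_PART))            # 116
    print(cost_by_location(VECTOR_PART))      # {'body': 12, 'prologue': 88, 'epilogue': 16}
    print(weighted_penalty(SCALAR_INNER_LOOP, VECTOR_PART, 50))  # 3400
```

Unweighted, the vector part costs 68 more than the scalar inner loop (116 vs. 48), nearly all of it from the two vec_construct entries. Under the assumed uniform weight of 50, that gap grows to 3400. This is the kind of effect that is missed when the deeper loop is not weighted.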