https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98855

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So previously we vectorized the bswap part after the loop, but now we connect
it to the reduction performed in the loop and catch the bswap at the top.
But the inner loop remains unvectorized apart from the XOR with the reduction
value, so we have to decompose/compose the reduction value.

What's special here is that the "unprofitable" part of the vectorization
is in a deeper loop than the "profitable" part but the profitable part
outweighs the unprofitable one.  Thus we miss weighting the costs the way
we do, for example (very crudely), in outer loop vectorization.

For the scalar inner loop we have a total cost of 48:

0x25e59f0 _39 ^ _42 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _48 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _54 1 times scalar_stmt costs 4 in body
0x25e59f0 _42 ^ _60 1 times scalar_stmt costs 4 in body
0x25e59f0 _43 + R0_204 1 times scalar_stmt costs 4 in body
0x25e59f0 _49 + R1_206 1 times scalar_stmt costs 4 in body
0x25e59f0 _55 + R2_208 1 times scalar_stmt costs 4 in body
0x25e59f0 _61 + R3_210 1 times scalar_stmt costs 4 in body
0x25e59f0 R0_204 = PHI <_44(18), _117(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R1_206 = PHI <_50(18), _121(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R2_208 = PHI <_56(18), _125(6)> 1 times scalar_stmt costs 4 in body
0x25e59f0 R3_210 = PHI <_62(18), _129(6)> 1 times scalar_stmt costs 4 in body

and for the vector part a total cost of 116:

0x2728470 _7 ^ _10 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vector_stmt costs 4 in body
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 <unknown> 1 times vec_construct costs 44 in prologue
0x2728470 L0_203 = PHI <_12(18), _115(6)> 1 times vector_stmt costs 4 in body
0x2728470 _11 + L0_203 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _20 + L1_205 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _28 + L2_207 1 times vec_to_scalar costs 4 in epilogue
0x2728470 _34 + L3_209 1 times vec_to_scalar costs 4 in epilogue

For outer loop vectorization we scale the inner loop cost by 50.
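
To make the arithmetic concrete, here is a minimal sketch that sums the
per-statement costs copied from the two dumps above and then applies a
weight to the inner-loop difference.  The names (scalar_costs,
vector_costs, total, weighted_inner_loss) are hypothetical and this is
not how GCC's cost model is implemented; reusing the outer loop
vectorization factor of 50 here is only an assumption for illustration.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Costs copied from the scalar dump: 12 scalar_stmt entries of cost 4.
constexpr std::array<int, 12> scalar_costs = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};

// Costs copied from the vector dump, in order: two vector_stmt,
// two vec_construct, one vector PHI, four vec_to_scalar.
constexpr std::array<int, 9> vector_costs = {4, 4, 44, 44, 4, 4, 4, 4, 4};

// Sum a list of per-statement costs.
template <std::size_t N>
int total(const std::array<int, N>& costs) {
    return std::accumulate(costs.begin(), costs.end(), 0);
}

// Hypothetical weighting: scale the inner-loop loss (vector minus scalar)
// by a factor that stands in for the inner loop's iteration count.
int weighted_inner_loss(int inner_loop_weight) {
    return (total(vector_costs) - total(scalar_costs)) * inner_loop_weight;
}
```

Unweighted, the vector variant is 68 units more expensive (116 vs. 48).
With a weight of 50 that loss would become 3400, so savings outside the
inner loop would have to be much larger before the whole region counts
as profitable.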

We could also have done better in vectorizing.  We've detected

build/include/botan/mem_ops.h:148:15: note:   node 0x297eff8 (max_nunits=8, refcnt=2)
build/include/botan/mem_ops.h:148:15: note:   op template: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 0 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 1 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 2 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 3 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 4 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 5 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     stmt 6 _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:     stmt 7 _42 = *_41;
build/include/botan/mem_ops.h:148:15: note:     load permutation { 0 1 0 1 0 1 0 1 }

but decided

build/include/botan/mem_ops.h:148:15: note:   ==> examining statement: _10 = *_9;
build/include/botan/mem_ops.h:148:15: missed:   BB vectorization with gaps at the end of a load is not supported
src/lib/block/xtea/xtea.cpp:32:55: missed:   not vectorized: relevant stmt not supported: _10 = *_9;
build/include/botan/mem_ops.h:148:15: note:   Building vector operands of 0x297eff8 from scalars instead

I think we have a duplicate for this somewhere.  The desired vectorization
would be a HImode load and splat.
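
As an illustration of why a single load and splat is enough, the sketch
below builds the eight lanes once element by element following the load
permutation { 0 1 0 1 0 1 0 1 }, and once from one 16-bit load that is
replicated four times.  It is plain scalar C++, not vector code, and it
assumes the two loaded elements are adjacent bytes (inferred from HImode
being 16 bits wide); the function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Lanes = std::array<std::uint8_t, 8>;

// Element-wise build following the load permutation { 0 1 0 1 0 1 0 1 }.
Lanes gather_by_permutation(const std::uint8_t* p) {
    constexpr std::array<int, 8> perm = {0, 1, 0, 1, 0, 1, 0, 1};
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = p[perm[i]];
    return out;
}

// One 16-bit load covering both bytes, replicated into four 16-bit lanes.
Lanes load16_and_splat(const std::uint8_t* p) {
    std::uint16_t pair;
    std::memcpy(&pair, p, sizeof pair);   // the single 16-bit load
    std::array<std::uint16_t, 4> wide;
    wide.fill(pair);                      // the splat
    Lanes out{};
    std::memcpy(out.data(), wide.data(), out.size());
    return out;
}
```

Both functions produce the same eight bytes for any two input bytes,
independent of endianness, because the 16-bit value is only copied and
never interpreted numerically.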

We also fail to represent the bswap nodes as VEC_PERM ones and to optimize
them away entirely.
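
The reason a permute representation helps can be sketched in scalar
C++: a 32-bit bswap is just the byte permutation { 3 2 1 0 }, and
applying that permutation twice gives the identity, so paired
permutations can cancel.  This only illustrates the idea and does not
show GCC's SLP data structures; all names below are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes4 = std::array<std::uint8_t, 4>;

// Byte selector equivalent to a 32-bit bswap.
constexpr std::array<int, 4> kReverse = {3, 2, 1, 0};

// Reference byte swap written with shifts and masks.
std::uint32_t bswap32(std::uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

// Generic byte permutation: out[i] = in[sel[i]].
Bytes4 permute(const Bytes4& in, const std::array<int, 4>& sel) {
    Bytes4 out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = in[sel[i]];
    return out;
}

// The same byte swap expressed as a permutation of the bytes in memory.
std::uint32_t bswap32_as_perm(std::uint32_t x) {
    Bytes4 b{};
    std::memcpy(b.data(), &x, b.size());
    b = permute(b, kReverse);
    std::memcpy(&x, b.data(), b.size());
    return x;
}
```

Once both operations are expressed as selectors, composing kReverse with
itself yields { 0 1 2 3 }, which is how two byte swaps in sequence could
be recognized as a no-op.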

And we fail to elide the bswap even though we vectorize it!?
