https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thank you! I considered this unprofitable for these reasons:

1. As you said, the code grows in size, but the speed benefit is not clear.

2. The transform converts load+add operations in a loop, and their final uses
outside of the loop. How does the costing work in this case, i.e. how are
changes for the more frequently executed instructions weighted against
changes for the instructions that will be executed once?

3. The scalar 'add reg, mem' instruction results in one micro-fused uop that is
handled as one uop during renaming (one of the narrowest points in the
pipeline). It is then issued on two execution units (for the load and for the
add).

4. On AMD, there are separate fp/simd pipes, so when the code is already
simd-heavy as in this example, STV offloads instructions from the integer pipes
to the possibly already-busy simd/fp pipes.
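
To make the question in point 2 concrete, here is a minimal sketch of what
frequency-weighted costing could look like. It is not how STV's cost model is
implemented; the InsnChange record, both functions, and all numbers are
hypothetical and chosen only to show that weighting can flip the sign of the
decision.

```python
from dataclasses import dataclass


@dataclass
class InsnChange:
    """Hypothetical record for one instruction affected by a transform."""
    cost_delta: int    # cost after minus cost before; negative is a gain
    exec_freq: float   # assumed executions per function invocation


def unweighted_delta(changes):
    """Sum the cost deltas as if every instruction ran exactly once."""
    return sum(c.cost_delta for c in changes)


def weighted_delta(changes):
    """Sum the cost deltas scaled by how often each instruction runs."""
    return sum(c.cost_delta * c.exec_freq for c in changes)


# Made-up numbers: an in-loop instruction that gets slightly worse and
# an after-loop instruction that gets noticeably better.
changes = [
    InsnChange(cost_delta=+1, exec_freq=1000.0),  # inside the loop
    InsnChange(cost_delta=-4, exec_freq=1.0),     # executed once
]

print(unweighted_delta(changes))  # -3: looks profitable
print(weighted_delta(changes))    # 996.0: unprofitable once weighted
```

With these assumed inputs, the unweighted sum favors the transform while the
weighted sum rejects it, which is the kind of discrepancy I am asking about.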

That said, the transformed portion is small relative to the inner loop of the
example, so benchmarking yesterday's trunk with/without -mno-stv on Zen 2, I
get:

27.26 bytes/cycle, 3.07 instructions/cycle

vs.

26.01 bytes/cycle, 2.97 instructions/cycle

So it's not the end of the world for this particular example, but I wanted to
raise the issue in case there's a costing problem in STV that needs correcting.
