https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229
--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thank you! I considered this unprofitable for these reasons:

1. As you said, the code grows in size, but the speed benefit is not clear.

2. The transform converts load+add operations in a loop, and their final uses outside of the loop. How does the costing work in this case, i.e. how are changes for the more frequently executed instructions weighted against changes for the instructions that will be executed once?

3. The scalar 'add reg, mem' instruction results in one micro-fused uop that is handled as one uop during renaming (one of the narrowest points in the pipeline). It is then issued on two execution units (for the load and for the add).

4. On AMD, there are separate fp/simd pipes, so when the code is already simd-heavy as in this example, STV offloads instructions from the integer pipes to the possibly already-busy simd/fp pipes.

That said, the transformed portion is small relative to the inner loop of the example, so benchmarking yesterday's trunk with/without -mno-stv on Zen 2, I get:

27.26 bytes/cycle, 3.07 instructions/cycle
vs.
26.01 bytes/cycle, 2.97 instructions/cycle

So it's not the end of the world for this particular example, but I wanted to raise the issue in case there's a costing problem in STV that needs correcting.
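
To make the question in point 2 concrete, here is a minimal sketch of what weighting by execution frequency could look like. It is not GCC's STV costing code: the struct, the function names, and all cost and frequency numbers are hypothetical and chosen only for illustration. The chain models one instruction inside a loop that gets slightly more expensive after conversion, plus two instructions outside the loop that get cheaper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one instruction in a conversion candidate.
   None of these fields correspond to actual GCC data structures. */
struct insn_cost {
    double scalar_cost;  /* assumed cost of the scalar form */
    double vector_cost;  /* assumed cost of the converted (vector) form */
    double frequency;    /* assumed executions per function invocation */
};

/* Gain when every instruction counts once, regardless of where it sits.
   A positive result means the conversion looks profitable. */
static double unweighted_gain(const struct insn_cost *insns, size_t n)
{
    assert(insns != NULL || n == 0);
    double gain = 0.0;
    for (size_t i = 0; i < n; i++)
        gain += insns[i].scalar_cost - insns[i].vector_cost;
    return gain;
}

/* Gain when each instruction's cost delta is scaled by how often it runs. */
static double weighted_gain(const struct insn_cost *insns, size_t n)
{
    assert(insns != NULL || n == 0);
    double gain = 0.0;
    for (size_t i = 0; i < n; i++)
        gain += insns[i].frequency
                * (insns[i].scalar_cost - insns[i].vector_cost);
    return gain;
}

/* Made-up example chain: one in-loop load+add, two uses after the loop. */
static const struct insn_cost example_chain[] = {
    { 1.0, 2.0, 100.0 },  /* in loop: costs 1 more per execution */
    { 3.0, 1.0,   1.0 },  /* after loop: saves 2, executed once */
    { 3.0, 1.0,   1.0 },  /* after loop: saves 2, executed once */
};
```

With these made-up numbers, unweighted_gain(example_chain, 3) is 3 (the conversion looks profitable), while weighted_gain(example_chain, 3) is -96 (it looks clearly unprofitable). A model that ignores frequency can therefore accept a chain whose only loss is in the hot loop.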