https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122746
--- Comment #3 from vekumar at gcc dot gnu.org ---
I used -O3 -march=znver4 and saw extracts.
https://godbolt.org/z/5ajq8rjTr
with -O3 -march=znver5
https://godbolt.org/z/Exd3bT3T7

Current trunk code is still bad.

GCC 15
Chain 1 (xmm1 - array a): Load → Add → Load → Add → Load → Add ...
Chain 2 (xmm0 - array b): Load → Add → Load → Add → Load → Add ...

GCC 16
Single Chain (xmm0): Build → Add → Build → Add → Build → Add ...

Remainder loop is masked in GCC 16 and has costly permutes.
