https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---

trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen:
    257
    256
    256
Average: 256 Iterations Per Minute

GCC13 -O3 -flto -march=native -fopenmp
    257
    256
    256
Average: 256 Iterations Per Minute

clang17 O3 -flto -march=native -fopenmp
Operation: Sharpen:
    257
    256
    256
Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

the internal loop is:

  0.00 │460:┌─→movzbl       0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss       (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
 20.22 │    │  movzbl       0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
 18.76 │    │  movzbl       (%rdx,%rax,4),%esi
  0.00 │    │  inc          %rax
  0.72 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp          %rax,%r13
  0.35 │    └──jne          460

so it still does not get....
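
For anyone following the disassembly: read back as C, the loop quoted above appears to be a scalar weighted sum over three byte channels of a 4-byte element. The sketch below is only a reading of those thirteen instructions, not the benchmark's source. The names `acc3`, `weighted_sum`, `pixels` and `weights` are hypothetical, and starting the accumulators at zero is an assumption, since the quoted loop does not show their initial values.

```c
#include <stddef.h>

/* Hypothetical type and names: a reading of the quoted disassembly,
   not code taken from the benchmark. */
typedef struct { float c0, c1, c2; } acc3;

/* pixels:  4 bytes per element; bytes 0, 1 and 2 are read, byte 3 is skipped
            (matches the 0x0/0x1/0x2(%rdx,%rax,4) loads).
   weights: one float per element (matches the (%r8,%rax,4) load). */
static acc3 weighted_sum(const unsigned char *pixels,
                         const float *weights, size_t n)
{
    /* Assumption: accumulators start at zero; the loop body alone
       does not show how %xmm3..%xmm5 were initialised. */
    acc3 s = {0.0f, 0.0f, 0.0f};

    for (size_t i = 0; i < n; i++) {
        float w = weights[i];                  /* vmovss                    */
        s.c2 += w * (float)pixels[4 * i + 2];  /* movzbl, vcvtsi2ss, vfmadd */
        s.c1 += w * (float)pixels[4 * i + 1];
        s.c0 += w * (float)pixels[4 * i + 0];
    }
    return s;
}
```

Each iteration does three byte loads, three int-to-float conversions and three scalar FMAs, which matches the instructions where the samples in the profile are attributed.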