https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
trunk -O3 -flto -march=native -fopenmp
    Operation: Sharpen:
        257
        256
        256

    Average: 256 Iterations Per Minute
GCC13 -O3 -flto -march=native -fopenmp
        257
        256
        256

    Average: 256 Iterations Per Minute
clang17 -O3 -flto -march=native -fopenmp
    Operation: Sharpen:
        257
        256
        256

    Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference.

The inner loop is:
  0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 18.76 │    │  movzbl      (%rdx,%rax,4),%esi
  0.00 │    │  inc         %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp         %rax,%r13
  0.35 │    └──jne         460
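
For readers who do not want to decode the listing: read purely from the assembly, the loop appears to do roughly what the C sketch below does. This is a reconstruction, not the benchmark's actual source. The names pixel4, accumulate_row, px, k and acc are made up for illustration, and the 4-byte pixel layout is an assumption inferred from the (%rdx,%rax,4) addressing.

```c
#include <stddef.h>

/* Hypothetical 4-byte pixel; only bytes 0..2 are read by the loop. */
typedef struct {
    unsigned char c0, c1, c2, c3;
} pixel4;

/*
 * Scalar equivalent of the loop at label 460 (a sketch, names assumed):
 *   px  ~ %rdx  base of the 4-byte pixels
 *   k   ~ %r8   base of the float weights
 *   n   ~ %r13  trip count
 *   acc ~ %xmm5, %xmm4, %xmm3 for byte offsets 0, 1, 2
 */
static void accumulate_row(const pixel4 *px, const float *k, size_t n,
                           float acc[3])
{
    for (size_t i = 0; i < n; i++) {
        float w = k[i];                 /* vmovss (%r8,%rax,4),%xmm2      */
        acc[2] += w * (float)px[i].c2;  /* movzbl 0x2 + cvt + vfmadd231ss */
        acc[1] += w * (float)px[i].c1;  /* movzbl 0x1 + cvt + vfmadd231ss */
        acc[0] += w * (float)px[i].c0;  /* movzbl 0x0 + cvt + vfmadd231ss */
    }
}
```

Each iteration handles one pixel with three scalar byte loads, three int-to-float conversions and three scalar FMAs. Each of the three accumulators is carried from one iteration to the next.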

so it still does not get....
