[Bug libfortran/78379] Processor-specific versions for matmul

jvdelisle at gcc dot gnu.org Thu, 17 Nov 2016 19:22:10 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78379


--- Comment #8 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #6)
> > You may notice I was invoking the wrong executable in what I posted in
> > comment #3. I did rerun the correct one several times and tried it with
> > -mavx -mprefer-avx128. I get the same poor results regardless.
> 
> Several things could go wrong here...
> 
> If you run the benchmark under gdb and break, then type
> "disassemble $pc,$pc+200", do you actually end up in the right
> program part (the one with AVX instructions)?

452                                   f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
(gdb) disassemble $pc,$pc+200
Dump of assembler code from 0x7ffff7af3554 to 0x7ffff7af361c:
=> 0x00007ffff7af3554 <aux_matmul_r8+5220>:     vaddpd %ymm12,%ymm4,%ymm4
   0x00007ffff7af3559 <aux_matmul_r8+5225>:     vmulpd %ymm10,%ymm15,%ymm12
   0x00007ffff7af355e <aux_matmul_r8+5230>:     vaddpd %ymm11,%ymm5,%ymm5
   0x00007ffff7af3563 <aux_matmul_r8+5235>:     vmulpd %ymm14,%ymm15,%ymm15
   0x00007ffff7af3568 <aux_matmul_r8+5240>:     vmulpd %ymm10,%ymm13,%ymm10
   0x00007ffff7af356d <aux_matmul_r8+5245>:     vaddpd %ymm12,%ymm6,%ymm6
   0x00007ffff7af3572 <aux_matmul_r8+5250>:     vmulpd %ymm14,%ymm13,%ymm14
   0x00007ffff7af3577 <aux_matmul_r8+5255>:     vaddpd %ymm15,%ymm8,%ymm8
   0x00007ffff7af357c <aux_matmul_r8+5260>:     vaddpd %ymm10,%ymm7,%ymm7
   0x00007ffff7af3581 <aux_matmul_r8+5265>:     vaddpd %ymm14,%ymm9,%ymm9
   0x00007ffff7af3586 <aux_matmul_r8+5270>:     ja     0x7ffff7af3433 <aux_matmul_r8+4931>
   0x00007ffff7af358c <aux_matmul_r8+5276>:     mov    -0x801f8(%rbp),%rdx
   0x00007ffff7af3593 <aux_matmul_r8+5283>:     vhaddpd %ymm9,%ymm9,%ymm13
   0x00007ffff7af3598 <aux_matmul_r8+5288>:     vhaddpd %ymm8,%ymm8,%ymm15
   0x00007ffff7af359d <aux_matmul_r8+5293>:     vhaddpd %ymm7,%ymm7,%ymm7
   0x00007ffff7af35a1 <aux_matmul_r8+5297>:     vperm2f128 $0x1,%ymm13,%ymm13,%ymm11
   0x00007ffff7af35a7 <aux_matmul_r8+5303>:     vhaddpd %ymm5,%ymm5,%ymm5
   0x00007ffff7af35ab <aux_matmul_r8+5307>:     vperm2f128 $0x1,%ymm15,%ymm15,%ymm8
   0x00007ffff7af35b1 <aux_matmul_r8+5313>:     vaddpd %ymm11,%ymm13,%ymm12
   0x00007ffff7af35b6 <aux_matmul_r8+5318>:     vperm2f128 $0x1,%ymm7,%ymm7,%ymm13
   0x00007ffff7af35bc <aux_matmul_r8+5324>:     vaddpd %ymm8,%ymm15,%ymm14
   0x00007ffff7af35c1 <aux_matmul_r8+5329>:     vhaddpd %ymm6,%ymm6,%ymm6
   0x00007ffff7af35c5 <aux_matmul_r8+5333>:     vaddsd -0x80068(%rbp),%xmm12,%xmm10
   0x00007ffff7af35cd <aux_matmul_r8+5341>:     vaddsd -0x80070(%rbp),%xmm14,%xmm9
   0x00007ffff7af35d5 <aux_matmul_r8+5349>:     vperm2f128 $0x1,%ymm5,%ymm5,%ymm14
   0x00007ffff7af35db <aux_matmul_r8+5355>:     vhaddpd %ymm4,%ymm4,%ymm4
   0x00007ffff7af35df <aux_matmul_r8+5359>:     vaddpd %ymm13,%ymm7,%ymm11
   0x00007ffff7af35e4 <aux_matmul_r8+5364>:     vmovsd %xmm10,-0x80068(%rbp)
   0x00007ffff7af35ec <aux_matmul_r8+5372>:     vperm2f128 $0x1,%ymm6,%ymm6,%ymm10
   0x00007ffff7af35f2 <aux_matmul_r8+5378>:     vperm2f128 $0x1,%ymm4,%ymm4,%ymm13
   0x00007ffff7af35f8 <aux_matmul_r8+5384>:     vmovsd %xmm9,-0x80070(%rbp)
   0x00007ffff7af3600 <aux_matmul_r8+5392>:     vaddpd %ymm14,%ymm5,%ymm9
   0x00007ffff7af3605 <aux_matmul_r8+5397>:     vhaddpd %ymm0,%ymm0,%ymm0
   0x00007ffff7af3609 <aux_matmul_r8+5401>:     vaddsd -0x80058(%rbp),%xmm11,%xmm12
   0x00007ffff7af3611 <aux_matmul_r8+5409>:     vaddpd %ymm10,%ymm6,%ymm15
   0x00007ffff7af3616 <aux_matmul_r8+5414>:     vaddpd %ymm13,%ymm4,%ymm11
   0x00007ffff7af361b <aux_matmul_r8+5419>:     vperm2f128 $0x1,%ymm0,%ymm0,%ymm13
End of assembler dump.



> 
> Or does your machine prefer AVX128?
> 
> To find out, what are the timings for inline code using
> 
> -mavx -Ofast
> 
> -mavx -mprefer-avx128 -Ofast
> 
> ?
$ gfc  -finline-matmul-limit=64 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      4.933      0.045      0.086      0.144
    4  2000      1.418      0.225      0.271      0.347
    8  2000      2.168      0.616      1.296      1.830
   16  2000      5.330      2.824      1.784      2.907
   32  2000      6.239      3.488      1.446      3.406
   64  2000      2.650      2.746      1.552      2.691

$ gfc  -finline-matmul-limit=64 -mavx -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      6.934      0.042      0.091      0.134
    4  2000      1.320      0.181      0.365      0.252
    8  2000      1.007      0.446      1.595      0.982
   16  2000      0.581      1.163      2.411      1.180
   32  2000      1.346      1.276      2.061      1.277
   64  2000      1.397      1.327      2.288      1.328

$ gfc  -finline-matmul-limit=64 -mavx -mprefer-avx128 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      5.021      0.045      0.088      0.139
    4  2000      1.607      0.202      0.288      0.341
    8  2000      2.482      0.575      0.743      1.861
   16  2000      5.674      2.804      1.809      2.792
   32  2000      6.323      3.460      1.478      3.293
   64  2000      2.714      2.832      1.582      2.694

If I put -mavx -mprefer-avx128 in the Makefile.am, I get results as good as or
better than without your patch. I also see none of the HAVE_AVX macros defined
in config.

$ gfc  -finline-matmul-limit=0 -Ofast compare.f90
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.043      0.041      0.034      0.043
    4  2000      0.272      0.234      0.223      0.256
    8  2000      0.835      1.687      1.627      1.709
   16  2000      2.886      2.887      2.859      2.869
   32  2000      4.733      3.494      4.755      4.652
   64  2000      6.933      2.837      6.933      6.877
  128  2000      7.949      3.285      8.705      7.914
  256   477     10.040      3.447      9.999      9.951
  512    59      8.885      2.341      8.923      8.940
 1024     7      8.937      1.367      8.978      8.991
 2048     1      8.799      1.672      8.831      8.854

The following in config.h.in for what it is worth:

/* Define if AVX instructions can be compiled. */
#undef HAVE_AVX

/* Define if AVX2 instructions can be compiled. */
#undef HAVE_AVX2

/* Define if AVX512f instructions can be compiled. */
#undef HAVE_AVX512F
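
For context, an "#undef HAVE_AVX" line in config.h.in is only the autoheader
template; configure rewrites it to a #define when its check succeeds. Below is
a minimal sketch of how such a compile-time guard could be combined with a
run-time CPU check to pick a processor-specific matmul. It is not the code
from the patch. All names (matmul_fn, matmul_body, matmul_generic, matmul_avx,
select_matmul, SKETCH_HAVE_AVX) are hypothetical, and it assumes GCC or a
compatible compiler; on x86 it uses the target attribute and
__builtin_cpu_supports.

```c
#include <stddef.h>

/* Hypothetical names throughout; these are not libgfortran symbols. */
typedef void (*matmul_fn)(size_t n, const double *a, const double *b,
                          double *c);

/* C = A * B for column-major n x n matrices.  Forced inline so that each
   wrapper below gets its own copy, compiled for that wrapper's target. */
static inline __attribute__((always_inline)) void
matmul_body(size_t n, const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i + k * n] * b[k + j * n];
            c[i + j * n] = sum;
        }
}

/* Baseline version, built with whatever flags the file is compiled with. */
void matmul_generic(size_t n, const double *a, const double *b, double *c)
{
    matmul_body(n, a, b, c);
}

/* Assumption: this architecture test stands in for a configure-generated
   macro such as HAVE_AVX. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_AVX 1

/* Same loops, but the compiler may use AVX instructions in this function. */
__attribute__((target("avx")))
void matmul_avx(size_t n, const double *a, const double *b, double *c)
{
    matmul_body(n, a, b, c);
}
#endif

/* Pick an implementation at run time based on what the CPU reports. */
matmul_fn select_matmul(void)
{
#ifdef SKETCH_HAVE_AVX
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return matmul_avx;
#endif
    return matmul_generic;
}
```

A caller would do something like "matmul_fn f = select_matmul(); f(n, a, b, c);".
If the compile-time guard is never defined, as with the HAVE_AVX macros above,
only the generic path is built and the run-time check is never reached.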
