https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #25 from Janne Blomqvist <jb at gcc dot gnu.org> ---
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what its worth:
> > 
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
> > $ ./a.out 
> >  Time, MATMUL:    21.2483196       21.254449646000001     1.5055670945599979
> > 
> >  Time, dgemm:    33.2441711       33.243087289000002      .96260614189671445
> > 
> 
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
> 
> 64 x 64      2.76 Gflops
> 1000 x 1000  14.10
> 2000 x 2000  24.4

But that is not particularly impressive, is it? I don't know about current
low-end graphics adapters, but at least the high-end GPU cards (Tesla) are
capable of several Tflops. Of course, there is a non-trivial threshold matrix
size below which the data movement to/from the GPU is not amortized.
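
To make the threshold argument concrete, here is a rough back-of-the-envelope
model, not a measurement. It assumes an n x n matmul costs 2*n**3 flops, that
three matrices of 8-byte elements cross the bus (two in, one out), and that
both devices run at a fixed sustained rate. The function name and the GPU and
bus figures are hypothetical; only the 25 Gflops single-core figure comes from
this thread.

```python
def break_even_n(cpu_flops, gpu_flops, bus_bytes_per_s, elem_bytes=8):
    """Matrix size n above which offloading wins in this simple model.

    CPU time: 2*n**3 / cpu_flops
    GPU time: 2*n**3 / gpu_flops + 3*n**2*elem_bytes / bus_bytes_per_s
    """
    if gpu_flops <= cpu_flops:
        raise ValueError("no break-even point unless the GPU is faster")
    # Equate the two times and solve for n:
    #   2*n*(1/cpu - 1/gpu) = 3*elem_bytes / bus
    return 1.5 * elem_bytes / (
        bus_bytes_per_s * (1.0 / cpu_flops - 1.0 / gpu_flops)
    )


# Hypothetical inputs: 25 Gflops CPU core, 1 Tflops GPU, 6 GB/s bus.
n_star = break_even_n(cpu_flops=25e9, gpu_flops=1e12, bus_bytes_per_s=6e9)
print(f"break-even n in this model: {n_star:.1f}")
```

The model ignores kernel launch and transfer latency, so it underestimates the
real threshold.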

With the test program from comment #12, using OpenBLAS (which, BTW, should be
available in Fedora 22 as well), I get 337 Gflops, or 25 Gflops if I restrict
it to a single core with the OMP_NUM_THREADS=1 environment variable. This is on
a machine with 20 Ivy Bridge cores at 2.8 GHz.
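
For reference, a small sketch of the arithmetic behind these figures. It
assumes the usual convention of counting 2*n**3 floating-point operations for
an n x n matrix multiply (whether the test program counts the same way is an
assumption), and the helper names are made up for illustration.

```python
def gflops(n, seconds):
    """Gflops for an n x n matmul, counting 2*n**3 operations."""
    return 2.0 * n**3 / seconds / 1e9


def parallel_efficiency(multi_gflops, single_gflops, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return multi_gflops / (single_gflops * cores)


# Figures quoted above: 337 Gflops on 20 cores, 25 Gflops on one core.
eff = parallel_efficiency(337.0, 25.0, 20)
print(f"parallel efficiency: {eff:.1%}")
```

So by these numbers the 20-core run reaches roughly two thirds of ideal linear
scaling from the single-core rate.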

I'm not per se against using GPUs, but I think there's a lot of low-hanging
fruit to be had just by making it easier for users to use a high-performance
BLAS implementation.
