https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #25 from Janne Blomqvist <jb at gcc dot gnu.org> ---
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what it's worth:
> >
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native
> > $ ./a.out
> > Time, MATMUL: 21.2483196 21.254449646000001 1.5055670945599979
> >
> > Time, dgemm: 33.2441711 33.243087289000002 .96260614189671445
>
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
>
>   64 x   64     2.76 Gflops
> 1000 x 1000    14.10
> 2000 x 2000    24.4

But that is not particularly impressive, is it? I don't know about current
low-end graphics adapters, but at least the high-end GPU cards (Tesla) are
capable of several Tflops. Of course, there is a non-trivial threshold size
to amortize the data movement to/from the GPU.

With the test program from comment #12, with OpenBLAS (which, BTW, should be
available in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I
restrict it to a single core with the OMP_NUM_THREADS=1 environment variable.
This is on a machine with 20 2.8 GHz Ivy Bridge cores.

I'm not per se against using GPUs, but I think there's a lot of low-hanging
fruit to be had just by making it easier for users to use a high-performance
BLAS implementation.
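For readers following along: a minimal sketch of how Gflop/s figures like those quoted above can be measured. This is not the test program from comment #12; it assumes NumPy is available and linked against an optimized BLAS (such as OpenBLAS), and uses the standard estimate that a double-precision N x N matrix multiply costs about 2*N^3 floating-point operations.

```python
# Hedged sketch: measure achieved Gflop/s of a BLAS-backed matrix multiply.
# Assumes NumPy is installed; "@" on 2-D float64 arrays dispatches to dgemm
# in whatever BLAS NumPy was built against (OpenBLAS, MKL, reference BLAS).
import time
import numpy as np

def matmul_gflops(n):
    """Time one n x n double-precision matmul and return achieved Gflop/s."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.perf_counter()
    c = a @ b                       # BLAS dgemm under the hood
    elapsed = time.perf_counter() - t0
    assert c.shape == (n, n)
    # ~2*n^3 flops for an n x n * n x n multiply (n^3 mults + n^3 adds)
    return 2.0 * n**3 / elapsed / 1e9

if __name__ == "__main__":
    for n in (64, 1000, 2000):
        print("%5d x %-5d  %8.2f Gflop/s" % (n, n, matmul_gflops(n)))
```

To reproduce the single-core figure mentioned above, run with OMP_NUM_THREADS=1 in the environment so a threaded BLAS like OpenBLAS is restricted to one core.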