https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #23 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #21)
> > Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> > seems even a tad more tricky. We have a paper on GPU (small) matrix
> > multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf
> > Quite interesting what can be done with GPUs...

Run-of-the-mill graphics processing units have many floating-point compute cores; 128 cores is not unusual, and usually there are a lot more. These cores perform basic operations such as a + b * c on scalars, along with other useful functions. Software such as OpenCL compiles compute kernels that run efficiently in parallel on these GPU architectures. clBLAS is a runtime library that encapsulates this capability behind a BLAS-compatible API. Conceptually, you initialize for particular matrices and hand off the work to the GPU. As an example, my low-end laptop (the 300-dollar variety) runs an n-body 3D model with several thousand masses without even pressing the CPU. MATMUL should be doable.

The main GPU competitors are Nvidia, AMD, and Intel. OpenCL is supported on all three.