https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #21 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication: http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

Quite interesting what can be done with GPUs...

> BTW, another interesting project is the libxsmm library, more aimed at
> small (<128) matrices; see https://github.com/hfp/libxsmm . Not sure if
> this info is useful in this context, but it might provide inspiration.

I assume that for small matrices bordering on the silly (say, a matrix
multiplication with dimensions of (1,2) and (2,1)) the inline code will be
faster, if the code is compiled with the right options, due to function
call overhead. I also assume that libxsmm will become faster quite soon
as the sizes grow. Do you have an idea where the crossover is?