https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119

--- Comment #21 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---

> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

Quite interesting what can be done with GPUs...

> . BTW, another interesting project is the libxsmm library more aimed at
> small (<128) matrices see : https://github.com/hfp/libxsmm . Not sure if
> this info is useful in this context, but it might provide inspiration.

I assume that for small matrices bordering on the silly
(say, a matrix multiplication with dimensions (1,2) and (2,1))
the inline code will be faster if the code is compiled with the
right options, simply because of function call overhead.  I also
assume that libxsmm will soon become faster for larger sizes.

Do you have an idea where the crossover is?
