On Thu, Nov 17, 2016 at 9:41 AM, Thomas Koenig <tkoe...@netcologne.de> wrote:
> On 17.11.2016 at 00:20, Jakub Jelinek wrote:
>>
>> On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
>>>>
>>>> Don't you need to test in configure if the assembler supports AVX?
>>>> Otherwise if somebody is bootstrapping gcc with an older assembler, it
>>>> will just fail to bootstrap.
>>>
>>>
>>> That's a good point. The AVX instructions were added in binutils 2.19,
>>> which was released in 2008. This could be put in the prerequisites.
>>>
>>> What should the test do? Fail with an error message "you need newer
>>> binutils" or simply (and silently) not compile the AVX version?
>>
>>
>> From what I understood, you want those functions just to be implementation
>> details, not exported from libgfortran.so*. Thus the test would do
>> something similar to what gcc/testsuite/lib/target-supports.exp
>> (check_effective_target_avx) does, but of course in an autoconf way,
>> not in tcl.
>
>
> OK, that looks straightforward enough. I'll give it a shot.
>
>> Also, from what I see, target_clones just uses IFUNCs, so you probably
>> also need some configure test whether ifuncs are supported (the
>> gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
>> similar again in configure). But if so, then I have no idea why you use
>> a wrapper around the function, instead of using it on the exported APIs.
>
>
> As you wrote above, I wanted this as an implementation detail. I also
> wanted the ability to add new instruction sets without breaking the ABI.
>
> Because the caller generates the ifunc, using a wrapper function seemed
> like the best way to do it. The overhead is negligible (the function
> is one simple jump), especially considering that we only call the
> library function for larger matrices.
>
>>>> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
>>>> or both avx and avx2 and maybe avx512f?
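The configure probe Jakub describes, mirroring check_effective_target_avx from target-supports.exp, could look roughly like the following autoconf fragment. This is only a sketch: the cache variable and HAVE_AVX macro names are illustrative, and the actual test instruction may differ from whatever eventually lands in libgfortran's configure.ac.

```
# Sketch: does the assembler accept an AVX instruction?
# Basic asm (no operand colons) passes the string through to gas
# unchanged, so this fails at assembly time on pre-AVX binutils.
AC_CACHE_CHECK([whether the assembler supports AVX],
               [libgfor_cv_have_avx], [
  AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[]], [[
     asm volatile ("vaddpd %ymm0,%ymm1,%ymm2");
  ]])],
  [libgfor_cv_have_avx=yes],
  [libgfor_cv_have_avx=no])])
if test "x$libgfor_cv_have_avx" = xyes; then
  AC_DEFINE([HAVE_AVX], [1],
            [Define if the assembler supports AVX instructions.])
fi
```

With that in place, the AVX code paths can be guarded by #ifdef HAVE_AVX so an old assembler silently falls back to the generic version instead of breaking bootstrap.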
>>>
>>>
>>> I did a vdiff of the disassembled code generated for avx and avx2, and
>>> (somewhat to my surprise) there was no difference. Maybe, with more
>>> unrolling, something more might have happened. I didn't check for
>>> AVX512f, but I can do that.
>>
>>
>> For the float/double code it wouldn't surprise me (assuming you don't
>> need gather insns and similar stuff). But for integers, most of the
>> avx instructions can generally only handle 128-bit vectors, while avx2
>> has 256-bit ones.
>
>
> You're right - integer multiplication looks different.
>
> Nobody I know cares about integer matrix multiplication
> speed, whereas real matrix multiplication has gotten a _lot_ of
> attention over the decades. So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason. However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.
>
> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.
>
> Since a lot of supercomputers use POWER nodes, that might also
> be attractive.
>
> Regards
>
>         Thomas
Hi,

In order to reduce bloat, might it make sense to make the core blocked
gemm algorithm that Jerry committed a few days ago into a separate
static function, and then only do the target_clones stuff for that one?

The rest of the matmul function deals with all kinds of stuff like
setup, handling non-stride-1 cases, calling the external gemm function
for -fexternal-blas, etc., none of which vectorizes anyway, so
generating different versions of this code using different vector
instructions looks like a waste.

In that case I guess one could add the avx2 variant as well, on the off
chance that somebody for some reason cares about integer matmul.

-- 
Janne Blomqvist