On Thu, Nov 17, 2016 at 9:41 AM, Thomas Koenig <tkoe...@netcologne.de> wrote:
> On 17.11.2016 at 00:20, Jakub Jelinek wrote:
>>
>> On Thu, Nov 17, 2016 at 12:03:18AM +0100, Thomas Koenig wrote:
>>>>
>>>> Don't you need to test in configure if the assembler supports AVX?
>>>> Otherwise if somebody is bootstrapping gcc with older assembler, it will
>>>> just fail to bootstrap.
>>>
>>>
>>> That's a good point.  The AVX instructions were added in binutils 2.19,
>>> which was released in 2011. This could be put in the prerequisites.
>>>
>>> What should the test do?  Fail with an error message "you need newer
>>> binutils" or simply (and silently) not compile the AVX vesion?
>>
>> From what I understood, you want those functions just to be
>> implementation details, not exported from libgfortran.so*.  Thus the
>> test would do something similar to what
>> gcc/testsuite/lib/target-supports.exp (check_effective_target_avx)
>> does, but of course in an autoconf way, not in tcl.
>
>
> OK, that looks straightforward enough. I'll give it a shot.
>
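
For the record, the configure check itself should only need to try
building a trivial AVX function with -mavx added to the flags, which is
more or less what check_effective_target_avx compiles in the testsuite.
Roughly this as the conftest program (just a sketch: the function name
is invented and the autoconf glue around it is omitted):

/* Needs -mavx to compile; with an assembler that is too old the
   vzeroall instruction then fails to assemble, so the configure
   test can report "no" and the AVX variants can be skipped.  */
void
conftest_avx (void)
{
  __builtin_ia32_vzeroall ();
}
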
>> Also, from what I see, target_clones just uses IFUNCs, so you probably
>> also need some configure test for whether ifuncs are supported (the
>> gcc.target/i386/mvc* tests use dg-require-ifunc, so you'd need something
>> similar again in configure).  But if so, then I have no idea why you use
>> a wrapper around the function, instead of using it on the exported APIs.
>
>
> As you wrote above, I wanted this as an implementation detail. I also
> wanted to be able to add new instruction sets without breaking
> the ABI.
>
> Because the caller generates the ifunc, using a wrapper function seemed
> like the best way to do it.  The overhead is negligible (the function
> is one simple jump), especially considering that we only call the
> library function for larger matrices.
>
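
To make sure I understand the approach, the shape would be roughly the
following (a sketch with invented names, the argument list reduced to
plain C pointers rather than the real array descriptors, and assuming
the result is zero-initialized): only the internal worker carries
target_clones, so the ifunc and its resolver stay an implementation
detail of the library rather than part of the exported ABI.

/* Internal worker: this is what gets cloned per ISA.  */
__attribute__ ((target_clones ("avx", "default")))
void
matmul_core (double *c, const double *a, const double *b, int n)
{
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Exported entry point: the only overhead is one call into the
   ifunc-dispatched clone, which is noise for the matrix sizes
   where the library version is used at all.  */
void
matmul_entry (double *c, const double *a, const double *b, int n)
{
  matmul_core (c, a, b, n);
}
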
>>>> For matmul_i*, wouldn't it make more sense to use avx2 instead of avx,
>>>> or both avx and avx2 and maybe avx512f?
>>>
>>>
>>> I did a vdiff of the disassembled code generated for avx and avx2, and
>>> (somewhat to my surprise) there was no difference.  Maybe, with more
>>> unrolling, something more might have happened. I didn't check for
>>> AVX512f, but I can do that.
>>
>>
>> For the float/double code it wouldn't surprise me (assuming you don't need
>> gather insns and similar stuff).  But for integers generally most of the
>> avx instructions can only handle 128-bit vectors, while avx2 has 256-bit
>> ones.
>
>
> You're right - integer multiplication looks different.
>
> Nobody I know cares about integer matrix multiplication
> speed, whereas real matrix multiplication has gotten a _lot_ of
> attention over the decades.  So, putting in AVX will make the code run
> faster on more machines, while putting in AVX2 will
> (IMHO) bloat the library for no good reason.  However,
> I am willing to stand corrected on this. Putting in AVX512f
> makes sense.
>
> I have also been trying to get target_clones to work on POWER
> to get Altivec instructions, but to no avail. I also cannot
> find any examples in the testsuite.
>
> Since a lot of supercomputers use POWER nodes, that might also
> be attractive.
>
> Regards
>
>         Thomas

Hi,

In order to reduce bloat, might it make sense to turn the core blocked
gemm algorithm that Jerry committed a few days ago into a separate
static function, and then only do the target_clones stuff for that one?
The rest of the matmul function deals with all kinds of things like
argument setup, handling non-stride-1 cases, calling the external gemm
function for -fexternal-blas, etc., none of which vectorizes anyway, so
generating different versions of that code with different vector
instructions looks like a waste.

In that case I guess one could add the avx2 variant as well on the odd
chance that somebody for some reason cares about integer matmul.
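
Concretely, the kind of split I have in mind would be roughly the
following (a rough sketch with invented names and a placeholder blocking
loop, not Jerry's actual algorithm; whether target_clones is happy with
a static kernel is something to check, so it is shown non-static here):

/* Hot blocked kernel: the only part worth cloning per ISA.  */
__attribute__ ((target_clones ("avx", "avx2", "avx512f", "default")))
void
gemm_blocked (double *c, const double *a, const double *b, int n)
{
  enum { BLOCK = 256 };
  for (int ii = 0; ii < n; ii += BLOCK)
    for (int kk = 0; kk < n; kk += BLOCK)
      for (int i = ii; i < ii + BLOCK && i < n; i++)
        for (int k = kk; k < kk + BLOCK && k < n; k++)
          for (int j = 0; j < n; j++)
            c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Compiled only once: argument setup, the non-stride-1 paths, the
   -fexternal-blas call, and so on, none of which vectorizes.  */
void
matmul_generic (double *c, const double *a, const double *b, int n)
{
  /* ... stride checks, external BLAS dispatch, small-size cases ... */
  gemm_blocked (c, a, b, n);
}
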

-- 
Janne Blomqvist
