I looked at information about vector instructions. One thing: it is
a mess, the available instructions are quite irregular.
Concerning availability, it seems that to have 256-bit operations
we need AVX2; more precisely, we need 'vpaddq', 'vpmullq' and
'vpminuq' (and 'vmovdqu', but that is probably available on all
machines that have the first 3). AFAICS, of the machines that I have
or can use, at least 4 do not have AVX; one of them is a newly bought
mini-PC. 2 other machines apparently have AVX, but lack AVX2.
You can try the C code below implementing 'vector_combination'.
When you direct clang or gcc to generate code for new high-end
processors and give them '-ftree-vectorize -O' (or '-O3'), they
generate 256-bit vector instructions. However, the generated code is
quite bulky and it is not clear to me how well it will perform on
vectors of moderate length, which are typical in our use. One reason
for the size is that vector operations need initial setup. Another is
that they have substantial latency, so the processor needs to have
several in the pipeline to attain good throughput. And then there is
a partial block at the start or at the end which needs separate
handling.
When compiling for older high-end processors or for low-end ones, the
compiler uses 128-bit instructions or scalar code.
BTW1: The code would be smaller and faster with properly aligned and
padded buffers, but IIUC neither C compilers nor sbcl-simd have good
support for this (the generated code uses unaligned loads, which are
slower than aligned ones).
BTW2: Our current code is equivalent to replacing the part after the
initialization of s1 with 'res[i] = s1 % p;'. My measurements
indicate that on a Core 2 this works essentially at the speed of the
division (IIUC all the other computations can be done in parallel
with the division). When using a C compiler this is likely to hold
also on newer processors, which have faster division.
BTW3: Agner Fog in his instruction tables claims that on Zen 1 vector
multiplication works at essentially the speed of scalar
multiplication (the processor can do 1 scalar multiplication per
clock and 1 vector multiplication per 4 clocks). There should be some
gain from doing the other operations in vector fashion, but clearly
on such a processor the gain from vector code is limited.
BTW4: Ideally we should have tens of routines in a similar spirit to
cover various use cases. Multiply that by the number of processor
architectures and there is a combinatorial explosion.
#include <stdint.h>

void
vector_combination(uint32_t * v1, uint32_t c1, uint32_t * v2, uint32_t c2,
                   uint32_t * res, int n, uint32_t p, uint32_t q) {
    int i;
    for(i = 0; i < n; i++) {
        uint64_t s1 = ((uint64_t)(v1[i]))*c1 + ((uint64_t)(v2[i]))*c2;
        uint64_t hs1 = s1 >> 32;
        uint64_t q1 = (hs1*q) >> 31;
        uint64_t r1 = s1 - q1*p;
        r1 = (r1 >= p)?(r1 - p):r1;
        /* May need also a second correction:
        r1 = (r1 >= p)?(r1 - p):r1;
        */
        res[i] = r1;
    }
}
--
Waldek Hebisch
--
You received this message because you are subscribed to the Google Groups
"FriCAS - computer algebra system" group.
To view this discussion visit
https://groups.google.com/d/msgid/fricas-devel/ZyrM5Q6PUQTahq2z%40fricas.org.