On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat
wrote:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD
capabilities? Is there any way to guarantee that "packed" versions
of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc.)
https://code.dlang.org/packages/intel-intrinsics
I'd try to use it, but the platform I'm building on requires AVX
to get the most performance.
The code below might be worth a try on your AVX512 machine.
Unless you're looking for a combined result, you might need to
separate out the memory access overhead by running multiple
passes over a "known optimal for L2" data set.
Also note that I compiled with -preview=in. I don't know if that
matters.
import std.math : sqrt;

enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];

pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3,
                  in A b1, in A b2, in A b3)
{
    alias V = __vector(A);

    // Element-wise square root; unrolled across the lanes since
    // there is no vector sqrt primitive here.
    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V) a;
    }

    // Squared difference of one coordinate, all lanes at once.
    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}