On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat wrote:
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD capabilities? Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc.)

https://code.dlang.org/packages/intel-intrinsics

I'd try to use it, but the platform I'm building on requires AVX to get the most performance.
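For reference, intel-intrinsics mirrors the Intel intrinsic names, so packed multiply and square-root map directly to vmulps/vsqrtps-style instructions. A minimal sketch (the function name and pointer-based interface are my own choice, and AVX/256-bit coverage depends on the library version):

```d
import inteli.xmmintrin; // 128-bit SSE intrinsics via intel-intrinsics

// Square each element in place: load 4 floats, multiply, sqrt, store.
void squareThenRoot(float* data)
{
    __m128 v = _mm_loadu_ps(data); // unaligned load of 4 floats
    v = _mm_mul_ps(v, v);          // packed multiply (vmulps / mulps)
    v = _mm_sqrt_ps(v);            // packed square root (vsqrtps / sqrtps)
    _mm_storeu_ps(data, v);
}
```

The upside of this route is that the packed instruction choice is explicit rather than left to the auto-vectorizer.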

The code below might be worth a try on your AVX512 machine.

Unless you're looking for a combined result, you might need to separate out the memory-access overhead by running multiple passes over a data set sized to stay within L2 cache.

Also note that I compiled with -preview=in. I don't know if that matters.
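For completeness, a build line along these lines should work with LDC (the exact flags are my assumption, not from the original post; -mcpu=native only enables AVX-512 codegen if the host CPU supports it):

```shell
# Hypothetical LDC invocation: optimize, target the host CPU, enable -preview=in.
ldc2 -O3 -release -mcpu=native -preview=in app.d
```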



import std.math : sqrt;

// Vector width in bits; 256 (AVX) was tested, 512 (AVX-512) was not.
enum SIMDBits = 512;
alias A = float[SIMDBits / (float.sizeof * 8)];

// Structure-of-arrays Euclidean distance: each lane i computes
// a0[i] = sqrt((b1[i]-a1[i])^2 + (b2[i]-a2[i])^2 + (b3[i]-a3[i])^2)
pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A b1, in A b2, in A b3)
{
    alias V = __vector(A);

    // Element-wise square root; the static foreach unrolls at compile
    // time, letting the backend collapse it into a packed vsqrtps.
    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V) a;
    }

    // Squared per-lane difference of one coordinate.
    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}
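A quick sanity check of the routine above (the lane values are my own choice, not from the post): with every lane holding the pair (0,0,0) and (3,4,0), each output lane should be 5.

```d
import std.math : isClose;

enum SIMDBits = 512;
alias A = float[SIMDBits / (float.sizeof * 8)];

void main()
{
    // SoA layout: one array per coordinate, one point pair per lane.
    A ax = 0, ay = 0, az = 0;
    A bx = 3, by = 4, bz = 0;
    A dist;
    soaEuclidean(dist, ax, ay, az, bx, by, bz);

    // sqrt(3^2 + 4^2 + 0^2) = 5 in every lane.
    static foreach (i; 0 .. A.length)
        assert(isClose(dist[i], 5.0f));
}
```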

