Optimizing for SIMD: best practices? (i.e. what features are allowed?)

z via Digitalmars-d-learn Thu, 25 Feb 2021 03:31:02 -0800

How does one optimize code to make full use of the CPU's SIMD capabilities? Is there any way to guarantee that "packed" versions of SIMD instructions will be used (e.g. vmulps, vsqrtps, etc.)?

To give some context, this is a sample of one of the functions that could benefit from better SIMD usage:

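import std.math : abs, sqrt;

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance = 0; // float.init is NaN in D, so initialize explicitly
  a[] -= b[]; // element-wise difference (a is a by-value copy)
  a[] *= a[]; // element-wise square
  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
      distance += a[i].abs;//abs required by the caller
  }
  return sqrt(distance);
}

vmovsd xmm0,qword ptr ds:[rdx]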
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

I've tried to experiment with dynamic arrays of float[3], but the output assembly seemed to be worse. [1] (In short, it's calling internal D functions which use "vxxxss" instructions while performing many moves.)


Big thanks
[1] https://run.dlang.io/is/F3Xye3
