How does one optimize code to make full use of the CPU's SIMD capabilities? Is there any way to guarantee that "packed" versions of SIMD instructions will be used?(e.g. vmulps, vsqrtps, etc...) To give some context, this is a sample of one of the functions that could benefit from better SIMD usage :
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance;
  a[] -= b[];
  a[] *= a[];
  static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
      distance += a[i].abs;//abs required by the caller
  }
  return sqrt(distance);
}
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

I've tried to experiment with dynamic arrays of float[3] but the output assembly seemed to be worse.[1](in short, it's calling internal D functions which use "vxxxss" instructions while performing many moves.)

Big thanks
[1] https://run.dlang.io/is/F3Xye3

Reply via email to