On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD capabilities? Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc.) To give some context, here is a sample of one of the functions that could benefit from better SIMD usage:
import std.math : abs, sqrt;

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance = 0; // float.init is NaN, so it must be initialized
  a[] -= b[];
  a[] *= a[];
  static foreach (size_t i; 0 .. 3/+typeof(a).length+/) {
      distance += a[i].abs; // abs required by the caller
  }
  return sqrt(distance);
}

The generated assembly:
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

I've also experimented with dynamic arrays of float[3], but the output assembly seemed to be worse.[1] (In short, it calls internal D functions that use scalar "vxxxss" instructions while performing many moves.)

Big thanks
[1] https://run.dlang.io/is/F3Xye3

If you are developing for deployment to a platform that has a GPU, you might consider going SIMT (dcompute) rather than SIMD. SIMT is a lot easier on the eyes. More importantly, if you're targeting an SoC or console, or have relatively chunky computations that let you work around the PCIe transit costs, the path is open to very large performance improvements. I've only been using dcompute for a week or so, but so far it's been great.

If your algorithms are very branchy, or you decide to stick with multi-core/SIMD for any of a number of other good reasons, here are a few things I learned before decamping to dcompute land that may help:

1) LDC is pretty good at auto vectorization, as you have probably observed. It's definitely worth a few iterations to try to get the vectorizer engaged.
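
For instance, here's the kind of loop shape the vectorizer usually catches (a minimal sketch; the function name and the ldc2 flags are my suggestion, not something from this thread):

// Built with something like `ldc2 -O3 -mcpu=native`.
void saxpy(float[] y, const(float)[] x, float a) {
    assert(x.length == y.length);
    // Contiguous, branch-free, unit-stride: prime auto-vectorization material.
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}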

2) LDC auto vectorization is good, but explicit __vector programming is more predictable and was, at least for my tasks, much faster. I couldn't persuade the auto vectorizer to "do the right thing" throughout the hot path, but perhaps you'll have better luck.
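
To illustrate, here's roughly what an explicit __vector version of the distance function above could look like (a sketch, assuming LDC and an SSE-capable target; padding the float[3] into a fourth zero lane is my choice, not something from the post):

import core.simd : float4;
import std.math : sqrt;

float euclideanDistanceVec(const float[3] a, const float[3] b) {
    float4 va = [a[0], a[1], a[2], 0.0f]; // pad the unused lane with zero
    float4 vb = [b[0], b[1], b[2], 0.0f];
    float4 d  = va - vb;                  // packed subtract (vsubps)
    float4 sq = d * d;                    // packed multiply (vmulps)
    float[4] s = sq.array;                // squares are already non-negative
    return sqrt(s[0] + s[1] + s[2]);
}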

3) LDC does a good job of converting between T[N] <==> __vector(T[N]), so using the static array types as your input/output types and the __vector types as your compute types works out well whenever you have to interface with an unaligned world. LDC issues unaligned vector loads/stores for casts or full array assigns (v = cast(VT)sa[]; or v[] = sa[];), and these are quite cheap on modern CPUs. For calibration: IIRC, Ethan recently talked about a 10% gain he experienced from using fully aligned data, so there's that.
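
In code, that boundary pattern looks something like this (a sketch; the load is the cast form from the post, the store-back via .array is my choice, and the names are illustrative):

import core.simd : float4;

alias VT = float4;

void boundary(ref float[4] sa) {
    VT v = cast(VT) sa[]; // LDC emits an unaligned vector load for this cast
    v += v;               // ... compute in vector registers ...
    sa = v.array;         // copy the result back out to the static array
}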

4) LDC also does a good job of discovering SIMD equivalents given static foreach unrolled loops with explicit compile-time indexing of vector element operands. You can use those along with pragma(inline, true) to develop your own "intrinsics" that supplement other libs.
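
For example, a homegrown element-wise max along these lines (a sketch; whether LDC collapses it into a single maxps/vmaxps depends on target and flags):

import core.simd : float4;

pragma(inline, true)
float4 max4(float4 a, float4 b) {
    float4 r;
    static foreach (i; 0 .. 4)            // unrolled at compile time
        r[i] = a[i] > b[i] ? a[i] : b[i]; // explicit compile-time lane index
    return r;
}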

5) If you adopt the __vector approach you'll have to handle the partials manually (array length % vector length != 0 leaves a partial or tail fragment). If the classic copying/padding approaches to such fragmentation don't work for you, I'd suggest using nested static functions that take ref T[N] inputs and outputs. The main loop becomes very simple, and the tail handling reduces to loading stack-allocated T[N] variables explicitly, calling the static function, and unloading.
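
A sketch of that pattern (function and variable names are mine; the vector cast is the LDC form from point 3):

import core.simd : float4;

void scaleAll(float[] data, float factor) {
    static void doChunk(ref float[4] chunk, float factor) {
        float4 f = factor;               // scalar broadcast
        float4 v = cast(float4) chunk[]; // unaligned load
        v *= f;
        chunk = v.array;                 // store back
    }

    size_t i = 0;
    for (; i + 4 <= data.length; i += 4)          // very simple main loop
        doChunk(*cast(float[4]*) &data[i], factor);

    if (i < data.length) {                        // tail fragment
        float[4] tmp = 0;                         // stack-allocated staging
        tmp[0 .. data.length - i] = data[i .. $]; // load explicitly
        doChunk(tmp, factor);
        data[i .. $] = tmp[0 .. data.length - i]; // unload
    }
}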

Good luck.

