On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
How does one optimize code to make full use of the CPU's SIMD capabilities? Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc.) To give some context, here is a sample of one of the functions that could benefit from better SIMD usage:
import std.math : abs, sqrt;

float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
  float distance = 0; // float.init is NaN, so it must be initialized
  a[] -= b[];
  a[] *= a[];
  static foreach (size_t i; 0 .. 3/+typeof(a).length+/) {
      distance += a[i].abs; // abs required by the caller
  }
  return sqrt(distance);
}

The generated assembly:
vmovsd xmm0,qword ptr ds:[rdx]
vmovss xmm1,dword ptr ds:[rdx+8]
vmovsd xmm2,qword ptr ds:[rcx+4]
vsubps xmm0,xmm0,xmm2
vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
vmulps xmm0,xmm0,xmm0
vmulss xmm1,xmm1,xmm1
vbroadcastss xmm2,dword ptr ds:[<__real@7fffffff>]
vandps xmm0,xmm0,xmm2
vpermilps xmm3,xmm0,F5
vaddss xmm0,xmm0,xmm3
vandps xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vsqrtss xmm0,xmm0,xmm0
vmovaps xmm6,xmmword ptr ss:[rsp+20]
add rsp,38
ret

I've also experimented with dynamic arrays of float[3], but the output assembly seemed to be worse.[1] (In short, it calls internal D functions that use scalar "vxxxss" instructions while performing many moves.)

Big thanks
[1] https://run.dlang.io/is/F3Xye3

If you are developing for deployment to a platform that has a GPU, you might consider going SIMT (dcompute) rather than SIMD. SIMT is a lot easier on the eyes. More importantly, if you're targeting an SoC or console, or have relatively chunky computations that let you work around the PCIe transit costs, the path is open to very large performance improvements. I've only been using dcompute for a week or so, but so far it's been great.

If your algorithms are very branchy, or you decide to stick with multi-core/SIMD for any of a number of other good reasons, here are a few things I learned before decamping to dcompute land that may help:

1) LDC is pretty good at auto vectorization, as you have probably observed. It's definitely worth a few iterations to try to get the vectorizer engaged.
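
For instance, here's the kind of loop shape the vectorizer usually catches (a minimal sketch; the function name and the ldc2 flags are my suggestion, not something from this thread):

// Built with something like `ldc2 -O3 -mcpu=native`.
void saxpy(float[] y, const(float)[] x, float a) {
    assert(x.length == y.length);
    // Contiguous, branch-free, unit-stride: prime auto-vectorization material.
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}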

2) LDC auto vectorization is good, but explicit __vector programming is more predictable and was, at least for my tasks, much faster. I couldn't persuade the auto vectorizer to "do the right thing" throughout the hot path, but perhaps you'll have better luck.
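
To illustrate, here's roughly what an explicit __vector version of the distance function above could look like (a sketch, assuming LDC and an SSE-capable target; padding the float[3] into a fourth zero lane is my choice, not something from the post):

import core.simd : float4;
import std.math : sqrt;

float euclideanDistanceVec(const float[3] a, const float[3] b) {
    float4 va = [a[0], a[1], a[2], 0.0f]; // pad the unused lane with zero
    float4 vb = [b[0], b[1], b[2], 0.0f];
    float4 d  = va - vb;                  // packed subtract (vsubps)
    float4 sq = d * d;                    // packed multiply (vmulps)
    float[4] s = sq.array;                // squares are already non-negative
    return sqrt(s[0] + s[1] + s[2]);
}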

3) LDC does a good job of converting between T[N] <==> __vector(T[N]), so using the static array types as your input/output types and the __vector types as your compute types works out well whenever you have to interface with an unaligned world. LDC issues unaligned vector loads/stores for casts or full array assigns (v = cast(VT)sa[]; or v[] = sa[];), and these are quite cheap on modern CPUs. For calibration: IIRC, Ethan recently talked about a 10% gain he experienced from using fully aligned data, so there's that.
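
In code, that boundary pattern looks something like this (a sketch; the load is the cast form from the post, the store-back via .array is my choice, and the names are illustrative):

import core.simd : float4;

alias VT = float4;

void boundary(ref float[4] sa) {
    VT v = cast(VT) sa[]; // LDC emits an unaligned vector load for this cast
    v += v;               // ... compute in vector registers ...
    sa = v.array;         // copy the result back out to the static array
}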

4) LDC also does a good job of discovering SIMD equivalents given static foreach unrolled loops with explicit compile-time indexing of vector element operands. You can use those along with pragma(inline, true) to develop your own "intrinsics" that supplement other libs.
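
For example, a homegrown element-wise max along these lines (a sketch; whether LDC collapses it into a single maxps/vmaxps depends on target and flags):

import core.simd : float4;

pragma(inline, true)
float4 max4(float4 a, float4 b) {
    float4 r;
    static foreach (i; 0 .. 4)            // unrolled at compile time
        r[i] = a[i] > b[i] ? a[i] : b[i]; // explicit compile-time lane index
    return r;
}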

5) If you adopt the __vector approach you'll have to handle the partials manually (array length % vector length != 0 leaves a partial or tail fragment). If the classic copying/padding approaches to such fragmentation don't work for you, I'd suggest using nested static functions that take ref T[N] inputs and outputs. The main loop becomes very simple, and the tail handling reduces to loading stack-allocated T[N] variables explicitly, calling the static function, and unloading.
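
A sketch of that pattern (function and variable names are mine; the vector cast is the LDC form from point 3):

import core.simd : float4;

void scaleAll(float[] data, float factor) {
    static void doChunk(ref float[4] chunk, float factor) {
        float4 f = factor;               // scalar broadcast
        float4 v = cast(float4) chunk[]; // unaligned load
        v *= f;
        chunk = v.array;                 // store back
    }

    size_t i = 0;
    for (; i + 4 <= data.length; i += 4)          // very simple main loop
        doChunk(*cast(float[4]*) &data[i], factor);

    if (i < data.length) {                        // tail fragment
        float[4] tmp = 0;                         // stack-allocated staging
        tmp[0 .. data.length - i] = data[i .. $]; // load explicitly
        doChunk(tmp, factor);
        data[i .. $] = tmp[0 .. data.length - i]; // unload
    }
}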

Good luck.

