Manu wrote:
I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...
I've tried all combinations with align() before and inside the struct, with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm unaware of, I don't think it's been adjusted to fix any alignment issues with SIMD stuff. It would be great to be able to wrap float4 into a struct, but for now I've come up with an easy and understandable alternative using SIMD types directly.
Another suggestion I might make is to write DMD intrinsics that mirror the GDC code in std.simd and use that; then I'll sort out any performance problems as soon as I have all the tools I need to finish the module :)
Sounds like a good idea. I'll try and keep my code in line with yours to make transitioning to it easier when it's complete.
And this is precisely what I suggest you don't do. x64-SSE is the only architecture that can reasonably tolerate this (although it's still not the most efficient way). So if portability is important, you need to find another way.
A 'proper' way to do this is something like:

    // loadScalar() loads a float into all 4 components. Note: this is a
    // little slow; factor these float->vector loads outside the hot loops
    // as is practical.
    float4 wideScalar = loadScalar(scalar);

    // we can make shorthand for this, like 'vec.xxxx' for instance...
    float4 vecX = getX(vec);

    // all 4 components maintain the same scalar value, so you can apply
    // them back to non-scalar vectors later:
    vecX += wideScalar;
With this, there are 2 typical uses. One is to scale another vector by your scalar, for instance:

    someOtherVector *= vecX; // scale a full 4d vector by our 'wide' scalar
The other, less common operation is to directly set the scalar into a component of another vector, setting Y to lock something to a height map for instance:

    someOtherVector = setY(someOtherVector, wideScalar);

Note: it is still important that you have a 'wide' scalar in this case for portability, since different architectures have very different interleave operations.
Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express what you as a programmer intuitively want as convenient operations. Most SIMD hardware has absolutely no connection between the FPU and the SIMD unit, resulting in loads and stores to memory, and this in turn introduces another set of performance hazards.
x64 is actually the only architecture that does allow interaction between the FPU and SIMD, although it's still no less efficient to do it how I describe, and as a bonus, your code will be portable.
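The wide-scalar pattern above can be sketched with C SSE intrinsics for concreteness. The loadScalar/getX/setY names are kept from the D pseudocode above, but these C signatures are illustrative assumptions, not the std.simd API:

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Broadcast a float into all 4 lanes: the 'wide' scalar. */
static __m128 loadScalar(float s) {
    return _mm_set1_ps(s);
}

/* Replicate lane 0 across all lanes, i.e. 'vec.xxxx'. */
static __m128 getX(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Replace the Y lane of v with the Y lane of 'wide' using only shuffles,
   so the value never round-trips through the FPU or memory:
   t = (v.x, v.x, wide.y, wide.y), result = (v.x, wide.y, v.z, v.w). */
static __m128 setY(__m128 v, __m128 wide) {
    __m128 t = _mm_shuffle_ps(v, wide, _MM_SHUFFLE(1, 1, 0, 0));
    return _mm_shuffle_ps(t, v, _MM_SHUFFLE(3, 2, 2, 0));
}
```

Usage mirrors the snippet above: `_mm_add_ps(getX(vec), wideScalar)` keeps the same value in all four lanes, ready to scale another vector or be merged back in with setY, all without leaving the SIMD unit.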
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However, I'm also a bit confused. At some point, like in your height-map example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:
    @property @trusted pure nothrow
    {
        auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
        auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
        auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
        auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

        void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
        void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
        void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
        void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
    }
I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar;                       // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.
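On the masking question: SSE can isolate a lane without going through memory or the FPU. A hedged sketch in C intrinsics (helper names here are illustrative, not std.simd API): the scalar-lane instruction forms like addss touch only lane 0, and an and/andnot/or blend applies an operation to any masked set of lanes.

```c
#include <xmmintrin.h> /* SSE  */
#include <emmintrin.h> /* SSE2, for the integer mask constant */

/* Add a wide scalar to lane 0 only: _mm_add_ss leaves lanes 1-3 of v untouched. */
static __m128 addToX(__m128 v, __m128 wide) {
    return _mm_add_ss(v, wide);
}

/* Apply the addition to whichever lanes the mask selects (all-ones bits =
   selected), then blend: (mask & (v + wide)) | (~mask & v). */
static __m128 maskedAdd(__m128 v, __m128 wide, __m128 mask) {
    __m128 sum = _mm_add_ps(v, wide);
    return _mm_or_ps(_mm_and_ps(mask, sum), _mm_andnot_ps(mask, v));
}

/* A mask selecting only the Y lane. */
static __m128 maskY(void) {
    return _mm_castsi128_ps(_mm_setr_epi32(0, -1, 0, 0));
}
```

For example, maskedAdd(v, _mm_set1_ps(s), maskY()) adds s to the Y component alone, entirely in registers; on SSE4.1 and later the three-op blend collapses into a single blendps.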
Inline asm is usually less efficient for large blocks of code; it requires that you hand-tune the opcode sequencing, which is very hard to do, particularly for SSE.

Small inline asm blocks are also usually less efficient, since most compilers can't rearrange other code within the function around the asm block, and this leads to poor opcode sequencing.

I recommend avoiding inline asm where performance is desired, unless you're confident in writing the ENTIRE function/loop in asm and hand-tuning the opcode sequencing. But that's not portable...
Yes, after a bit of messing around with and researching asm yesterday, I came to the conclusion that it's not a good fit for this. DMD can't inline functions with asm blocks right now anyway (although LDC can), which I'd imagine would kill any performance gains SSE brings.

Plus, asm is a pain in the ass. :-)
Android? But you're benchmarking x64-SSE, right? I don't think it's reasonable to expect that the performance characteristics of one architecture's SIMD hardware will be any indicator at all of how another architecture may perform.
I only meant that, since Mono C# is what we're using for our game code on any platform besides Windows/WP7/Xbox, and since Android has been really the only performance PITA for our Mono C# code, upgrading our vector libraries to use Mono.Simd should yield significant improvements there.

I'm just learning about SSE and proper vector utilization. In our last game we actually used Vector3s everywhere :-V , which even we should have known not to do, because you have to convert them to float4s anyway to pass them into shader constants... I'm guessing this was our main performance issue on smartphones. Ahh, oh well.
Also, if you're doing any of the stuff I've been warning against above, NEON will suffer very hard, whereas x64-SSE will mostly shrug it off. I'm very interested to hear your measurements when you try it out!
I'll let you know if changing over to proper vector code makes any huge difference.
wtf indeed! O_o
Can you paste the disassembly?
I'm not sure how to do that with DMD. I remember GDC has an output-to-asm flag, but not DMD. Or is there an external tool you use to look at .o/.obj files?
I can tell you this though: as soon as DMD's SIMD support is able to do the missing stuff I need to complete std.simd, I shall do that, along with intensive benchmarks where I'll be scrutinising the code-gen very closely. I expect performance peculiarities like you are seeing will be found and fixed at that time...
For now I've come to terms with using core.simd.float4 types directly, and have created acceptable solutions to my original problems. But I'm glad to hear that in the future I'll have more flexibility within my libraries.