Manu wrote:
I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...

I've tried all combinations with align() before and inside the struct, with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm unaware of, I don't think it's been adjusted to fix any alignment issues with SIMD stuff. It would be great to be able to wrap float4 into a struct, but for now I've come up with an easy and understandable alternative using SIMD types directly.


Another suggestion I might make is to write DMD intrinsics that mirror the GDC code in std.simd and use that, then I'll sort out any performance problems as soon as I have all the tools I need to finish the module :)

Sounds like a good idea. I'll try to keep my code in line with yours to make transitioning to it easier when it's complete.


And this is precisely what I suggest you don't do. x64-SSE is the only architecture that can reasonably tolerate this (although it's still not the most efficient way). So if portability is important, you need to find another way.

A 'proper' way to do this is something like:
float4 wideScalar = loadScalar(scalar); // this function loads a float into all 4 components
// Note: this is a little slow; factor these float->vector loads outside the hot loops as far as is practical.

float4 vecX = getX(vec); // we can make shorthand for this, like 'vec.xxxx' for instance...
vecX += wideScalar; // all 4 components maintain the same scalar value, so you can apply them back to non-scalar vectors later

With this, there are 2 typical uses. One is to scale another vector by your scalar, for instance:
someOtherVector *= vecX; // perform a scale of a full 4d vector by our 'wide' scalar

The other, less common operation is that you may want to directly set a component of another vector to the scalar, setting Y to lock something to a height map for instance:
someOtherVector = setY(someOtherVector, wideScalar); // note: it is still important that you have a 'wide' scalar in this case for portability, since different architectures have very different interleave operations.
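The 'wide scalar' pattern described above can be sketched in D roughly as follows. This is a minimal illustration, assuming a current compiler with core.simd support on the target; the loadScalar and scaleAll helpers here are hypothetical stand-ins and not the std.simd implementation. The point is only that the float->vector load happens once, outside the loop, and the loop body never touches an individual component.

```d
import core.simd;

// Hypothetical helper mirroring the `loadScalar` named above:
// broadcasts one float into all four lanes.
float4 loadScalar(float s)
{
    float4 wide = s; // a scalar initializer is broadcast across the vector
    return wide;
}

// Scales every vector in `data` by `scalar`. The float->vector load is
// hoisted out of the loop so its cost is paid only once.
void scaleAll(float4[] data, float scalar)
{
    float4 wide = loadScalar(scalar);
    foreach (ref v; data)
        v *= wide; // one packed multiply per element, no per-lane access
}

unittest
{
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4[] data = [a, a];
    scaleAll(data, 2.0f);
    assert(data[0].array == [2.0f, 4.0f, 6.0f, 8.0f]);
    assert(data[1].array == [2.0f, 4.0f, 6.0f, 8.0f]);
}
```

With DMD this can be tried with something like `dmd -unittest -main -run file.d`; the generated code has not been inspected here, so treat it as a shape to aim for rather than a performance claim.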

Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express what you as a programmer intuitively want as convenient operations. Most SIMD hardware has absolutely no connection between the FPU and the SIMD unit, so moving a value between them means loads and stores to memory, and this in turn introduces another set of performance hazards.
x64 is actually the only architecture that does allow interaction between the FPU and the SIMD unit. Even there, though, it's no less efficient to do it how I describe, and as a bonus, your code will be portable.

Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However, I'm also a bit confused. At some point, like in your heightmap example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?
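For reference, one common shape for the kind of masking asked about here is to keep everything in vector registers and select lanes with bitwise AND/OR against constant masks (SSE has ANDPS/ANDNPS/ORPS for this, and SSE4.1 adds BLENDPS). The following is only a sketch under assumptions: setY and addY are hypothetical helpers, not std.simd functions, and which instructions a given compiler emits for them has not been checked.

```d
import core.simd;

// Hypothetical helper: returns `v` with lane 1 (Y) replaced by lane 1 of
// `wide`, using only whole-register bitwise operations.
float4 setY(float4 v, float4 wide)
{
    // All-ones bits select a lane, all-zero bits reject it.
    int4 takeY = [0, -1, 0, 0];
    int4 keepXZW = [-1, 0, -1, -1];
    // Reinterpret the float bits as integers, blend, and reinterpret back.
    int4 blended = (cast(int4) v & keepXZW) | (cast(int4) wide & takeY);
    return cast(float4) blended;
}

// Hypothetical helper: adds the wide scalar to the Y lane only, by zeroing
// the other lanes of `wide` before a normal packed add.
float4 addY(float4 v, float4 wide)
{
    int4 takeY = [0, -1, 0, 0];
    float4 onlyY = cast(float4)(cast(int4) wide & takeY); // (0, s, 0, 0)
    return v + onlyY;
}

unittest
{
    float4 v = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 h = 10.0f; // wide scalar: 10 in every lane
    float4 r = setY(v, h);
    assert(r.array == [1.0f, 10.0f, 3.0f, 4.0f]);
    float4 s = addY(v, h);
    assert(s.array == [1.0f, 12.0f, 3.0f, 4.0f]);
}
```

Both helpers take the scalar already broadcast into a float4, which matches the 'wide scalar' advice quoted above: the single component is never pulled out into a float.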

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:

@property @trusted pure nothrow
{
  // Getters return by ref, so `vec.x += s` can write straight into the lane.
  auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
  auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
  auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
  auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

  // Setters take `ref T` so that T can be deduced from the argument.
  void x(T:float4)(ref T v, float val) { v.ptr[0] = val; }
  void y(T:float4)(ref T v, float val) { v.ptr[1] = val; }
  void z(T:float4)(ref T v, float val) { v.ptr[2] = val; }
  void w(T:float4)(ref T v, float val) { v.ptr[3] = val; }
}

I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar; // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.


Inline asm is usually less efficient for large blocks of code: it requires that you hand-tune the opcode sequencing, which is very hard to do, particularly for SSE.
Small inline asm blocks are also usually less efficient, since most compilers can't rearrange other code within the function around the asm block, which leads to poor opcode sequencing.
I recommend avoiding inline asm where performance is desired unless you're confident in writing the ENTIRE function/loop in asm and hand-tuning the opcode sequencing. But that's not portable...

Yes, after a bit of messing around with and researching ASM yesterday, I came to the conclusion that it's not a good fit for this. DMD can't inline functions with ASM blocks right now anyways (although LDC can), which would kill any performance gains SSE brings, I'd imagine.

Plus, ASM is a pain in the ass. :-)


Android? But you're benchmarking x64-SSE, right? I don't think it's reasonable to expect that performance characteristics for one architecture's SIMD hardware will be any indicator at all of how another architecture may perform.

I only meant that, since Mono C# is what we're using for our game code on any platform besides Windows/WP7/Xbox, and since Android has been really the only performance PITA for our Mono C# code, upgrading our Vector libraries to use Mono.Simd should yield significant improvements there.

I'm just learning about SSE and proper vector utilization. In our last game we actually used Vector3's everywhere :-V , which even we should have known not to, because you have to convert them to float4's anyways to pass them into shader constants... I'm guessing this was our main performance issue with smartphones... ahh, oh well.


Also, if you're doing any of the stuff I've been warning against above, NEON will suffer very hard, whereas x64-SSE will mostly shrug it off.

I'm very interested to hear your measurements when you try it out!

I'll let you know if changing over to proper Vector code makes huge changes.


wtf indeed! O_o

Can you paste the disassembly?

I'm not sure how to do that with DMD. I remember GDC has an output-to-asm flag, but not DMD. Or is there an external tool you use to look at .o/.obj files?


I can tell you this though: as soon as DMD's SIMD support is able to do the missing stuff I need to complete std.simd, I shall do that, along with intensive benchmarks where I'll be scrutinising the code-gen very closely. I expect performance peculiarities like you are seeing will be found and fixed at that time...

For now I've come to terms with using core.simd.float4 types directly and have created acceptable solutions to my original problems. But I'm glad to hear that in the future I'll have more flexibility within my libraries.
