Manu wrote:
I'm not sure why the performance would suffer when placing it in a struct. I suspect it's because the struct causes the vectors to become unaligned, and that impacts performance a LOT. Walter has recently made some changes to expand the capability of align() to do most of the stuff you expect should be possible, including aligning structs, and propagating alignment from a struct member to its containing struct. This change might actually solve your problems...
I've tried all combinations with align() before and inside the struct, with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm unaware of, I don't think it's been adjusted to fix any alignment issues with SIMD stuff. It would be great to be able to wrap float4 into a struct, but for now I've come up with an easy and understandable alternative using SIMD types directly.
Another suggestion I might make is to write DMD intrinsics that mirror the GDC code in std.simd and use that; then I'll sort out any performance problems as soon as I have all the tools I need to finish the module :)
Sounds like a good idea. I'll try and keep my code in line with yours to make transitioning to it easier when it's complete.
And this is precisely what I suggest you don't do. x64-SSE is the only architecture that can reasonably tolerate this (although it's still not the most efficient way). So if portability is important, you need to find another way.
A 'proper' way to do this is something like:

    // loadScalar() loads a float into all 4 components. Note: this is a
    // little slow; factor these float->vector loads outside the hot loops
    // as is practical.
    float4 wideScalar = loadScalar(scalar);

    // we can make shorthand for this, like 'vec.xxxx' for instance...
    float4 vecX = getX(vec);

    // all 4 components maintain the same scalar value, so you can apply
    // them back to non-scalar vectors later:
    vecX += wideScalar;
With this, there are 2 typical uses. One is to scale another vector by your scalar, for instance:

    someOtherVector *= vecX; // scale a full 4d vector by our 'wide' scalar
The other, less common operation is to directly set the scalar into a component of another vector, setting Y to lock something to a height map for instance:

    someOtherVector = setY(someOtherVector, wideScalar);

Note: it is still important that you have a 'wide' scalar in this case for portability, since different architectures have very different interleave operations.
Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express what you as a programmer intuitively want as convenient operations. Most SIMD hardware has absolutely no connection between the FPU and the SIMD unit, resulting in loads and stores to memory, and this in turn introduces another set of performance hazards.
x64 is actually the only architecture that does allow interaction between the FPU and SIMD, although it's still no less efficient to do it how I describe, and as a bonus, your code will be portable.
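The wide-scalar pattern above can be sketched with C SSE intrinsics for concreteness. The loadScalar/getX/setY names are kept from the D pseudocode above, but these C signatures are illustrative assumptions, not the std.simd API:

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Broadcast a float into all 4 lanes: the 'wide' scalar. */
static __m128 loadScalar(float s) {
    return _mm_set1_ps(s);
}

/* Replicate lane 0 across all lanes, i.e. 'vec.xxxx'. */
static __m128 getX(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Replace the Y lane of v with the Y lane of 'wide' using only shuffles,
   so the value never round-trips through the FPU or memory:
   t = (v.x, v.x, wide.y, wide.y), result = (v.x, wide.y, v.z, v.w). */
static __m128 setY(__m128 v, __m128 wide) {
    __m128 t = _mm_shuffle_ps(v, wide, _MM_SHUFFLE(1, 1, 0, 0));
    return _mm_shuffle_ps(t, v, _MM_SHUFFLE(3, 2, 2, 0));
}
```

Usage mirrors the snippet above: `_mm_add_ps(getX(vec), wideScalar)` keeps the same value in all four lanes, ready to scale another vector or be merged back in with setY, all without leaving the SIMD unit.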
Okay, that makes a lot of sense and is in line with what I was reading last night about FPU/SSE assembly code. However, I'm also a bit confused. At some point, like in your height-map example, I'm going to need to do arithmetic work on single vector components. Is there some sort of SSE arithmetic/shuffle instruction which uses "masking" that I should use to isolate and manipulate components?

If not, and manipulating components is just bad for performance reasons, then I've figured out a solution to my original concern. By using this code:
    @property @trusted pure nothrow
    {
        auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
        auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
        auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
        auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

        void x(T:float4)(ref float4 v, float val) { v.ptr[0] = val; }
        void y(T:float4)(ref float4 v, float val) { v.ptr[1] = val; }
        void z(T:float4)(ref float4 v, float val) { v.ptr[2] = val; }
        void w(T:float4)(ref float4 v, float val) { v.ptr[3] = val; }
    }
I am able to perform arithmetic on single components:

    auto vec = Vectors.float4(x, y, 0, 1); // factory
    vec.x += scalar;                       // += components

Again, I'll abandon this approach if there's a better way to manipulate single components, like you mentioned above. I'm just not aware of how to do that using SSE instructions alone. I'll do more research, but would appreciate any insight you can give.
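On the masking question: SSE can isolate a lane without going through memory or the FPU. A hedged sketch in C intrinsics (helper names here are illustrative, not std.simd API): the scalar-lane instruction forms like addss touch only lane 0, and an and/andnot/or blend applies an operation to any masked set of lanes.

```c
#include <xmmintrin.h> /* SSE  */
#include <emmintrin.h> /* SSE2, for the integer mask constant */

/* Add a wide scalar to lane 0 only: _mm_add_ss leaves lanes 1-3 of v untouched. */
static __m128 addToX(__m128 v, __m128 wide) {
    return _mm_add_ss(v, wide);
}

/* Apply the addition to whichever lanes the mask selects (all-ones bits =
   selected), then blend: (mask & (v + wide)) | (~mask & v). */
static __m128 maskedAdd(__m128 v, __m128 wide, __m128 mask) {
    __m128 sum = _mm_add_ps(v, wide);
    return _mm_or_ps(_mm_and_ps(mask, sum), _mm_andnot_ps(mask, v));
}

/* A mask selecting only the Y lane. */
static __m128 maskY(void) {
    return _mm_castsi128_ps(_mm_setr_epi32(0, -1, 0, 0));
}
```

For example, maskedAdd(v, _mm_set1_ps(s), maskY()) adds s to the Y component alone, entirely in registers; on SSE4.1 and later the three-op blend collapses into a single blendps.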
Inline asm is usually less efficient for large blocks of code; it requires that you hand-tune the opcode sequencing, which is very hard to do, particularly for SSE.

Small inline asm blocks are also usually less efficient, since most compilers can't rearrange other code within the function around the asm block, and this leads to poor opcode sequencing.

I recommend avoiding inline asm where performance is desired, unless you're confident in writing the ENTIRE function/loop in asm and hand-tuning the opcode sequencing. But that's not portable...
Yes, after a bit of messing around with and researching asm yesterday, I came to the conclusion that it's not a good fit for this. DMD can't inline functions with asm blocks right now anyway (although LDC can), which I'd imagine would kill any performance gains SSE brings.

Plus, asm is a pain in the ass. :-)
Android? But you're benchmarking x64-SSE, right? I don't think it's reasonable to expect that the performance characteristics of one architecture's SIMD hardware will be any indicator at all of how another architecture may perform.
I only meant that, since Mono C# is what we're using for our game code on any platform besides Windows/WP7/Xbox, and since Android has been really the only performance PITA for our Mono C# code, upgrading our vector libraries to use Mono.Simd should yield significant improvements there.

I'm just learning about SSE and proper vector utilization. In our last game we actually used Vector3s everywhere :-V , which even we should have known not to do, because you have to convert them to float4s anyway to pass them into shader constants... I'm guessing this was our main performance issue on smartphones. Ahh, oh well.
Also, if you're doing any of the stuff I've been warning against above, NEON will suffer very hard, whereas x64-SSE will mostly shrug it off. I'm very interested to hear your measurements when you try it out!
I'll let you know if changing over to proper vector code makes any huge difference.
wtf indeed! O_o
Can you paste the disassembly?
I'm not sure how to do that with DMD. I remember GDC has an output-to-asm flag, but not DMD. Or is there an external tool you use to look at .o/.obj files?
I can tell you this though: as soon as DMD's SIMD support is able to do the missing stuff I need to complete std.simd, I shall do that, along with intensive benchmarks where I'll be scrutinising the code-gen very closely. I expect performance peculiarities like you are seeing will be found and fixed at that time...
For now I've come to terms with using core.simd.float4 types directly, and have created acceptable solutions to my original problems. But I'm glad to hear that in the future I'll have more flexibility within my libraries.