On 22.09.2011 20:19, Marco Leise wrote:
On 22.09.2011 at 19:26, Peter Alexander
<peter.alexander...@gmail.com> wrote:

On 22/09/11 7:39 AM, Don wrote:
On 22.09.2011 05:24, a wrote:
How would one do something like this without intrinsics (the code is
C++ using GCC vector extensions):

[snip]
At present, you can't do it without ultimately resorting to inline asm.
But, what we've done is to move SIMD into the machine model: the D
machine model assumes that float[4] + float[4] is a more efficient
operation than a loop.
Currently, only arithmetic operations are implemented, and on DMD at
least, they're still not proper intrinsics. So in the long term it'll be
possible to do it directly, but not yet.
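
For readers unfamiliar with the array operations mentioned above, here is a minimal sketch of what "float[4] + float[4]" looks like in D source. The function name addLanes is made up for illustration. Whether the compiler actually emits a single SIMD add for it depends on the compiler, its version, and its flags, so treat this as a statement of intent rather than a guarantee.

```d
// Hypothetical helper: element-wise add of two 4-float vectors,
// written as one array operation instead of an explicit loop.
float[4] addLanes(const float[4] a, const float[4] b)
{
    float[4] result;
    // The [] on each operand marks this as an array-wise operation
    // over all four lanes.
    result[] = a[] + b[];
    return result;
}

unittest
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [10.0f, 20.0f, 30.0f, 40.0f];
    assert(addLanes(a, b) == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

The unittest block can be run with something like "dmd -unittest -main file.d".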

At various times, several of us have implemented 'swizzle' using CTFE,
giving you a syntax like:

float[4] x, y;
x[] = y[].swizzle!"cdcd"();
// x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

which compiles to a single shufps instruction.
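
A rough sketch of how such a CTFE-based swizzle might be written follows. It is not any of the implementations referred to above. The names laneIndices and swizzle, and the choice to take a float[4] by value rather than a slice, are assumptions made for illustration. The sketch only shows the compile-time decoding of the pattern string; whether the result ends up as a single shufps is up to the compiler and optimizer.

```d
// Hypothetical helper, evaluated at compile time (CTFE):
// turns a pattern such as "cdcd" into lane indices [2, 3, 2, 3].
size_t[4] laneIndices(string pattern)
{
    assert(pattern.length == 4, "pattern must name exactly four lanes");
    size_t[4] idx;
    foreach (i, ch; pattern)
    {
        assert(ch >= 'a' && ch <= 'd', "lanes are named a, b, c, d");
        idx[i] = ch - 'a';
    }
    return idx;
}

// Hypothetical swizzle: the pattern is a template argument, so the
// index table is a compile-time constant and a bad pattern is
// rejected at compile time.
float[4] swizzle(string pattern)(const float[4] src)
{
    // A static immutable initializer forces CTFE of laneIndices.
    static immutable size_t[4] idx = laneIndices(pattern);
    float[4] result;
    foreach (i; 0 .. 4)
        result[i] = src[idx[i]];
    return result;
}

unittest
{
    float[4] y = [10.0f, 20.0f, 30.0f, 40.0f];
    float[4] x = y.swizzle!"cdcd"();
    // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
    assert(x == [30.0f, 40.0f, 30.0f, 40.0f]);
}
```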

How can it compile into a single shufps? x and y would need to already
be in vector registers, and unless I've missed something, they won't
be. You'll need instructions to load them into registers (using the
slow movups, because 16-byte alignment isn't guaranteed), then the
shufps, then a store back out to memory.

This is too slow for performance critical code.

Keeping values in XMM registers from creation, and passing and
returning them in XMM registers across function calls, is a key
requirement for this sort of code. If you have to keep moving them in
and out of memory, you lose all the performance benefit.

I thought about this. Either write long functions, so you don't have to
load and unload often, or just make the functions assume that the
parameters are already in registers, without explicit declaration.

Yeah, at the moment you have to work at a higher level; you can't just do a single instruction on its own.
