== Quote from Andrei Alexandrescu (seewebsiteforem...@erdani.org)'s article
> On 9/22/11 1:39 AM, Don wrote:
> > On 22.09.2011 05:24, a wrote:
> > > How would one do something like this without intrinsics (the code is
> > > c++ using gcc vector extensions):
> >
> > [snip]
> > At present, you can't do it without ultimately resorting to inline asm.
> > But, what we've done is to move SIMD into the machine model: the D
> > machine model assumes that float[4] + float[4] is a more efficient
> > operation than a loop.
> > Currently, only arithmetic operations are implemented, and on DMD at
> > least, they're still not proper intrinsics. So in the long term it'll be
> > possible to do it directly, but not yet.
> >
> > At various times, several of us have implemented 'swizzle' using CTFE,
> > giving you a syntax like:
> >
> > float[4] x, y;
> > x[] = y[].swizzle!"cdcd"();
> > // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
> >
> > which compiles to a single shufps instruction.
> >
> > That "cdcd" string is really a tiny DSL: the language consists of four
> > characters, each of which is a, b, c, or d.
>
> I think we should put swizzle in std.numeric once and for all. Is anyone
> interested in taking up that task?
>
> > A couple of years ago I made a DSL compiler for BLAS1 operations. It was
> > capable of doing some pretty wild stuff, even then. (The DSL looked like
> > normal D code).
> > But the compiler has improved enormously since that time. It's now
> > perfectly feasible to make a DSL for the SIMD operations you need.
> >
> > The really nice thing about this, compared to normal asm, is that you
> > have access to the compiler's symbol table. This lets you add
> > compile-time error messages, for example.
> >
> > A funny thing about this, which I found after working on the DMD
> > back-end, is that it is MUCH easier to write an optimizer/code generator
> > in a DSL in D, than in a compiler back-end.
>
> A good argument for (a) moving stuff from the compiler into the library,
> (b) continuing Don's great work on making CTFE a solid proposition.
>
> Andrei

This sounds really dangerous to me.
I really like the idea that CTFE can be used to produce some pretty
powerful almost-like-intrinsics code, but applying it in this context
sounds like a really bad idea.

Firstly, so I'm not misunderstanding: is this suggestion building on
Don's previous post, where float[4] is somehow intercepted and
special-cased by the compiler, reinterpreting it as a candidate for
hardware vector operations?

I think that's the wrong decision in itself, and a poor foundation
for this approach.
Let me try to convince you that the language should have an explicit
hardware vector type, and not attempt to make use of any clever
language tricks...

If float[4] is considered a hardware vector by the compiler:
  - How do I define an ACTUAL float[4]?
  - How can I be confident that it actually WILL be a hardware vector?

Hardware vectors are NOT float[4]s. They are references to a 128-bit
hardware register upon which various vector operations may be
supported; they are probably aligned, and they are only accessible
in 128-bit quantities. I think they should be explicitly defined as
such.

They may be float4, u/int4, u/short8, u/byte16, double2... All these
types are interchangeable within a single register. Do you intend to
special-case fixed-length arrays of all those types to support the
hardware functionality for each?

Hardware vectors are NOT floats: they cannot interact with the
floating point unit. Element access in the style of 'float x =
myVector[0]' is NOT supported by the hardware, and it should not be
exposed to the programmer as a trivial possibility. This seemingly
harmless line of code will undermine the entire reason for using
vector hardware in the first place.

Allowing easy access to individual floats within a hardware vector
breaks the language's stated premise that the path of least
resistance also be the 'correct', optimal choice; a single seemingly
simple line of code may ruin the entire function.

float[4] is not even a particularly conveniently sized vector for
most inexperienced programmers; the majority will want float[3].
That is NOT a trivial mapping to float[4]. Programmers should be
well aware that there is inherent complexity in using the hardware
vector architecture, and be forced to think it through.

Most inexperienced programmers think of the results of operations
like dot product and magnitude as scalar values, but they are not:
they are a scalar value repeated across all 4 components of a
vector, and this should be explicit too.

....

I know I'm nobody around here, so I can't expect to be taken too
seriously, but I'm very excited about the language, so here's what I
would consider instead:

Add a hardware vector type; let's call it 'v128' for the exercise. It
is a primitive type, aligned by definition, and does not have any
members.
You may use it to refer to hardware vector registers explicitly, as
vector register function arguments, or as arguments to inline asm
blocks.

Add some functions to the standard library (ideally implemented as
compiler intrinsics) which do very specific stuff to vectors, and
ideally expandable by hardware vendors or platform holders.

You might want to have some classes in the standard library which
wrap said v128, and expose the concept as a float4, int4, byte16,
etc. These classes would provide maths operators, comparisons,
initialisation and immediate assignment, and casts between various
vector types.

Different vector units support completely different methods of
permutation. I would be very careful about adding intrinsic support
for generalised permutation into the library; if anything, at least
leave the capability of implementing architecture-specific
permutation intrinsics to the hardware vendors/platform holders.

At the end of the day, it is imperative that the code generator and
optimiser still retain the concept of hardware vectors, and can
perform appropriate load/store elimination, apply hardware specific
optimisation to operations like permutes/swizzles, component
broadcasts, load immediates, etc.

....

The reason I am so adamant about this is that on almost all
architectures (SSE is the most tolerant by far), using the hardware
vector unit is an all-or-nothing choice. If you interact between the
vector and float registers, you will almost certainly end up with
slower code than if you had just used the float unit outright. Also,
since people usually use hardware vectors in areas of extreme
performance optimisation, it's not tolerable for the compiler to be
making mistakes. As a minimum, the programmer needs to be able to
explicitly address the vector registers, pass them to and from
functions, and perform explicit (probably IHV-supplied) functions on
them. The code generator and optimiser need all the information
possible, as explicitly as possible, so IHVs can implement the best
possible support for their architecture. The API should reflect
this, and not allow easy access to functionality that would violate
hardware support.

Ease of programming should be a SECONDARY goal, at which point
something like the typed wrapper classes I described would come in:
allowing maths operators, comparisons and all I mentioned above,
i.e. making them look like a real mathematical type, while still
keeping their distance from the primitive float/int types to
discourage interaction.

I hope this doesn't sound too much like an overly long rant! :)
And hopefully I've managed to sell my point...

Don: I'd love to hear counter arguments to justify float[4] as a
reasonable solution. Currently no matter how I slice it, I just
can't see it.

Criticism welcome?

Cheers!
- Manu
