On 11/2/2012 3:50 AM, Jens Mueller wrote:
> Okay. For me they look the same. Can you elaborate, please? Assume I
> want to add two float vectors, which is common in both games and
> scientific computing. The only difference is that in games their length
> is usually 3 or 4, whereas in scientific computing they are of arbitrary
> length. Why do I need intrinsics to support the game setting?

Another excellent question.

Most languages have taken the "auto-vectorization" approach: the compiler reverse engineers loops back into high level constructs, and then compiles those into special SIMD instructions.
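
To make that concrete (my illustration, not part of the original post), here is the kind of scalar loop an auto-vectorizer has to recognize and rewrite:

    // A plain scalar loop. Before it can use packed SIMD instructions, an
    // auto-vectorizing compiler has to prove the iterations are independent
    // (e.g. that c doesn't overlap a or b).
    void addLoop(float[] c, const(float)[] a, const(float)[] b)
    {
        foreach (i; 0 .. c.length)
            c[i] = a[i] + b[i];
    }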

How to do this is explained in detail in the (rare) book "The Software Vectorization Handbook" by Bik, which I fortunately was able to obtain a copy of.

This struck me as a terrible approach, however. It just seemed stupid to try to teach the compiler to reverse engineer low level code into high level code. A better design would be to start with high level code. Hence, the appearance of D vector operations.
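
For comparison (again my sketch, not Walter's), the same computation with D's array operations, which is what "D vector operations" refers to here:

    // The intent is stated directly, so there is no loop for the compiler
    // to reverse engineer.
    void addArrays(float[] c, const(float)[] a, const(float)[] b)
    {
        c[] = a[] + b[];
    }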

The trouble with D vector operations, however, is that they are too general purpose. SIMD instructions are quirky, and it's easy to unwittingly and silently cause the compiler to generate terribly slow code. The reasons are the alignment requirements, coupled with the SIMD instructions not being orthogonal - some operations exist for some types and not for others, in ways that are unintuitive unless you read the SIMD specs carefully.

Just saying align(16) isn't good enough, as the vector ops work on slices and those slices aren't always aligned. So each one has to check alignment at runtime, which is murder on performance.
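
A sketch of the problem (illustrative only, not from the post) - the backing storage can be aligned while the slices handed to the array op are not:

    align(16) float[64] data;        // storage is 16-byte aligned

    void addSlices()
    {
        float[] a = data[0 .. 16];   // starts at byte offset 0: aligned
        float[] b = data[17 .. 33];  // byte offset 68: not 16-byte aligned
        float[] c = data[34 .. 50];  // byte offset 136: not 16-byte aligned
        c[] = a[] + b[];             // so the generated code must test the
                                     // alignment of each slice at run time
    }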

If a particular vector op for a particular type has no SIMD support, then the compiler has to generate workaround code. This can also have terrible performance consequences.
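
One hedged example of such a hole (mine, not the post's): SSE/AVX have no packed integer divide at all, so this array op can only become ordinary scalar code:

    void divArrays(int[] c, const(int)[] a, const(int)[] b)
    {
        // No packed integer division instruction exists, so this compiles
        // to an element-by-element loop - same syntax, none of the speedup.
        c[] = a[] / b[];
    }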

So the user writes vector code, benchmarks it, finds zero improvement, and the reasons why remain elusive to anyone but an expert SIMD programmer.

(Auto-vectorizing technology has similar issues, pretty much meaning you won't get fast code out of it unless you've got a habit of examining the assembler output and tweaking as necessary.)

Enter Manu, who has a lot of experience making SIMD work for games. His proposal was:

1. Have native SIMD types. This will guarantee alignment, and will guarantee a compile time error for SIMD types that are not supported by the CPU.

2. Have the compiler issue an error for SIMD operations that are not supported by the CPU, rather than silently generating inefficient workaround code.

3. There are all kinds of weird but highly useful SIMD instructions that don't have a straightforward representation in high level code, such as saturated arithmetic. Manu's answer was to expose these instructions via intrinsics, so the user can string them together and be sure they will generate real SIMD instructions, while the compiler deals with register allocation (a short sketch follows this list).
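
Here's a minimal sketch of what that looks like with DMD's core.simd (my example; LDC and GDC share the vector types but use different intrinsics):

    import core.simd;

    version (D_SIMD)
    {
        // Native SIMD types: guaranteed alignment, and using a type the
        // target CPU doesn't support is a compile-time error instead of
        // silent fallback code.
        float4 add4(float4 a, float4 b)
        {
            return a + b;               // a single ADDPS
        }

        // Quirky instructions with no high-level spelling are reached
        // through the __simd intrinsic; the compiler still handles
        // register allocation.
        short8 saturatedAdd(short8 a, short8 b)
        {
            return cast(short8) __simd(XMM.PADDSW, a, b);  // add with saturation
        }
    }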

This approach works, is inlineable, generates code as good as hand-built assembler, and is usable by regular programmers.

I won't say there aren't better approaches, but this one we know works.
