Not to resurrect the dead, but I just wanted to share an article I came across concerning SIMD with Manu.

http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

QUOTE:

1. Returning results by value

By observing the intrinsics interface, a vector library must imitate that interface to maximize performance. Therefore, you must return the results by value and not by reference, as such:

    //correct
    inline Vec4 VAdd(Vec4 va, Vec4 vb)
    {
        return _mm_add_ps(va, vb);
    }

On the other hand, if the data is returned by reference, the interface will generate code bloat. The incorrect version is shown below:

    //incorrect (code bloat!)
    inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
    {
        vr = _mm_add_ps(va, vb);
    }

The reason you must return data by value is that the quad-word (128-bit) fits nicely inside one SIMD register. And one of the key factors of a vector library is to keep the data inside these registers as much as possible. By doing that, you avoid unnecessary load and store operations from SIMD registers to memory or FPU registers. When combining multiple vector operations, the "returned by value" interface allows the compiler to optimize these loads and stores easily by minimizing SIMD to FPU or memory transfers.

2. Data Declared "Purely"

Here, "pure data" is defined as data declared outside a "class" or "struct" by a simple "typedef" or "define". When I was researching various vector libraries before coding VMath, I observed one common pattern among all libraries I looked at during that time. In all cases, developers wrapped the basic quad-word type inside a "class" or "struct" instead of declaring it purely, as follows:

    class Vec4
    {   
        ...
    private:
        __m128 xyzw;
    };

This type of data encapsulation is a common practice among C++ developers to make the architecture of the software robust. The data is protected and can be accessed only by the class interface functions. Nonetheless, this design causes code bloat with many different compilers on different platforms, especially if some sort of GCC port is being used.

An approach that is much friendlier to the compiler is to declare the vector data "purely", as follows:

    typedef __m128 Vec4;

ENDQUOTE;
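
To make the two quoted points concrete, here is a minimal sketch that combines them: a plain typedef plus functions that take and return the vector by value, so calls can be nested without going through memory. It assumes an x86 target with SSE; `VMul`, `VMulAdd` and `VStore` are hypothetical names I made up for illustration and are not taken from the article.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps, _mm_storeu_ps

// "Pure" declaration: a plain alias, no wrapping class or struct.
typedef __m128 Vec4;

// Each operation takes and returns Vec4 by value, mirroring the intrinsics.
inline Vec4 VAdd(Vec4 va, Vec4 vb) { return _mm_add_ps(va, vb); }
inline Vec4 VMul(Vec4 va, Vec4 vb) { return _mm_mul_ps(va, vb); }

// Hypothetical helper: va * vb + vc, built purely by nesting by-value calls.
inline Vec4 VMulAdd(Vec4 va, Vec4 vb, Vec4 vc) { return VAdd(VMul(va, vb), vc); }

// Hypothetical helper: write the four lanes to memory only when the caller asks.
inline void VStore(float out[4], Vec4 v) { _mm_storeu_ps(out, v); }
```

A call such as `VStore(out, VMulAdd(a, b, c))` has exactly one explicit store, at the very end. Whether the intermediates really stay in registers is still up to the compiler and optimization level, so it is worth checking the generated assembly.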




The article is 2 years old, but it appears my earlier performance issue wasn't D related at all; it is an issue in C as well. I think in this situation it might be best (most optimized) to handle SIMD "the C way" by creating an alias or union of a SIMD intrinsic. D has a big advantage over C/C++ here because of UFCS, in that we can write external functions that appear no different from encapsulated object methods. That, combined with public aliasing, means the end user only sees our pretty functions, but we're not sacrificing performance at all.
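
As a rough sketch of the "external functions instead of methods" idea, here is what the things a wrapper class would normally provide (accessors, a dot product) could look like as free functions over the plain alias. It is written in C++ to match the quoted article and assumes an x86 target with SSE; `VGetX`, `VGetY` and `VDot` are hypothetical names, not part of any existing library. C++ has no UFCS, so the calls are spelled `VDot(a, b)`; the equivalent free function in D could be called as `a.VDot(b)`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_shuffle_ps, _mm_cvtss_f32

typedef __m128 Vec4;  // plain alias, as in the quoted article

// Lane 0 can be read straight out of the register.
inline float VGetX(Vec4 v) { return _mm_cvtss_f32(v); }

// For another lane, broadcast it to every lane first, then read lane 0.
inline float VGetY(Vec4 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
}

// 4-component dot product; the sum is returned in every lane so the
// result is still a Vec4 and can feed further vector operations.
inline Vec4 VDot(Vec4 va, Vec4 vb)
{
    Vec4 m = _mm_mul_ps(va, vb);                                   // per-lane products
    Vec4 swapped = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));  // (m1, m0, m3, m2)
    Vec4 pairs = _mm_add_ps(m, swapped);                           // (m0+m1, m0+m1, m2+m3, m2+m3)
    Vec4 halves = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 0, 3, 2)); // swap low/high halves
    return _mm_add_ps(pairs, halves);                              // total in all four lanes
}
```

The user-facing surface is just these function names, while the data itself remains a bare `__m128`.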
