On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
I've updated the project with your suggestions at http://dpaste.dzfl.pl/fce2d93b but still get the same performance. Vectors defined in the benchmark function body, no function calling overhead, etc. See some of my comments below btw:

First of all, calcSIMD and calcScalar are virtual functions so they can't be inlined, which prevents any further optimization.

From the dlang docs: Member functions which are private or package are never virtual, and hence cannot be overridden.
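A minimal sketch of that rule, in case it helps. The class and function names here (Bench, calcVirtual, calcFinal, calcPrivate) are made up for illustration and are not from the benchmark. The static asserts use the compiler's own __traits(isVirtualMethod, ...) check, so the claim is verified at compile time rather than taken on faith:

```d
class Bench
{
    int counter;

    // Public, non-final member functions of a class are virtual by default.
    void calcVirtual() { counter += 1; }

    // `final` keeps the function public but makes it non-virtual.
    final void calcFinal() { counter += 1; }

    // Private member functions are never virtual.
    private void calcPrivate() { counter += 1; }
}

// Compile-time checks of the three cases above.
static assert( __traits(isVirtualMethod, Bench.calcVirtual));
static assert(!__traits(isVirtualMethod, Bench.calcFinal));
static assert(!__traits(isVirtualMethod, Bench.calcPrivate));

void main()
{
    auto b = new Bench;
    b.calcVirtual();
    b.calcFinal();
    b.calcPrivate(); // accessible here because we are in the same module
    assert(b.counter == 3);
}
```

Being non-virtual only makes inlining possible; whether the compiler actually inlines the call still depends on the compiler and its flags.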

So my guess is that the first four multiplications and the second four multiplications in calcScalar are done in parallel. ... The reason it's faster is that gdc replaces multiplication by 2 with addition and omits multiplication by 1.

I've changed the multiplies of 2 and 1 to 2.1 and 1.01 respectively. Still no performance difference between the two for me.

The multipliers 2 and 1 were the reason why the scalar code performed a little bit better than the SIMD code when compiled with GDC. The main reason why the scalar code isn't much slower than the SIMD code is instruction level parallelism. Because the first four operations in calcScalar are independent (none of them depends on the result of any of the other three), modern x86-64 processors can execute them in parallel. Because of that, the speed of your program is limited by instruction latency and not throughput. That's why it doesn't really make a difference that the scalar version does four times as many operations.
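To see the latency vs. throughput effect in isolation, here is a rough scalar sketch. It is not your benchmark: the function names (oneChain, fourChains), the iteration count and the multiplier are all made up for illustration. One loop has a single dependency chain, the other has four independent chains and so does four times as many multiplications per iteration:

```d
import std.stdio : writefln;
import std.datetime.stopwatch : AutoStart, StopWatch;

// One dependency chain: every multiply needs the previous result,
// so the loop is bound by the latency of a single multiply.
float oneChain(size_t n, float m)
{
    float a = 1.0f;
    foreach (i; 0 .. n)
        a *= m;
    return a;
}

// Four independent chains: no multiply depends on another chain,
// so the CPU is free to overlap them.
float fourChains(size_t n, float m)
{
    float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
    foreach (i; 0 .. n)
    {
        a *= m;
        b *= m;
        c *= m;
        d *= m;
    }
    return a + b + c + d;
}

void main()
{
    enum size_t n = 100_000_000;  // hypothetical iteration count
    float m = 1.0000001f;         // hypothetical multiplier, close to 1 to avoid overflow

    auto sw = StopWatch(AutoStart.yes);
    immutable r1 = oneChain(n, m);
    immutable t1 = sw.peek.total!"msecs";

    sw.reset();
    immutable r4 = fourChains(n, m);
    immutable t4 = sw.peek.total!"msecs";

    // Printing the results keeps the compiler from discarding the loops.
    writefln("one chain:   %s ms (result %s)", t1, r1);
    writefln("four chains: %s ms (result %s)", t4, r4);
}
```

If the explanation above holds on your machine, the two timings should come out fairly close even though fourChains does four times the work. The exact numbers depend on the compiler, the flags and the CPU, so treat this as an experiment to run rather than a guaranteed result.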

You can also take advantage of instruction level parallelism when using SIMD. For example, I get about the same number of iterations per second for the following two functions (when using GDC):

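        import gcc.attribute;

        // Four independent dependency chains (s0 .. s3): the CPU can overlap them.
        @attribute("forceinline") void calcSIMD1() {
                s0 = s0 * i0;
                s0 = s0 * d0;
                s1 = s1 * i1;
                s1 = s1 * d1;
                s2 = s2 * i2;
                s2 = s2 * d2;
                s3 = s3 * i3;
                s3 = s3 * d3;
        }

        // A single dependency chain: the second multiply waits for the first.
        @attribute("forceinline") void calcSIMD2() {
                s0 = s0 * i0;
                s0 = s0 * d0;
        }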

By the way, if performance is very important to you, you should try GDC (or LDC, but I don't think LDC is currently fully usable on Windows).
