On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
I've updated the project with your suggestions at
http://dpaste.dzfl.pl/fce2d93b but still get the same
performance. Vectors defined in the benchmark function body, no
function calling overhead, etc. See some of my comments below
btw:
First of all, calcSIMD and calcScalar are virtual functions so
they can't be inlined, which prevents any further optimization.
From the dlang docs: "Member functions which are private or
package are never virtual, and hence cannot be overridden."
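A minimal sketch of the rule quoted above. The class `Bench` and its members are hypothetical and are not the code from the dpaste. It assumes a compiler that supports `__traits(isVirtualMethod, ...)`. Public member functions are virtual unless marked `final`, while private ones are never virtual:

```d
class Bench
{
    // Public, non-final member functions are virtual by default.
    float calcVirtual(float x) { return x * 2.1f; }

    // `final` on a function that overrides nothing makes it non-virtual,
    // so the compiler is free to inline calls to it.
    final float calcFinal(float x) { return x * 2.1f; }

    // Private member functions are never virtual.
    private float calcPrivate(float x) { return x * 2.1f; }
}

// Checked at compile time; compilation fails if an assumption is wrong.
static assert( __traits(isVirtualMethod, Bench.calcVirtual));
static assert(!__traits(isVirtualMethod, Bench.calcFinal));
static assert(!__traits(isVirtualMethod, Bench.calcPrivate));

void main() {}
```

So either making the benchmark functions private or marking them `final` should remove the virtual call, though whether they actually get inlined is still up to the compiler.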
So my guess is that the first four multiplications and the
second four multiplications in calcScalar are done in
parallel. ... The reason it's faster is that gdc replaces
multiplication by 2 with addition and omits multiplication by
1.
I've changed the multipliers 2 and 1 to 2.1 and 1.01
respectively. Still no performance difference between the two
for me.
The multipliers 2 and 1 were the reason why the scalar code
performed a little better than the SIMD code when compiled with
GDC. The main reason why the scalar code isn't much slower than the
SIMD code is instruction-level parallelism. Because the first four
operations in calcScalar are independent (none of them depends on
the result of any of the other three), modern x86-64 processors
can execute them in parallel. Because of that, the speed of your
program is limited by instruction latency and not throughput.
That's why it doesn't really make a difference that the scalar
version does four times as many operations.
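To illustrate the latency-versus-throughput point, here is a rough sketch with two hypothetical functions (the names and parameters are made up for this example and are not from the benchmark). Both perform four multiplications per loop iteration, but in the first each multiplication has to wait for the previous one, while in the second the four multiplications are independent and the processor can overlap them:

```d
import std.stdio : writeln;

// One dependency chain: every multiplication needs the previous result,
// so each one has to wait for the full latency of the one before it.
float dependentChain(float x, float m, size_t n)
{
    foreach (i; 0 .. n)
    {
        x = x * m;
        x = x * m;
        x = x * m;
        x = x * m;
    }
    return x;
}

// Four independent chains: within one iteration no multiplication
// depends on another, so the CPU can have all four in flight at once.
float independentChains(float x, float m, size_t n)
{
    float a = x, b = x, c = x, d = x;
    foreach (i; 0 .. n)
    {
        a = a * m;
        b = b * m;
        c = c * m;
        d = d * m;
    }
    return a + b + c + d;
}

void main()
{
    writeln(dependentChain(1.0f, 2.0f, 3));
    writeln(independentChains(1.0f, 2.0f, 3));
}
```

If you time these with a large n, I would expect the second one to be noticeably faster per multiplication on a typical x86-64 CPU, but the exact numbers depend on the processor and on what the optimizer does with the loops.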
You can also take advantage of instruction-level parallelism when
using SIMD. For example, I get about the same number of
iterations per second for the following two functions (when using
GDC):
    import gcc.attribute;

    // Four independent dependency chains (s0..s3), eight multiplications.
    @attribute("forceinline") void calcSIMD1() {
        s0 = s0 * i0;
        s0 = s0 * d0;
        s1 = s1 * i1;
        s1 = s1 * d1;
        s2 = s2 * i2;
        s2 = s2 * d2;
        s3 = s3 * i3;
        s3 = s3 * d3;
    }

    // A single dependency chain (s0), two multiplications.
    @attribute("forceinline") void calcSIMD2() {
        s0 = s0 * i0;
        s0 = s0 * d0;
    }
By the way, if performance is very important to you, you should
try GDC (or LDC, but I don't think LDC is currently fully usable
on Windows).