On Thursday, 19 May 2016 at 12:04:31 UTC, Andrei Alexandrescu
wrote:
On 5/19/16 4:12 AM, Jens Müller wrote:
What test data did you use?
An instance for benchmarking is generated as follows, given nnz,
the combined number of non-zero entries in the input vectors a and b:
---
auto lengthA = uniform!"[]"(0, nnz, gen);
auto lengthB = nnz - lengthA;
auto a = iota(0, nnz).randomSample(lengthA, gen)
    .map!(i => Pair(i, 10)).array();
auto b = iota(0, nnz).randomSample(lengthB, gen)
    .map!(i => Pair(i, 10)).array();
---
So I take a random sample of (0, ..., nnz) for each input.
Any better idea? I've seen that people generate vectors with
larger gaps.
10%-20% win on dot product is significant because for many
algorithms dot product is a kernel and dominates everything
else. For those any win goes straight to the bottom line.
Sure. Still, I wasn't sure whether I got the idea from your talk,
so maybe there is/was more to it.
The baseline (dot1 in the graphs) is the straightforward version
---
size_t i, j = 0;
double sum = 0;
while (i < a.length && j < b.length)
{
    if (a[i].index < b[j].index) i++;
    else if (a[i].index > b[j].index) j++;
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        i++;
        j++;
    }
}
return sum;
---
That does redundant checking. There's a better baseline:
---
if (a.length == 0 || b.length == 0)
    return 0;
const amax = a.length - 1, bmax = b.length - 1;
size_t i, j = 0;
double sum = 0;
for (;;)
{
    if (a[i].index < b[j].index) {
        if (i++ == amax) break;
    }
    else if (a[i].index > b[j].index) {
    bumpJ:
        if (j++ == bmax) break;
    }
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        if (i++ == amax) break;
        goto bumpJ;
    }
}
return sum;
---
I'll check that out.
Then if you add the sentinel you only need the bounds tests in
the third case.
I'll post the sentinel code later. Probably there is something to
improve there as well.
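Roughly, the idea as I understand it (a sketch, not the measured
code; Pair is the index/value struct from the benchmark above):
append a sentinel pair whose index compares greater than every real
index, so the two comparison branches can advance without bounds
checks and only the equal-index branch has to detect termination.

---
double dotSentinel(Pair[] a, Pair[] b)
{
    if (a.length == 0 || b.length == 0)
        return 0;
    // Sentinels: size_t.max is larger than any real index,
    // so both scans stop here at the same time.
    a ~= Pair(size_t.max, 0);
    b ~= Pair(size_t.max, 0);
    size_t i, j = 0;
    double sum = 0;
    for (;;)
    {
        if (a[i].index < b[j].index) i++;       // no bounds check needed
        else if (a[i].index > b[j].index) j++;  // no bounds check needed
        else
        {
            if (a[i].index == size_t.max) break; // both hit the sentinel
            sum += a[i].value * b[j].value;
            i++;
            j++;
        }
    }
    return sum;
}
---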
BTW the effects vary greatly for different compilers. For example,
with dmd the optimized version is slowest; the baseline is best.
Weird. With gdc the optimized version is best, and gdc's code is
always faster than dmd's. With ldc it's really strange: slower than
dmd. I assume I'm doing something wrong here.
Used compiler flags
dmd v2.071.0
-wi -dw -g -O -inline -release -noboundscheck
gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
-Wall -g -O3 -fomit-frame-pointer -finline-functions -frelease
-fno-bounds-check -ffast-math
ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
-wi -dw -g -O3 -enable-inlining -release -boundscheck=off
Am I missing some flags?
These look reasonable.
But ldc looks so bad.
Any comments from ldc users or developers? I see this in many other
measurements as well, and I would love to have another compiler
producing code as efficient as gdc's.
For example, what is the equivalent of gdc's -ffast-math in ldc?
I uploaded my plots.
- running time
https://www.scribd.com/doc/312951947/Running-Time
- speed up https://www.scribd.com/doc/312951964/Speedup
What is dot2? Could you please redo the experiments with the
modified code as well?
dot2 is an optimization for jumping over gaps more quickly,
replacing the first two if statements with while statements.
My benchmark data has no large gaps, though, and interestingly
dot2 makes things worse there rather than just breaking even.
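For reference, dot2 looks roughly like this (a sketch of the idea,
not the exact benchmarked code): the inner while loops skip runs of
smaller indices in one go instead of re-entering the outer loop for
every step.

---
size_t i, j = 0;
double sum = 0;
while (i < a.length && j < b.length)
{
    // Skip over a gap in a, then over a gap in b.
    while (i < a.length && a[i].index < b[j].index) i++;
    if (i == a.length) break;
    while (j < b.length && b[j].index < a[i].index) j++;
    if (j == b.length) break;
    // The second while may have overshot, so equality is not guaranteed.
    if (a[i].index == b[j].index)
    {
        sum += a[i].value * b[j].value;
        i++;
        j++;
    }
}
return sum;
---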
Jens