On Thursday, 19 May 2016 at 12:04:31 UTC, Andrei Alexandrescu wrote:
On 5/19/16 4:12 AM, Jens Müller wrote:

What test data did you use?

An instance for benchmarking is generated as follows, given nnz, the combined number of non-zero entries of the input vectors a and b:

---
auto lengthA = uniform!"[]"(0, nnz, gen);
auto lengthB = nnz - lengthA;

auto a = iota(0, nnz).randomSample(lengthA, gen)
                     .map!(i => Pair(i, 10)).array();
auto b = iota(0, nnz).randomSample(lengthB, gen)
                     .map!(i => Pair(i, 10)).array();
---

So I take a random sample of {0, ..., nnz - 1} for each input.
Any better ideas? I've seen that people generate vectors with larger gaps.

10%-20% win on dot product is significant because for many algorithms dot product is a kernel and dominates everything else. For those any win goes straight to the bottom line.

Sure. Still, I wasn't sure whether I got the idea from your talk. So maybe there is/was more to it.

The base line (dot1 in the graphs) is the straightforward version

---
size_t i,j = 0;
double sum = 0;
while (i < a.length && j < b.length)
{
     if (a[i].index < b[j].index) i++;
     else if (a[i].index > b[j].index) j++;
     else
     {
         assert(a[i].index == b[j].index);
         sum += a[i].value * b[j].value;
         i++;
         j++;
     }
}
return sum;
---

That does redundant checking. There's a better baseline:

---
if (a.length == 0 || b.length == 0)
    return 0;
const amax = a.length - 1, bmax = b.length - 1;
size_t i,j = 0;
double sum = 0;
for (;;)
{
    if (a[i].index < b[j].index) {
        if (i++ == amax) break;
    }
    else if (a[i].index > b[j].index) {
        bumpJ: if (j++ == bmax) break;
    }
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        if (i++ == amax) break;
        goto bumpJ;
    }
}
return sum;
---

I'll check that out.

Then if you add the sentinel you only need the bounds tests in the third case.

I'll post the sentinel code later. Probably there is something to improve there as well.

BTW, the effects vary greatly across compilers. For example, with dmd the optimized version is slowest and the baseline is best. Weird. With gdc the optimized version is best, and gdc's code is always faster than dmd's. With ldc it's really strange: slower than dmd. I assume I'm doing something wrong here.

Compiler flags used:
dmd v2.071.0
-wi -dw -g -O -inline -release -noboundscheck
gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
-Wall  -g -O3 -fomit-frame-pointer -finline-functions -frelease
-fno-bounds-check -ffast-math
ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
-wi -dw -g -O3 -enable-inlining -release -boundscheck=off

Am I missing some flags?

These look reasonable.

But ldc looks so bad.
Any comments from ldc users or developers? I see this in many other measurements as well. I would love to have another compiler that produces code as efficient as gdc's.
For example, what's the equivalent of gdc's -ffast-math in ldc?

I uploaded my plots.
- running time https://www.scribd.com/doc/312951947/Running-Time
- speed up https://www.scribd.com/doc/312951964/Speedup

What is dot2? Could you please redo the experiments with the modified code as well?

dot2 is an optimization for jumping over gaps more quickly: the first two if statements are replaced with while statements. My benchmark data has no large gaps, yet interestingly it makes things worse.

Jens
