On Tuesday, 16 June 2015 at 16:37:35 UTC, John Colvin wrote:
If you want really fast exponentiation of an array though, you want to use SIMD. Something like http://www.yeppp.info would be easy to use from D.

I've been looking into SIMD a little. It turns out that core.simd only works with DMD on Linux machines; I'm not sure about the other compilers. That had me stuck for a while. I also did some reading on SIMD, since I had no real understanding of it before you mentioned it. At least now I understand why all the types in core.simd are so small. My initial reaction was that there's no way I would want to write code just for float[4], but now I'm like "oh, that's the whole point".
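
To make the float[4] point concrete, here is a minimal sketch of adding two 4-lane vectors. The function name simdAddDemo is made up for illustration, and the static if guard is an assumption about how to stay portable: where float4 is not defined for the target, it falls back to an ordinary array operation.

```d
import core.simd;

// Hypothetical demo function (not from the original post).
float[4] simdAddDemo()
{
    static if (is(float4))
    {
        // float4 exists for this compiler/target: a single vector add
        // operates on all four lanes at once.
        float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
        float4 b = [10.0f, 20.0f, 30.0f, 40.0f];
        float4 c = a + b;
        return c.array;
    }
    else
    {
        // Fallback when core.simd has no float4 here: element-wise array op.
        float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
        float[4] b = [10.0f, 20.0f, 30.0f, 40.0f];
        float[4] c;
        c[] = a[] + b[];
        return c;
    }
}
```

Processing a long array this way means walking it four floats at a time, which is why the vector types themselves stay so small.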

Anyway, I might try to put something together on my other machine one of these days. In the meantime, I was able to make a bit more progress with D's std.parallelism. The parallel foreach loops work great, even on Windows, with little extra work required.
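
For reference, a minimal sketch of the kind of parallel foreach loop meant here. The helper name parallelExp and the use of double are assumptions for illustration, not the code I actually timed.

```d
import std.math : exp;
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical helper: fills a preallocated array with exp(x) using a
// parallel foreach over the input.
double[] parallelExp(double[] xs)
{
    auto result = new double[](xs.length);
    // Each iteration writes only to its own slot, so iterations are independent.
    foreach (i, x; parallel(xs))
        result[i] = exp(x);
    return result;
}

void main()
{
    writeln(parallelExp([0.0, 1.0, 2.0]));
}
```

The output array is allocated once up front, so the loop body itself does no allocation.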

That being said, I'm not seeing any speed-up from the parallel map. The code below tries some variations on std.algorithm.map and taskPool.map. The more memory gets allocated (through .array), the longer everything takes. Keeping things as ranges seems to be much faster.

The most interesting result to me was that taskPool.map was slower than std.algorithm.map in each case. Maybe that comes down to taskPool.map being semi-eager while std.algorithm.map is lazy. The code below doesn't show it, but the parallel foreach loop seems to be faster than either std.algorithm.map or taskPool.map when doing everything with arrays.



import std.algorithm;   // needed for the fully qualified std.algorithm.map calls
import std.datetime;
import std.parallelism;
import std.conv : to;
import std.math : exp;
import std.stdio : writeln;
import std.array : array;
import std.range : iota;

enum real x_size = 100_000;

void f0()
{
        auto y = std.algorithm.map!(a => exp(a))(iota(x_size));
}

void f1()
{
        auto y = taskPool.map!exp(iota(x_size));
}

void f2()
{
        auto y = std.algorithm.map!(a => exp(a))(iota(x_size)).array;
}

void f3()
{
        auto y = taskPool.map!exp(iota(x_size)).array;
}

void f4()
{
        auto y = std.algorithm.map!(a => exp(a))(iota(x_size).array);
}

void f5()
{
        auto y = taskPool.map!exp(iota(x_size).array);
}

void f6()
{
        auto y = std.algorithm.map!(a => exp(a))(iota(x_size).array).array;
}

void f7()
{
        auto y = taskPool.map!exp(iota(x_size).array).array;
}

void main() {
        auto r = benchmark!(f0, f1, f2, f3, f4, f5, f6, f7)(100);
        auto f0Result = to!Duration(r[0]);
        auto f1Result = to!Duration(r[1]);
        auto f2Result = to!Duration(r[2]);
        auto f3Result = to!Duration(r[3]);
        auto f4Result = to!Duration(r[4]);
        auto f5Result = to!Duration(r[5]);
        auto f6Result = to!Duration(r[6]);
        auto f7Result = to!Duration(r[7]);
        writeln(f0Result);                      //prints ~ 17us on my machine
        writeln(f1Result);                      //prints ~ 4.3ms on my machine
        writeln(f2Result);                      //prints ~ 1.7s on my machine
        writeln(f3Result);                      //prints ~ 3.5s on my machine
        writeln(f4Result);                      //prints ~ 471ms on my machine
        writeln(f5Result);                      //prints ~ 473ms on my machine
        writeln(f6Result);                      //prints ~ 1.9s on my machine
        writeln(f7Result);                      //prints ~ 3.9s on my machine
}
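
On the lazy versus semi-eager guess: a small sketch that counts how often the mapped function actually runs. The counter calls and the function countedSquare are made up for illustration. std.algorithm.map does no work until the range is consumed, so a function like f0, which never consumes its result, is most likely timing only the construction of the range.

```d
import std.algorithm : map;
import std.array : array;
import std.range : iota;

int calls;   // hypothetical counter, incremented on every call

// Hypothetical mapped function that records each invocation.
int countedSquare(int a)
{
    ++calls;
    return a * a;
}

// Builds the lazy range but never reads from it.
int callsAfterBuildingRange()
{
    calls = 0;
    auto lazyRange = iota(5).map!countedSquare;
    return calls;
}

// Forces evaluation by copying the range into an array.
int callsAfterArray()
{
    calls = 0;
    auto values = iota(5).map!countedSquare.array;
    return calls;
}
```

Building the range alone leaves the counter at zero, while .array has to call the function for every element. taskPool.map, by contrast, starts filling a buffer of results as soon as it is created, so f1 does real work that f0 skips.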
