On Tuesday, 16 June 2015 at 16:37:35 UTC, John Colvin wrote:
> If you want really fast exponentiation of an array though, you
> want to use SIMD. Something like http://www.yeppp.info would be
> easy to use from D.
I've been looking into SIMD a little. It turns out that core.simd
only works with DMD on Linux machines. I'm not sure about the
other compilers, and I was stuck on that for a while. I also did
some reading on SIMD, since I had no real understanding of it
before you mentioned it. At least now I understand why all the
types in core.simd are so small. My initial reaction was that
there's no way I would want to write code just for float[4], but
now I'm like "oh, that's the whole point".
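To make the float[4] point concrete, here is a minimal sketch of
one vector add covering four floats at once. It assumes a compiler
and target where core.simd defines float4; the static if falls
back to ordinary array operations elsewhere. The function name
addFourLanes and the input values are made up for illustration.

```d
import core.simd;
import std.stdio : writeln;

// Hypothetical helper: adds two groups of four floats.
float[4] addFourLanes()
{
    static if (is(float4))
    {
        // float4 is available: a single vector add handles all four lanes.
        float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
        float4 b = [10.0f, 20.0f, 30.0f, 40.0f];
        float4 c = a + b;
        return c.array; // view the vector as a plain float[4]
    }
    else
    {
        // No float4 on this compiler/target: use array operations instead.
        float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
        float[4] b = [10.0f, 20.0f, 30.0f, 40.0f];
        float[4] c;
        c[] = a[] + b[];
        return c;
    }
}

void main()
{
    writeln(addFourLanes());
}
```

Note that core.simd only provides the vector types and basic
arithmetic. It has no vectorized exp, which is presumably why a
library such as Yeppp! was suggested for the exponentiation itself.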
Anyway, I might try to put something together on my other machine
one of these days, but I was able to make a little bit more
progress with D's std.parallelism. The parallel foreach loops work
great, even on Windows, with little extra work required.
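A minimal sketch of the kind of parallel foreach meant here, using
std.parallelism.parallel over a preallocated output array. The
function name parallelExp and the sample input are assumptions for
illustration, not the code that was actually timed.

```d
import std.math : exp;
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical helper: computes exp of every element of xs,
// splitting the loop iterations across the default task pool.
double[] parallelExp(const double[] xs)
{
    auto ys = new double[xs.length];
    // Each iteration writes only its own element of ys, so the
    // iterations are independent and safe to run in parallel.
    foreach (i, ref y; parallel(ys))
        y = exp(xs[i]);
    return ys;
}

void main()
{
    auto ys = parallelExp([0.0, 1.0, 2.0, 3.0]);
    writeln(ys);
}
```

The loop body looks the same as in a serial foreach; only the
range is wrapped in parallel(). That only works when iterations do
not depend on one another.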
That being said, I'm not seeing any speed-up from the parallel
map. The code below times some variations on std.algorithm.map
and taskPool.map. The more memory allocation there is (through
.array), the longer everything takes. Keeping things as ranges
seems to be much faster.
The most interesting result to me was that taskPool.map was
slower than std.algorithm.map in each case. Maybe that's the
difference between taskPool.map being semi-eager and
std.algorithm.map being lazy. The code below doesn't show it, but
it seems like the parallel foreach loop is faster than either
std.algorithm.map or taskPool.map when doing everything with
arrays.
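import std.algorithm; // needed for the fully qualified std.algorithm.map calls
import std.datetime;  // benchmark (newer Phobos: std.datetime.stopwatch)
import std.parallelism;
import std.conv : to;
import std.math : exp;
import std.stdio : writeln;
import std.array : array;
import std.range : iota;

enum real x_size = 100_000;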
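
// serial map over a range, result left as a range
void f0()
{
    auto y = std.algorithm.map!(a => exp(a))(iota(x_size));
}

// parallel map over a range, result left as a range
void f1()
{
    auto y = taskPool.map!exp(iota(x_size));
}

// serial map over a range, result copied into an array
void f2()
{
    auto y = std.algorithm.map!(a => exp(a))(iota(x_size)).array;
}

// parallel map over a range, result copied into an array
void f3()
{
    auto y = taskPool.map!exp(iota(x_size)).array;
}

// serial map over an array, result left as a range
void f4()
{
    auto y = std.algorithm.map!(a => exp(a))(iota(x_size).array);
}

// parallel map over an array, result left as a range
void f5()
{
    auto y = taskPool.map!exp(iota(x_size).array);
}

// serial map over an array, result copied into an array
void f6()
{
    auto y = std.algorithm.map!(a => exp(a))(iota(x_size).array).array;
}

// parallel map over an array, result copied into an array
void f7()
{
    auto y = taskPool.map!exp(iota(x_size).array).array;
}
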
void main()
{
    auto r = benchmark!(f0, f1, f2, f3, f4, f5, f6, f7)(100);
    auto f0Result = to!Duration(r[0]);
    auto f1Result = to!Duration(r[1]);
    auto f2Result = to!Duration(r[2]);
    auto f3Result = to!Duration(r[3]);
    auto f4Result = to!Duration(r[4]);
    auto f5Result = to!Duration(r[5]);
    auto f6Result = to!Duration(r[6]);
    auto f7Result = to!Duration(r[7]);
    writeln(f0Result); // prints ~ 17us on my machine
    writeln(f1Result); // prints ~ 4.3ms on my machine
    writeln(f2Result); // prints ~ 1.7s on my machine
    writeln(f3Result); // prints ~ 3.5s on my machine
    writeln(f4Result); // prints ~ 471ms on my machine
    writeln(f5Result); // prints ~ 473ms on my machine
    writeln(f6Result); // prints ~ 1.9s on my machine
    writeln(f7Result); // prints ~ 3.9s on my machine
}
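
On the lazy side of the comparison, the sketch below counts how
often the mapped function runs before and after the result is
consumed. It suggests that a function like f0, which builds a
std.algorithm.map range and never reads from it, may be timing
little more than the construction of the range. The helper name
countCalls and the doubling lambda are hypothetical, chosen only
to make the calls observable.

```d
import std.algorithm : map;
import std.array : array;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical helper: returns how many times the mapped function
// has run (1) right after building the lazy range and (2) after
// forcing it with .array.
int[2] countCalls(int n)
{
    int calls = 0;
    auto lazyRange = iota(n).map!((a) { ++calls; return a * 2; });
    int afterConstruct = calls; // nothing has been consumed yet
    auto forced = lazyRange.array; // walks the whole range
    int afterArray = calls;
    return [afterConstruct, afterArray];
}

void main()
{
    auto counts = countCalls(5);
    writeln("calls after constructing the map: ", counts[0]);
    writeln("calls after .array:               ", counts[1]);
}
```

If that reading is right, the f0 and f4 timings mostly reflect the
cost of producing the input (a range versus an allocated array),
and only the variants ending in .array pay for the exp calls.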