Hello,
I am summing the first billion integers both in parallel and in a
single thread, and I'm seeing some curious results:
parallel sum : 499999999500000000, elapsed 102833 ms
single thread sum : 499999999500000000, elapsed 1667 ms
The parallel version is 60+ times slower on my i7-3770K CPU. I
think that may be due to the cores constantly invalidating and
reloading each other's cache lines, since every thread hammers the
same shared variable, but I don't know for sure.
Here is the D code:
import core.atomic;
import std.datetime;
import std.parallelism;
import std.range;
import std.stdio;

void main()
{
    shared ulong sum = 0;
    ulong iter = 1_000_000_000UL;

    StopWatch sw;
    sw.start();
    foreach (i; parallel(iota(0UL, iter)))
    {
        atomicOp!"+="(sum, i);
    }
    sw.stop();
    writefln("parallel sum : %s, elapsed %s ms", sum, sw.peek().msecs);

    sum = 0;
    sw.reset();
    sw.start();
    for (ulong i = 0; i < iter; ++i)
    {
        sum += i;
    }
    sw.stop();
    writefln("single thread sum : %s, elapsed %s ms", sum, sw.peek().msecs);
}
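For what it's worth, if the slowdown really is the cores fighting over
that one shared counter, I'd expect a reduce-based version to avoid it,
since each worker keeps a private partial sum and the partials are only
combined once at the end. An untested sketch using std.parallelism's
taskPool.reduce:

import std.parallelism;
import std.range;

void main()
{
    // No shared variable: each worker accumulates its own partial
    // sum over a slice of the range, and the partials are merged
    // with "a + b" when all workers finish.
    ulong total = taskPool.reduce!"a + b"(iota(0UL, 1_000_000_000UL));
    assert(total == 499_999_999_500_000_000UL);
}

I haven't benchmarked this, so I don't know how it compares.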
Out of curiosity I tried the equivalent code in C# and got this:
parallel sum : 499999999500000000, elapsed 20320 ms
single thread sum : 499999999500000000, elapsed 1901 ms
The C# parallel version is about 5 times faster than the D one,
which is strange given it's the exact same CPU.
And here is the C# code:
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        long sum = 0;
        long iter = 1000000000L;

        var sw = Stopwatch.StartNew();
        Parallel.For(0, iter, i =>
        {
            Interlocked.Add(ref sum, i);
        });
        Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);

        sum = 0;
        sw = Stopwatch.StartNew();
        for (long i = 0; i < iter; ++i)
        {
            sum += i;
        }
        Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);
    }
}
Thoughts?