On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
On 05/05/2014 02:38 PM, Kapps wrote:
> I think that the GC actually blocks when
> creating objects, and thus multiple threads creating instances would not
> provide a significant speedup, possibly even a slowdown.
Wow! That is the case. :)
> You'd want to benchmark this to be certain it helps.
I did:
import std.range;
import std.parallelism;

class C
{}

void foo()
{
    auto c = new C;
}

void main(string[] args)
{
    enum totalElements = 10_000_000;
    if (args.length > 1) {
        foreach (i; iota(totalElements).parallel) {
            foo();
        }
    } else {
        foreach (i; iota(totalElements)) {
            foo();
        }
    }
}
Typical run on my system for "-O -noboundscheck -inline":
$ time ./deneme parallel
real 0m4.236s
user 0m4.325s
sys 0m9.795s
$ time ./deneme
real 0m0.753s
user 0m0.748s
sys 0m0.003s
Ali
Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is
unfortunately still 2.055...) and got similar results. I also
tried creating only totalCPUs threads and having each of them
create NUM_ELEMENTS / totalCPUs objects rather than risking
that each creation was a task, and it still seems to be the
same.
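For reference, the chunked variant described above can be sketched roughly like this. It is not the code I actually ran; the names allocateMany, makeWorker and runChunked are made up for illustration, and it assumes plain core.thread threads rather than the std.parallelism task pool:

```d
import core.thread : Thread;
import std.parallelism : totalCPUs;

class C {}

// Each worker allocates `count` objects in a plain loop.
void allocateMany(size_t count)
{
    foreach (i; 0 .. count)
    {
        auto c = new C;
    }
}

// Building the thread in a helper gives every delegate its own copy of
// `count`, instead of sharing one loop variable between all closures.
Thread makeWorker(size_t count)
{
    return new Thread({ allocateMany(count); });
}

// Splits `total` allocations across `workers` threads and waits for all
// of them. Returns the number of allocations requested; the remainder
// of the division goes to the last worker.
size_t runChunked(size_t total, size_t workers)
{
    assert(workers > 0);
    auto threads = new Thread[workers];
    size_t requested = 0;
    foreach (w; 0 .. workers)
    {
        size_t count = total / workers;
        if (w == workers - 1)
            count += total % workers;
        requested += count;
        threads[w] = makeWorker(count);
        threads[w].start();
    }
    foreach (t; threads)
        t.join();
    return requested;
}

void main()
{
    enum totalElements = 10_000_000;
    runChunked(totalElements, totalCPUs);
}
```

With this layout there are only totalCPUs units of work, so per-task scheduling overhead cannot explain the slowdown; whatever contention remains comes from the allocations themselves.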
snip
I tried using an allocator that never releases memory,
rounds sizes up to a power of 2, and is lock-free. The results
are quite a bit better.
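A minimal sketch of that kind of allocator, to show the idea. This is not the code from the paste linked below; the arena size, the 16-byte minimum block and the names roundUpPow2, initArena, bumpAllocate and create are all assumptions made for illustration:

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;
import std.conv : emplace;

class C {}

enum size_t arenaSize = 64 * 1024 * 1024; // assumed size, never grown

__gshared ubyte* arena;        // set once, before any worker thread starts
shared size_t arenaOffset = 0; // bumped atomically, never decremented

// Rounds n up to the next power of two, with a minimum of 16 so every
// block stays 16-byte aligned. Not guarded against overflow for huge n.
size_t roundUpPow2(size_t n)
{
    size_t p = 16;
    while (p < n)
        p <<= 1;
    return p;
}

void initArena()
{
    arena = cast(ubyte*) malloc(arenaSize);
}

// Lock-free bump allocation: one atomic add reserves the block.
// Returns null when the arena is missing or exhausted. Nothing is freed.
void* bumpAllocate(size_t size)
{
    immutable rounded = roundUpPow2(size);
    immutable end = atomicOp!"+="(arenaOffset, rounded); // value after the add
    if (arena is null || end > arenaSize)
        return null;
    return arena + (end - rounded);
}

// Constructs a C inside arena memory instead of on the GC heap.
C create()
{
    enum size = __traits(classInstanceSize, C);
    auto mem = bumpAllocate(size);
    if (mem is null)
        return null;
    return emplace!C(mem[0 .. size]);
}
```

Since the only shared state is one counter updated with a single atomic add, threads never wait on a lock. The trade-off is that memory is never reclaimed, and the GC does not scan the arena, so objects placed there should not hold the only reference to GC-allocated data.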
shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs
I get similar results on my laptop (which are much faster than the
results I got on it using DMD's malloc):
test
1 sec, 125 ms, and 847 μs
test
1 sec, 125 ms, 741 μs, and 6 hnsecs
test tasks
556 ms, 613 μs, and 8 hnsecs
test tasks
552 ms and 287 μs
test parallel
554 ms, 542 μs, and 6 hnsecs
test parallel
551 ms, 514 μs, and 9 hnsecs
Code:
http://pastie.org/9146326
Unfortunately it doesn't compile with the ancient version of gdc
available in Debian, so I couldn't test with that. The results
should be quite a bit better since core.atomic would be faster.
And frankly, I'm not sure if the allocator actually works
properly, but it's just for testing purposes anyways.