Hi,

I was playing with std.parallelism and implemented a parallel NQueens solver to count the number of solutions to the classical NQueens problem. I do limited parallel recursion using taskPool's parallel foreach, and at some level switch to a serial algorithm. I use return values / atomic variables to count the number of iterations / solutions, and then propagate the counts up via return values and aggregate them. (I might later switch to taskPool.reduce with a custom opAdd on some struct, or use taskPool.workerLocalStorage, which I think is an awesome concept.)
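
Roughly, the structure looks like this. This is a minimal, self-contained sketch rather than my actual code; the helper names, the cutoff depth, and N are made up for illustration:

import std.parallelism;
import core.atomic;

// Does placing a queen in `row` at the next column conflict with the
// queens already placed? (partial[c] = row of the queen in column c)
bool safe(const int[] partial, int row)
{
    foreach (col, r; partial)
    {
        immutable dist = cast(int)(partial.length - col);
        if (r == row || r - row == dist || row - r == dist)
            return false;
    }
    return true;
}

// Plain serial recursion for the lower levels.
ulong solveSerial(const int[] partial, const int[] available)
{
    if (available.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; available)
        if (safe(partial, row))
            count += solveSerial(partial ~ row,                            // copy + append
                                 available[0 .. i] ~ available[i + 1 .. $]); // copy all but one
    return count;
}

shared ulong totalSolutions;

// Parallel recursion for the top levels; each branch gets its own copies,
// so there is no sharing between tasks.
void solveParallel(const int[] partial, const int[] available, int depth)
{
    if (depth == 0)
    {
        atomicOp!"+="(totalSolutions, solveSerial(partial, available));
        return;
    }
    foreach (i, row; taskPool.parallel(available))
        if (safe(partial, row))
            solveParallel(partial ~ row,
                          available[0 .. i] ~ available[i + 1 .. $],
                          depth - 1);
}

void main()
{
    import std.array : array;
    import std.range : iota;
    import std.stdio : writeln;
    enum N = 10;
    solveParallel(null, iota(N).array, 2);
    writeln(totalSolutions); // 724 solutions for N = 10
}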

Anyhow, everything was fine on 1 or 2 threads (or maybe a few more) on my laptop with an old dual-core CPU. I was able to speed it up by exactly a factor of 2x.

I wanted to try it out on a bigger machine, so I used an Amazon AWS EC2 c4.8xlarge instance with 18 cores (36 hyperthreads/vCPUs), and the results were pretty terrible.

I was hardly able to utilize more than 3% of each CPU, and the program actually ran significantly slower, which was surprising given that there was no synchronization, no false sharing, and no overly fine-grained parallelism in the program.

strace -f was showing tons of futex operations, but I know that std.parallelism's parallel foreach implementation doesn't use synchronization in the main loop, just some atomic variable reads in case another thread broke out of the loop or threw an exception. That couldn't be the bottleneck.

I traced the problem to the array allocations in my implementation. Because I only want to iterate over the remaining possible rows, I keep int[] partialSolution and int[] availableRows, with the invariant that their intersection is empty and their union contains the integers 0 to N-1 (N = size of the problem).

Because partialSolution grows by 1 at each level of the recursion, I just copy it, append a single element, and pass it down recursively (either to a different thread or as a plain function call, depending on how deep in the recursion we already are). This could be solved with a singly-linked list sharing a common tail between all partialSolutions in different branches, but then the list would be reversed and I would lose random access, which is a little annoying (though it probably shouldn't hurt performance, as I am going to scan the entire partialSolution array anyway). And I would still need to allocate a new list node (which, granted, could be sped up using a thread-local free list).
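
For what it's worth, that shared-tail alternative would look something like this (Node and extend are hypothetical names, just to illustrate the trade-off):

// Hypothetical shared-tail list: each branch prepends its own node, and all
// branches extending the same prefix reuse the same tail (no copying).
struct Node
{
    int row;
    Node* next;   // the shared tail
}

// One small GC allocation per level instead of copying the whole array,
// but the resulting list is in reverse order and has no random access.
Node* extend(Node* tail, int row)
{
    return new Node(row, tail);
}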

An even bigger problem is availableRows, from which I need to remove elements, which equates to allocating a new array and copying all elements but one. This cannot be done with a list. And a COW tree would be too expensive, and would still require some allocations and possibly rebalancing, etc.

I found that this is indeed the problem, because if I allocate int[64] x = void; on the stack and then build newAvailableRows in a ScopeBuffer!int (from std.internal.scopebuffer) backed by it (which is safe, because I wait for the threads to finish before I exit the scope, and the threads only read from these arrays before making their own copies), I am able to run my NQueens implementation at full system utilization (36 hyperthreads at 100% each, for a speedup of about 21x), and it solves N=18 in 7 seconds (compared to about 2.5 minutes), with the parallel part going up to level 8 of the recursion (to improve load balancing).
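
For reference, the serial leaf with the stack buffer looks roughly like this. It is only a sketch assuming the ScopeBuffer API in std.internal.scopebuffer (constructor taking the stack array, put, length, slicing, free); being an internal module, its interface may differ between releases. solveSerialNoAlloc is a made-up name, and safe() is the helper from the first sketch above:

import std.internal.scopebuffer;

ulong solveSerialNoAlloc(const int[] partial, const int[] available)
{
    if (available.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; available)
    {
        if (!safe(partial, row))
            continue;

        // Stack-backed buffer instead of a GC allocation per branch. The
        // slice handed to the recursive call never outlives this scope,
        // and the callee only reads from it before making its own copies.
        int[64] store = void;
        auto newAvailable = ScopeBuffer!int(store);
        scope(exit) newAvailable.free();

        foreach (j, r; available)
            if (j != i)
                newAvailable.put(r);

        count += solveSerialNoAlloc(partial ~ row,   // this copy still allocates
                                    newAvailable[0 .. newAvailable.length]);
    }
    return count;
}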

So, the question is: why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc as the underlying allocator instead of using glibc? Or does the D runtime use mmap/sbrk/etc. directly?

Thanks.
