Hi,

I was playing with std.parallelism and implemented a parallel NQueens solver to count the number of solutions to the classical NQueens problem. I do limited parallel recursion using taskPool's parallel foreach, and at some level switch to a serial algorithm. I use return values / atomic variables to count the number of iterations / solutions, and then propagate the counts up via return values and aggregate them. (I might later switch to taskPool.reduce with a custom opAdd on some struct, or use taskPool.workerLocalStorage, which I think is an awesome concept.)
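
Roughly, the structure looks like this. This is a minimal, self-contained sketch rather than my actual code; the helper names, the cutoff depth, and N are made up for illustration:

import std.parallelism;
import core.atomic;

// Does placing a queen in `row` at the next column conflict with the
// queens already placed? (partial[c] = row of the queen in column c)
bool safe(const int[] partial, int row)
{
    foreach (col, r; partial)
    {
        immutable dist = cast(int)(partial.length - col);
        if (r == row || r - row == dist || row - r == dist)
            return false;
    }
    return true;
}

// Plain serial recursion for the lower levels.
ulong solveSerial(const int[] partial, const int[] available)
{
    if (available.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; available)
        if (safe(partial, row))
            count += solveSerial(partial ~ row,                            // copy + append
                                 available[0 .. i] ~ available[i + 1 .. $]); // copy all but one
    return count;
}

shared ulong totalSolutions;

// Parallel recursion for the top levels; each branch gets its own copies,
// so there is no sharing between tasks.
void solveParallel(const int[] partial, const int[] available, int depth)
{
    if (depth == 0)
    {
        atomicOp!"+="(totalSolutions, solveSerial(partial, available));
        return;
    }
    foreach (i, row; taskPool.parallel(available))
        if (safe(partial, row))
            solveParallel(partial ~ row,
                          available[0 .. i] ~ available[i + 1 .. $],
                          depth - 1);
}

void main()
{
    import std.array : array;
    import std.range : iota;
    import std.stdio : writeln;
    enum N = 10;
    solveParallel(null, iota(N).array, 2);
    writeln(totalSolutions); // 724 solutions for N = 10
}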

Anyhow, everything was fine on 1 or 2 threads (or maybe a few more) on my laptop with an old dual-core CPU. I was able to speed it up by exactly a factor of 2x.

I wanted to try it out on a bigger machine, so I used an Amazon AWS EC2 c4.8xlarge instance with 18 cores (36 hyperthreads/vCPUs), and the results were pretty terrible.

I was hardly able to utilize more than 3% of each CPU, and the program actually ran significantly slower, which was surprising given that there was no synchronization, no false sharing, and no overly fine-grained parallelism in the program.

strace -f was showing tons of futex operations, but I know that std.parallelism's parallel foreach implementation doesn't use synchronization in the main loop, just some atomic variable reads in case another thread broke out of the loop or threw an exception. That couldn't be the bottleneck.

I traced the problem to the array allocations in my implementation. Because I only want to iterate over the remaining possible rows, I keep int[] partialSolution and int[] availableRows, with the invariant that their intersection is empty and their union contains the integers 0 to N-1 (N = size of the problem).

Because partialSolution grows by 1 at each level of the recursion, I just copy it, append a single element, and pass it down recursively (either to a different thread or as a plain function call, depending on how deep in the recursion we already are). This could be solved with a singly-linked list sharing a common tail between all partialSolutions in different branches, but then the list would be reversed and I would lose random access, which is a little annoying (though it probably shouldn't hurt performance, as I am going to scan the entire partialSolution array anyway). And I would still need to allocate a new list node (which, granted, could be sped up using a thread-local free list).
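
For what it's worth, that shared-tail alternative would look something like this (Node and extend are hypothetical names, just to illustrate the trade-off):

// Hypothetical shared-tail list: each branch prepends its own node, and all
// branches extending the same prefix reuse the same tail (no copying).
struct Node
{
    int row;
    Node* next;   // the shared tail
}

// One small GC allocation per level instead of copying the whole array,
// but the resulting list is in reverse order and has no random access.
Node* extend(Node* tail, int row)
{
    return new Node(row, tail);
}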

An even bigger problem is availableRows, from which I need to remove elements, which equates to allocating a new array and copying all elements but one. This cannot be done with a list. And a COW tree would be too expensive, and would still require some allocations and possibly rebalancing, etc.

I found that this is indeed the problem, because if I allocate int[64] x = void; on the stack and then build newAvailableRows in a ScopeBuffer!int (from std.internal.scopebuffer) backed by it (which is safe, because I wait for the threads to finish before I exit the scope, and the threads only read from these arrays before making their own copies), I am able to run my NQueens implementation at full system utilization (36 hyperthreads at 100% each, for a speedup of about 21x), and it solves N=18 in 7 seconds (compared to about 2.5 minutes), with the parallel part going up to level 8 of the recursion (to improve load balancing).
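
For reference, the serial leaf with the stack buffer looks roughly like this. It is only a sketch assuming the ScopeBuffer API in std.internal.scopebuffer (constructor taking the stack array, put, length, slicing, free); being an internal module, its interface may differ between releases. solveSerialNoAlloc is a made-up name, and safe() is the helper from the first sketch above:

import std.internal.scopebuffer;

ulong solveSerialNoAlloc(const int[] partial, const int[] available)
{
    if (available.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; available)
    {
        if (!safe(partial, row))
            continue;

        // Stack-backed buffer instead of a GC allocation per branch. The
        // slice handed to the recursive call never outlives this scope,
        // and the callee only reads from it before making its own copies.
        int[64] store = void;
        auto newAvailable = ScopeBuffer!int(store);
        scope(exit) newAvailable.free();

        foreach (j, r; available)
            if (j != i)
                newAvailable.put(r);

        count += solveSerialNoAlloc(partial ~ row,   // this copy still allocates
                                    newAvailable[0 .. newAvailable.length]);
    }
    return count;
}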

So, the question is: why is the D / DMD allocator so slow under heavy multithreading? The working set is pretty small (a few megabytes at most), so I do not think this is an issue with GC scanning itself. Can I plug in tcmalloc / jemalloc as the underlying allocator instead of using glibc? Or does the D runtime use mmap/sbrk/etc. directly?

Thanks.
