OK, well I now have so much data sitting in from of me I don't even know *what* I'm seeing any more. I have made several significant discoveries though...

Consider the following:

 msort [] = []
 msort [x] = [x]
 msort xs =
   let
     xs0 = msort (split0 xs)
     xs1 = msort (split1 xs)
   in merge xs0 xs1

This takes roughly 14 seconds to sort a list of one million Word32 values. If I now change the final line to read

 in listSeq rwhnf xs0 `seq` listSeq rwhnf xs1 `seq` merge xs0 xs1

it now takes 8 seconds to do the same job.

Notice that this is still completely sequential execution. It's just executing in a different order. (And, at first glance, doing slightly more work.) Of all the benchmarks I've performed, I have yet to find anything that goes faster than this. If I make it so that xs0 is computed in parallel with xs1 instead of in series, then it goes at roughly the same speed (but with more variation) if the number of real threads is 1. If you add more real threads, execution slows down. (Go figure!) I was expecting running parallel at just the top few levels and then switching to pure sequential for the lower levels to give the best performance. But the numbers I have seem to say that more parallel = slower, with 100% sequential giving me the fastest time of all.

The next insight happens when you look at the GC statistics. Both the unmarked and the explicitly sequential program are giving me roughly 55% GC time and 45% user time. (!!) Obviously this is a Very Bad Thing. I discovered that simply by adding -H200m to the command line, I can make both programs speed up by about 20% or so. (They then drop down to roughly 25% GC time. Adding more RAM doesn't seem to make any difference.)

I had assumed that the explicitly sequential program was faster because it was somehow demanding less GC time, but that doesn't appear to be the case - both GC time and user time are lower for the explicitly sequential version. And adding more heap space doesn't make the (large) speed difference go away. Is the strictness of the seq operator making GHC come up with different a Core implementation for this function or something? I have no idea.

With the extra heap space, the speed difference between the sequential and parallel programs becomes smaller. The sequential version *is* still faster, however. I have no explanation for why that might be. Adding more heap also seems to make the runtimes more variable. (I run each test 8 times. One test, the fastest run was 7 seconds and the slowest was 11 seconds. That's quite a variation. The sequential algorithm only varies by a few milliseconds each time.)

In short, it seems my little sorting algorithm test is *actually* just stressing out the GC engine, and I'm "really" benchmarking different GC settings. :-(

Questions:

1. Does running the GC force all threads to stop? I know some GC designs do this, but I have no idea how the one GHC implements works.

2. Is the GC still single-threaded? (GHC 6.8.2 here.)

3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free? I know you can find out how much wall time and CPU time you've used, but I couldn't find anything for RAM usage.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to