
I'll keep this brief for now.  One thing I've found in the past, on a 2-core 
machine that was achieving much less than 2x speedup, was that memory 
bandwidth was the limiting factor.

Not all Clojure code allocates memory, but a lot does.  If the hardware in a 
system can write at rate X from a multicore processor to main memory, and a 
single-threaded Clojure program writes to memory at rate 0.5*X, then the 
maximum speedup you will ever get from running the same code on N cores is 2x, 
no matter how large N is.
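
As a rough sketch of that arithmetic (the function name and the simple "min" 
model are only my illustration, not a measurement of any real machine):

```clojure
(defn max-speedup
  "Upper bound on speedup when n-cores each run code that writes to memory
  at single-thread-rate, on hardware whose total write bandwidth is hw-rate.
  Both rates must be in the same units. Illustrative model only: it ignores
  caches, reads, and GC behavior."
  [hw-rate single-thread-rate n-cores]
  (double (min n-cores (/ hw-rate single-thread-rate))))

;; With X = 1.0 and a single thread writing at 0.5*X:
(max-speedup 1.0 0.5 8)   ; => 2.0
(max-speedup 1.0 0.5 48)  ; => 2.0
(max-speedup 1.0 0.5 1)   ; => 1.0
```

Adding cores beyond X / (single-thread rate) does not raise the bound in this 
model.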

One way to check whether this is the problem: change your "burn" function so 
that, instead of using cons to build up a list result, it allocates a mutable 
Java array before the loop, as large as it needs to be at the end, and writes 
values into that.  You can convert it to some other Clojure type at the end of 
the loop if you prefer.
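
A minimal sketch of what I mean, assuming a double array is an acceptable 
element type for your values.  The name "burn-array" is hypothetical, and this 
version deliberately leaves out the "iterate reverse" step from your original, 
which allocates a lot on its own and would need a separate non-allocating 
replacement for a fair comparison:

```clojure
(defn burn-array
  "Hypothetical variant of burn: same arithmetic per element, but results
  are written into a preallocated double array instead of being consed
  onto a list."
  ([] (let [n 10000
            ^doubles arr (double-array n)]   ; allocated once, before the loop
        (loop [i 0]
          (if (>= i n)
            (vec arr)                         ; convert to a Clojure vector at the end
            (do
              (aset arr i
                    (double (* (int i)
                               (+ (float i)
                                  (- (int i)
                                     (/ (float i)
                                        (inc (int i))))))))
              (recur (inc i)))))))
  ([_] (burn-array)))
```

You would then time (doall (map burn-array (range 8))) against 
(doall (pmap burn-array (range 8))), the same way as in your -main.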

I have some C benchmark programs that test memory read and write bandwidth on 
single and multiple cores, which you can run on your Intel machine to see if 
that might be the issue.  If it is, I would expect to see at least a little 
speedup from 1 core to multiple cores, but capped at some maximum speedup that 
is determined by the memory bandwidth, not by the number of cores you run in 
parallel.
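
In the meantime, here is a crude stand-in you could try from Clojure itself. 
It is not one of the C programs, and all of the names in it are hypothetical. 
It just times T concurrent threads that each allocate and fill a large array. 
JIT warm-up and GC will add noise, so treat the numbers as a hint at best:

```clojure
(defn fill-array!
  "Allocates a long array of n elements and writes every slot once.
  Returns the number of elements written."
  [n]
  (let [^longs arr (long-array n)]
    (dotimes [i n]
      (aset arr i (long i)))
    (alength arr)))

(defn elapsed-ms
  "Runs the zero-argument function f and returns elapsed wall-clock
  milliseconds."
  [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

(defn write-scaling
  "For each thread count t, runs t concurrent fill-array! calls and
  returns a sorted map of t -> elapsed milliseconds."
  [n-elems thread-counts]
  (into (sorted-map)
        (for [t thread-counts]
          [t (elapsed-ms
               (fn []
                 ;; doall starts all t futures before any of them is awaited
                 (let [tasks (doall (repeatedly t #(future (fill-array! n-elems))))]
                   (doseq [task tasks] @task))))])))

;; Example call; run it more than once so the JIT has warmed up:
;; (write-scaling 10000000 [1 2 4 8])
;; Call (shutdown-agents) before exiting, since futures use the agent pool.
```

If the elapsed time stays roughly flat as T grows up to the number of cores, 
writes are scaling.  If it grows with T, the threads are probably competing 
for memory bandwidth (or for the allocator/GC).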

I don't currently have any guess about what might be happening with the AMD 
multicore machine.  If you are interested in wild guessing, perhaps there could 
be some kind of multicore cache coherency protocol that is badly configured, 
causing cache lines to be frequently invalidated when multiple cores are 
sharing memory?  That would make more sense if multiple cores were reading from 
and writing to the same cache lines, which doesn't seem terribly likely for a 
typical Clojure program.

Let me know if you are interested and I will find those C programs for you to 
try out.  I got them from somewhere on the Internet and may have tweaked them a 
little bit.


On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

> I've been running compute intensive (multi-day), highly parallelizable 
> Clojure processes on high-core-count machines and blithely assuming that 
> since I saw near maximal CPU utilization in "top" and the like that I was 
> probably getting good speedups. 
> But a colleague recently did some tests and the results are really quite 
> alarming. 
> On Intel machines we're seeing speedups but much less than I expected -- 
> about a 2x speedup going from 1 to 8 cores.
> But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
> almost twice as long on 8 cores as on 1.
> I'm baffled, and unhappy that my runs are probably going slower on 48-core 
> and 64-core nodes than on single-core nodes. 
> It's possible that I'm just doing something wrong in the way that I dispatch 
> the tasks, or that I've missed some Clojure or JVM setting... but right now 
> I'm mystified and would really appreciate some help.
> I'm aware that there's overhead for multicore distribution and that one can 
> expect slowdowns if the computations that are being distributed are fast 
> relative to the dispatch overhead, but this should not be the case here. 
> We're distributing computations that take seconds or minutes, and not huge 
> numbers of them (at least in our tests while trying to figure out what's 
> going on).
> I'm also aware that the test that produced the data I give below, insofar as 
> it uses pmap to do the distribution, may leave cores idle for a bit if some 
> tasks take a lot longer than others, because of the way that pmap allocates 
> cores to threads. But that also shouldn't be a big issue here because for 
> this test all of the threads are doing the exact same computation. And I also 
> tried using an agent-based dispatch approach that shouldn't have the pmap 
> thread allocation issue, and the results were about the same.
> Note also that all of the computations in this test are purely functional and 
> independent -- there shouldn't be any resource contention issues.
> The test: I wrote a time-consuming function that just does a bunch of math 
> and list manipulation (which is what takes a lot of time in my real 
> applications):
> (defn burn 
>  ([] (loop [i 0
>             value '()]
>        (if (>= i 10000)
>          (count (last (take 10000 (iterate reverse value))))
>          (recur (inc i)
>                 (cons 
>                   (* (int i) 
>                      (+ (float i) 
>                         (- (int i) 
>                            (/ (float i) 
>                               (inc (int i))))))
>                   value)))))
>  ([_] (burn)))
> Then I have a main function like this:
> (defn -main 
>  [& args]
>  (time (doall (pmap burn (range 8))))
>  (System/exit 0))
> We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
> 2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We 
> also tried Java 1.6.0_22. We've tried various JVM memory options (via 
> :jvm-opts with -Xmx and -Xms settings) and also with and without 
> -XX:+UseParallelGC. None of this seems to change the picture substantially.
> The results that we get generally look like this:
> - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code 
> above, it takes about 45 seconds (and all cores appear to be fully loaded as 
> it does so). If we change the pmap to just plain map, so that we use only a 
> single core, the time goes up to about 1 minute and 36 seconds. So the 
> speedup for 8 cores is just about 2x, even though there are 8 completely 
> independent tasks. So that's pretty depressing.
> - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, 
> running the same test (with pmap) takes about 4 minutes and 2 seconds. That's 
> really slow! Changing the pmap to map here produces a runtime of about 2 
> minutes and 20 seconds. So it's quite a bit faster on one core than on 8! And 
> all of these times are terrible compared to those on the Intel machine.
> Another strange observation is that we can run multiple instances of the test 
> on the same machine and (up to some limit, presumably) they don't seem to 
> slow each other down, even though just one instance of the test appears to be 
> maxing out all of the CPU according to "top". I suppose that means that "top" 
> isn't telling me what I thought -- my colleague says it can mean that 
> something is blocked in some way with a full instruction queue. But I'm not 
> interested in running multiple instances. I have single computations that 
> involve multiple expensive but independent subcomputations, and I want to 
> farm those subcomputations out to multiple cores -- and get speedups as a 
> result. My subcomputations are so completely independent that I think I 
> should be able to get speedups approaching a factor of n for n cores, but 
> what I see is a factor of only about 2 on Intel machines, and a bizarre 
> factor of about 1/2 on AMD machines.
> Any help would be greatly appreciated!
> Thanks,
> -Lee
> --
> Lee Spector, Professor of Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspec...@hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
