I'm glad somebody else can duplicate our findings! I get similar results on Intel hardware. On AMD hardware the disparity is bigger, and multiple threads within a single JVM invocation consistently run slower for me than a single thread. Also, your results are on Mac OS X and mine are on Linux, which raises the question: is this generally true of Java, or is it something specific to Clojure?
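One way to attack that question is to take Clojure out of the loop entirely: run an allocation-heavy, CPU-bound workload in N threads of one JVM, then compare against N separate single-threaded JVM processes, as Andy did with "lein2 run". Here is a minimal plain-Java sketch of that experiment; the `burn` below is a hypothetical stand-in for Lee's workload (deliberately allocation-heavy, since idiomatic Clojure allocates constantly), not his actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class BurnBench {
    // Hypothetical stand-in for the "burn" workload: CPU-bound and
    // deliberately allocation-heavy (small short-lived arrays).
    static long burn(int seed) {
        long acc = seed;
        for (int i = 0; i < 500_000; i++) {
            long[] chunk = new long[8];   // small short-lived allocation
            chunk[i & 7] = acc + i;
            acc += chunk[i & 7] & 0xFF;
        }
        return acc;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        long start = System.nanoTime();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < n; t++) {
            final int seed = t;
            Thread th = new Thread(() -> burn(seed));
            th.start();
            threads.add(th);
        }
        for (Thread th : threads) th.join();   // wait for all workers
        System.out.printf("%d threads: %.2f s%n",
                          n, (System.nanoTime() - start) / 1e9);
    }
}
```

If `java BurnBench 8` in one JVM scales as badly as the Clojure version while 8 concurrent `java BurnBench 1` processes scale well, the interference is in the JVM itself; if the pure-Java version scales fine either way, the finger points back at Clojure.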
On Saturday, December 8, 2012 3:31:25 PM UTC-5, Andy Fingerhut wrote:
>
> I haven't analyzed your results in detail, but here are some results I had
> on my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011.
>
> When running multiple threads within a single JVM invocation, I never got
> a speedup of even 2. The highest speedup I measured was 1.82, when I ran
> 8 threads using -XX:+UseParallelGC. I tried with -XX:+UseParNewGC but
> never got a speedup over 1.45 (with 4 threads in parallel -- it was lower
> with 8 threads).
>
> When running multiple invocations of "lein2 run" in parallel as separate
> processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40
> with 4 processes, and 5.34 with 8 processes (it went over 4, I think,
> because of the 2 hyperthreads per each of the 4 cores).
>
> This is a strong indication that the issue is some kind of interference
> between multiple threads in the same JVM, not the hardware, at least on
> my hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java
> 1.6.0_37).
>
> My first guess would be that even with -XX:+UseParallelGC or
> -XX:+UseParNewGC, there is either some kind of interference with garbage
> collection, or perhaps there is even some kind of interference between
> threads when allocating memory? Should JVM memory allocations be
> completely parallel with no synchronization when running multiple
> threads, or do memory allocations sometimes lock a shared data structure?
>
> Andy
>
> On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:
>
> Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running
> things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4
> 945, and found some very strange results, which I shall post here, after
> I explain a little function that Lee wrote that is designed to get
> improved results over pmap.
> It looks like this:
>
>     (defn pmapall
>       "Like pmap but: 1) coll should be finite, 2) the returned sequence
>       will not be lazy, 3) calls to f may occur in any order, to maximize
>       multicore processor utilization, and 4) takes only one coll so far."
>       [f coll]
>       (let [agents (map agent coll)]
>         (dorun (map #(send % f) agents))
>         (apply await agents)
>         (doall (map deref agents))))
>
> Refer to Lee's first post for the benchmarking routine we're running.
>
> I figured that, in order to figure out whether it was Java's
> multithreading that was the problem (as opposed to memory bandwidth, or
> the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) to
> running 8 concurrent copies of (burn (rand-int 8)), or even just
> (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does
> this make sense? I THINK it does. If it doesn't, that's cool -- just let
> me know why and I'll feel less crazy, because I am finding my results
> rather confounding.
>
> On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do
> (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))),
> and 1:48 to do (doall (pmapall burn (range 8))).
>
> So that's weird, because although pmapall gives a smaller slowdown than
> pmap, we still don't see a speedup compared to map. Watching processor
> utilization while these are going on shows that map is using one core,
> and both pmap and pmapall are using all four cores fully, as they
> should. So maybe the OS or the hardware just can't deal with running
> that many copies of burn at once? Maybe there's a memory bottleneck?
>
> Now here's the weird part: it takes around 29 seconds to do four
> concurrent copies of (doall (map burn (range 2))), and around 33 seconds
> to run 8 copies of (burn 2). Yes. Read that again. What?
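As a side note, pmapall's strategy (eagerly start one task per element, then await them all) can be reproduced in plain Java with a fixed thread pool, which makes a handy control for separating agent overhead from JVM-level threading effects. A hedged sketch under that assumption -- the class and method names here are mine, not from the thread, and this is only a rough analogue of the agent semantics:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PmapAll {
    // Rough analogue of pmapall: submit every task up front on a pool
    // sized to the machine, then await all results in order.
    static <A, B> List<B> pmapAll(Function<A, B> f, List<A> coll) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<B>> futures = new ArrayList<>();
            for (A x : coll) {
                futures.add(pool.submit(() -> f.apply(x)));
            }
            List<B> results = new ArrayList<>();
            for (Future<B> fu : futures) {
                try {
                    results.add(fu.get());   // blocks until that task finishes
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(pmapAll(x -> x * x, List.of(1, 2, 3, 4)));
        // prints [1, 4, 9, 16]
    }
}
```

Timing this against the agent-based pmapall on the same workload should show whether the agent machinery itself adds measurable overhead; the submit-all-then-await-all structure is the same in both.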
> Watching top while this is going on shows what you would expect to see:
> when I run four concurrent copies, I've got four copies of Java using
> 100% of a core each, and when I run eight concurrent copies, I see eight
> copies of Java, each using around 50% of a core.
>
> Also, by the way, it takes 48 seconds to run two concurrent copies of
> (doall (map burn (range 4))) and 1:07 to run two concurrent copies of
> (doall (pmap burn (range 4))).
>
> What is going on here? Is Java's multithreading really THAT bad? This
> appears to me to prove that Java, or Clojure, has something very
> seriously wrong with it, or has outrageous amounts of overhead when
> spawning a new thread. No?
>
> All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and
> Clojure 1.5.0-beta1. (I tried increasing the memory allowed for the pmap
> and pmapall runs, even to 8g, and it doesn't help at all.)
> Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
> on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP
>
> Any thoughts or ideas?
>
> There's more weirdness, too, in case anybody is interested. I'm getting
> results that vary strangely from other available benchmarks and make no
> sense to me. Check this out (these are incomplete, because I decided to
> dig deeper with the benchmarks above, but you'll see, I think, why this
> is so confusing, if you know how fast these processors are "supposed" to
> be). Same :jvm-opts, Clojure version, and JVM as above. Key:
>
> 1. (pmap burn (range 8))
> 2. (map burn (range 8))
> 3. 8 concurrent copies of (pmap burn (range 8))
> 4. 8 concurrent copies of (map burn (range 8))
> 5. (pmapall burn (range 8))
>
> 4x AMD Opteron 6168:
> 1. 4:02.06
> 2. 2:20.29
> 3.
> 4.
>
> AMD Phenom II X4 945:
> 1. 2:31.65
> 2. 1:29.90
> 3. 3:32.60
> 4. 3:08.97
> 5. 1:48.36
>
> AMD Phenom II X6 1100T:
> 1. 2:03.71
> 2. 1:14.76
> 3. 2:20.14
> 4. 1:57.38
> 5. 2:14.43
>
> AMD FX 8120:
> 1. 4:50.06
> 2. 1:25.04
> 3. 5:55.84
> 4. 2:46.94
> 5. 4:36.61
>
> AMD FX 8350:
> 1. 3:42.35
> 2. 1:13.94
> 3. 3:00.46
> 4. 2:06.18
> 5. 3:56.95
>
> Intel Core i7 3770K:
> 1. 0:44
> 2. 1:37.18
> 3. 2:29.41
> 4. 2:16.05
> 5. 0:44.42
>
> 2 x Intel Paxville DP Xeon:
> 1. 6:26.112
> 2. 3:20.149
> 3. 8:09.85
> 4. 7:06.52
> 5. 5:55.29
>
> On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall
> Bockrath-Vandegrift wrote:
>>
>> Lee Spector <lspe...@hampshire.edu> writes:
>>
>> > I'm also aware that the test that produced the data I give below,
>> > insofar as it uses pmap to do the distribution, may leave cores idle
>> > for a bit if some tasks take a lot longer than others, because of the
>> > way that pmap allocates cores to threads.
>>
>> Although it doesn't impact your benchmark, `pmap` may be further
>> adversely affecting the performance of your actual program. There's an
>> open bug regarding `pmap` and chunked seqs:
>>
>> http://dev.clojure.org/jira/browse/CLJ-862
>>
>> The impact is that `pmap` with chunked seq input will spawn futures for
>> its function applications in flights of 32, spawning as many flights as
>> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
>> initially launch 64 futures, then spawn an additional 32 every time the
>> number of active unrealized futures drops below 50, leading to
>> significant contention for a CPU-bound application.
>>
>> I hope it can be made useful in a future version of Clojure, but right
>> now `pmap` is more of an attractive nuisance than anything else.
>>
>> -Marshall

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en