I'm glad somebody else can duplicate our findings! I get similar results on Intel hardware. On AMD hardware the disparity is bigger, and multiple threads within a single JVM invocation consistently run slower for me than a single thread. Also, your results are on Mac OS X and mine are on Linux, which raises the question: is this generally true of Java, or is it something specific to Clojure?
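One way to attack that question is to take Clojure out of the loop entirely: run an allocation-heavy, CPU-bound workload in N threads of one JVM, then compare against N separate single-threaded JVM processes, as Andy did with "lein2 run". Here is a minimal plain-Java sketch of that experiment; the `burn` below is a hypothetical stand-in for Lee's workload (deliberately allocation-heavy, since idiomatic Clojure allocates constantly), not his actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class BurnBench {
    // Hypothetical stand-in for the "burn" workload: CPU-bound and
    // deliberately allocation-heavy (small short-lived arrays).
    static long burn(int seed) {
        long acc = seed;
        for (int i = 0; i < 500_000; i++) {
            long[] chunk = new long[8];   // small short-lived allocation
            chunk[i & 7] = acc + i;
            acc += chunk[i & 7] & 0xFF;
        }
        return acc;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        long start = System.nanoTime();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < n; t++) {
            final int seed = t;
            Thread th = new Thread(() -> burn(seed));
            th.start();
            threads.add(th);
        }
        for (Thread th : threads) th.join();   // wait for all workers
        System.out.printf("%d threads: %.2f s%n",
                          n, (System.nanoTime() - start) / 1e9);
    }
}
```

If `java BurnBench 8` in one JVM scales as badly as the Clojure version while 8 concurrent `java BurnBench 1` processes scale well, the interference is in the JVM itself; if the pure-Java version scales fine either way, the finger points back at Clojure.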
On Saturday, December 8, 2012 3:31:25 PM UTC-5, Andy Fingerhut wrote:
>
> I haven't analyzed your results in detail, but here are some results I had
> on my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011.
>
> When running multiple threads within a single JVM invocation, I never got
> a speedup of even 2. The highest speedup I measured was 1.82, when I ran
> 8 threads using -XX:+UseParallelGC. I tried with -XX:+UseParNewGC but
> never got a speedup over 1.45 (with 4 threads in parallel -- it was lower
> with 8 threads).
>
> When running multiple invocations of "lein2 run" in parallel as separate
> processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40
> with 4 processes, and 5.34 with 8 processes (it went over 4, I think,
> because of the 2 hyperthreads per each of the 4 cores).
>
> This is a strong indication that the issue is some kind of interference
> between multiple threads in the same JVM, not the hardware, at least on
> my hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java
> 1.6.0_37).
>
> My first guess would be that even with -XX:+UseParallelGC or
> -XX:+UseParNewGC, there is either some kind of interference with garbage
> collection, or perhaps there is even some kind of interference between
> threads when allocating memory? Should JVM memory allocations be
> completely parallel with no synchronization when running multiple
> threads, or do memory allocations sometimes lock a shared data structure?
>
> Andy
>
> On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:
>
> Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running
> things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4
> 945, and found some very strange results, which I shall post here, after
> I explain a little function that Lee wrote that is designed to get
> improved results over pmap.
> It looks like this:
>
>     (defn pmapall
>       "Like pmap but: 1) coll should be finite, 2) the returned sequence
>       will not be lazy, 3) calls to f may occur in any order, to maximize
>       multicore processor utilization, and 4) takes only one coll so far."
>       [f coll]
>       (let [agents (map agent coll)]
>         (dorun (map #(send % f) agents))
>         (apply await agents)
>         (doall (map deref agents))))
>
> Refer to Lee's first post for the benchmarking routine we're running.
>
> I figured that, in order to figure out whether it was Java's
> multithreading that was the problem (as opposed to memory bandwidth, or
> the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) to
> running 8 concurrent copies of (burn (rand-int 8)), or even just
> (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does
> this make sense? I THINK it does. If it doesn't, that's cool -- just let
> me know why and I'll feel less crazy, because I am finding my results
> rather confounding.
>
> On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do
> (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))),
> and 1:48 to do (doall (pmapall burn (range 8))).
>
> So that's weird, because although pmapall gives a smaller slowdown than
> pmap, we still don't see a speedup compared to map. Watching processor
> utilization while these are going on shows that map is using one core,
> and both pmap and pmapall are using all four cores fully, as they
> should. So maybe the OS or the hardware just can't deal with running
> that many copies of burn at once? Maybe there's a memory bottleneck?
>
> Now here's the weird part: it takes around 29 seconds to do four
> concurrent copies of (doall (map burn (range 2))), and around 33 seconds
> to run 8 copies of (burn 2). Yes. Read that again. What?
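As a side note, pmapall's strategy (eagerly start one task per element, then await them all) can be reproduced in plain Java with a fixed thread pool, which makes a handy control for separating agent overhead from JVM-level threading effects. A hedged sketch under that assumption -- the class and method names here are mine, not from the thread, and this is only a rough analogue of the agent semantics:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PmapAll {
    // Rough analogue of pmapall: submit every task up front on a pool
    // sized to the machine, then await all results in order.
    static <A, B> List<B> pmapAll(Function<A, B> f, List<A> coll) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<B>> futures = new ArrayList<>();
            for (A x : coll) {
                futures.add(pool.submit(() -> f.apply(x)));
            }
            List<B> results = new ArrayList<>();
            for (Future<B> fu : futures) {
                try {
                    results.add(fu.get());   // blocks until that task finishes
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(pmapAll(x -> x * x, List.of(1, 2, 3, 4)));
        // prints [1, 4, 9, 16]
    }
}
```

Timing this against the agent-based pmapall on the same workload should show whether the agent machinery itself adds measurable overhead; the submit-all-then-await-all structure is the same in both.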
> Watching top while this is going on shows what you would expect to see:
> when I run four concurrent copies, I've got four copies of Java using
> 100% of a core each, and when I run eight concurrent copies, I see eight
> copies of Java, each using around 50% of a core.
>
> Also, by the way, it takes 48 seconds to run two concurrent copies of
> (doall (map burn (range 4))) and 1:07 to run two concurrent copies of
> (doall (pmap burn (range 4))).
>
> What is going on here? Is Java's multithreading really THAT bad? This
> appears to me to prove that Java, or Clojure, has something very
> seriously wrong with it, or has outrageous amounts of overhead when
> spawning a new thread. No?
>
> All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and
> Clojure 1.5.0-beta1. (I tried increasing the memory allowed for the pmap
> and pmapall runs, even to 8g, and it doesn't help at all.)
> Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
> on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP
>
> Any thoughts or ideas?
>
> There's more weirdness, too, in case anybody is interested. I'm getting
> results that vary strangely from other available benchmarks and make no
> sense to me. Check this out (these are incomplete, because I decided to
> dig deeper with the benchmarks above, but you'll see, I think, why this
> is so confusing, if you know how fast these processors are "supposed" to
> be). Same :jvm-opts, Clojure version, and JVM as above. Key:
>
> 1. (pmap burn (range 8))
> 2. (map burn (range 8))
> 3. 8 concurrent copies of (pmap burn (range 8))
> 4. 8 concurrent copies of (map burn (range 8))
> 5. (pmapall burn (range 8))
>
> 4x AMD Opteron 6168:
> 1. 4:02.06
> 2. 2:20.29
> 3.
> 4.
>
> AMD Phenom II X4 945:
> 1. 2:31.65
> 2. 1:29.90
> 3. 3:32.60
> 4. 3:08.97
> 5. 1:48.36
>
> AMD Phenom II X6 1100T:
> 1. 2:03.71
> 2. 1:14.76
> 3. 2:20.14
> 4. 1:57.38
> 5. 2:14.43
>
> AMD FX 8120:
> 1. 4:50.06
> 2. 1:25.04
> 3. 5:55.84
> 4. 2:46.94
> 5. 4:36.61
>
> AMD FX 8350:
> 1. 3:42.35
> 2. 1:13.94
> 3. 3:00.46
> 4. 2:06.18
> 5. 3:56.95
>
> Intel Core i7 3770K:
> 1. 0:44
> 2. 1:37.18
> 3. 2:29.41
> 4. 2:16.05
> 5. 0:44.42
>
> 2 x Intel Paxville DP Xeon:
> 1. 6:26.112
> 2. 3:20.149
> 3. 8:09.85
> 4. 7:06.52
> 5. 5:55.29
>
> On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall
> Bockrath-Vandegrift wrote:
>>
>> Lee Spector <lspe...@hampshire.edu> writes:
>>
>> > I'm also aware that the test that produced the data I give below,
>> > insofar as it uses pmap to do the distribution, may leave cores idle
>> > for a bit if some tasks take a lot longer than others, because of the
>> > way that pmap allocates cores to threads.
>>
>> Although it doesn't impact your benchmark, `pmap` may be further
>> adversely affecting the performance of your actual program. There's an
>> open bug regarding `pmap` and chunked seqs:
>>
>> http://dev.clojure.org/jira/browse/CLJ-862
>>
>> The impact is that `pmap` with chunked seq input will spawn futures for
>> its function applications in flights of 32, spawning as many flights as
>> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
>> initially launch 64 futures, then spawn an additional 32 every time the
>> number of active unrealized futures drops below 50, leading to
>> significant contention for a CPU-bound application.
>>
>> I hope it can be made useful in a future version of Clojure, but right
>> now `pmap` is more of an attractive nuisance than anything else.
>>
>> -Marshall

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en