Re: abysmal multicore performance, especially on AMD processors

Andy Fingerhut Thu, 13 Dec 2012 10:53:34 -0800

I'm not saying that I know this will help, but if you are open to trying a 
different JVM that has had a lot of work done on it to optimize it for high 
concurrency, Azul's Zing JVM may be worth a try, to see if it increases 
parallelism for a single Clojure instance in a single JVM, with lots of threads.


It costs $$, but I'm guessing they might have steep discounts for educational 
institutions.  They have free trials, too.

http://www.azulsystems.com/products/zing/whatisit

Andy

On Dec 13, 2012, at 10:41 AM, Wm. Josiah Erikson wrote:

> OK, I did something a little bit different, but I think it proves the same 
> thing we were shooting for. 
> 
> On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's "Bowling" 
> benchmark: 
> 
> 1. multithreaded. Average of 10 runs: 14:00.9 
> 2. singlethreaded. Average of 10 runs: 23:35.3 
> 3. singlethreaded, 8 simultaneous copies. Average run time of said 
> concurrently running copies: 23:31.5 
> 
> So we see a speedup of less than 2x running multithreaded in a single JVM 
> instance. By contrast, running 8 simultaneous copies in 8 separate JVMs 
> gives us a perfect 8x speedup in throughput over running a single instance 
> of the same singlethreaded benchmark. This proves pretty conclusively that 
> it's not a hardware limitation, it seems to me... unless the problem is 
> that it's trying to spawn 48 threads, and that creates contention. 
> 
> I don't think so though, because on an 8-way FX-8120 with 16GB of RAM, we see 
> a very similar lack of speedup going from singlethreaded to multithreaded 
> (and it will only be trying to use 8 threads, right?), and then we see a much 
> better speedup (around 4x: we're doing 8 times the work in twice the amount 
> of time) going to 8 concurrent copies of the same thing in separate JVMs 
> (even though I had to decrease RAM usage on the 8 concurrent copies to avoid 
> swapping, thereby possibly slowing this down a bit): 
> 1. 9:00.6 
> 2. 14:15.6 
> 3. 27:35.1 
> 
> We're probably getting a better speedup with the concurrent copies on the 
> 48-way node because of higher memory bandwidth, bigger caches (and more of 
> them), and more memory. 
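
For anyone who wants to check the arithmetic behind the "less than 2x", "8x" and "around 4x" figures, here is a small sketch that recomputes the speedups from the timings quoted above. The helper names (to_seconds, speedup) are made up for illustration, and the script is Python only for convenience; it is not part of the benchmark itself.

```python
def to_seconds(mmss: str) -> float:
    """Convert a 'MM:SS.s' wall-clock string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + float(seconds)


def speedup(baseline: str, measured: str, copies: int = 1) -> float:
    """Throughput speedup: `copies` units of work finished in `measured`
    time, relative to one unit of work finished in `baseline` time."""
    return copies * to_seconds(baseline) / to_seconds(measured)


# Timings exactly as reported in the message above.
runs = {
    "48-way Opteron 6168": {"multi": "14:00.9", "single": "23:35.3", "copies8": "23:31.5"},
    "8-way FX-8120": {"multi": "9:00.6", "single": "14:15.6", "copies8": "27:35.1"},
}

for machine, t in runs.items():
    threads = speedup(t["single"], t["multi"])
    jvms = speedup(t["single"], t["copies8"], copies=8)
    print(f"{machine}: threads {threads:.2f}x, 8 separate JVMs {jvms:.2f}x")

# 48-way Opteron 6168: threads 1.68x, 8 separate JVMs 8.02x
# 8-way FX-8120: threads 1.58x, 8 separate JVMs 4.14x
```

The multithreaded speedup is roughly 1.6x to 1.7x on both machines, while the separate-JVM throughput scales to about 8x on the Opteron box and about 4x on the FX-8120.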
> 
> Does this help? Should I do something else as well? I'm curious to try 
> running like, say 16 concurrent copies on the 48-way node.... 
> 
> 
>     -Josiah 
> 
> On Wed, Dec 12, 2012 at 10:03 AM, Andy Fingerhut <andy.finger...@gmail.com> 
> wrote:
> Lee:
> 
> I believe you said that your benchmarking code achieved good speedup when 
> run as separate JVMs that were each running a single thread, even before 
> making the changes to the implementation of reverse found by Marshall.  I 
> confirmed that on my own machine as well.
> 
> Have you tried running your real application in a single thread in a JVM, and 
> then run multiple JVMs in parallel, to see if there is any speedup?  If so, 
> that would again help determine whether it is multiple threads in a single 
> JVM causing the slowdown, or something to do with the hardware or OS that is 
> the limiting factor.
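
One possible way to script the "multiple JVMs in parallel" experiment is sketched below. It launches N copies of a command at once and reports the wall-clock time until all have finished. The function name time_concurrent and the example java command line are assumptions for illustration only; substitute whatever actually starts the single-threaded application.

```python
import subprocess
import time
from typing import Sequence


def time_concurrent(cmd: Sequence[str], copies: int) -> float:
    """Start `copies` independent processes running `cmd` at the same time
    and return the wall-clock seconds until the last one exits."""
    start = time.perf_counter()
    procs = [subprocess.Popen(list(cmd)) for _ in range(copies)]
    exit_codes = [p.wait() for p in procs]  # wait for every copy
    if any(code != 0 for code in exit_codes):
        raise RuntimeError(f"some copies failed: {exit_codes}")
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical command line; the jar name is a placeholder.
    cmd = ["java", "-jar", "benchmark.jar"]
    one = time_concurrent(cmd, copies=1)
    eight = time_concurrent(cmd, copies=8)
    print(f"throughput speedup with 8 JVMs: {8 * one / eight:.2f}x")
```

If 8 copies take about as long as 1 copy, the hardware and OS are scaling fine and the bottleneck is more likely inside the single multithreaded JVM.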
> 
> Andy
> 
> 
> On Dec 11, 2012, at 4:37 PM, Lee Spector wrote:
> 
> >
> > On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:
> >> So I think if you replace your calls to `reverse` and any `conj` loops
> >> you have in your own code, you should see a perfectly reasonable
> >> speedup.
> >
> > Tantalizing, but on investigation I see that our real application actually 
> > does very little explicitly with reverse or conj, and I don't actually 
> > think that we're getting reasonable speedups (which is what led me to try 
> > that benchmark). So while I'm not sure of the source of the problem in our 
> > application I think there can be a problem even if one avoids direct calls 
> > to reverse and conj. Andy's recent tests also seem to confirm this.
> >
> > BTW benchmarking our real application (https://github.com/lspector/Clojush) 
> > is a bit tricky because it's riddled with random number generator calls 
> > that can have big effects, but we're going to look into working around 
> > that. Recent postings re: seedable RNGs may help, although changing all of 
> > the RNG code may be a little involved because we use thread-local RNGs (to 
> > avoid contention and get good multicore speedups... we thought!).
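
To illustrate the idea of seedable thread-local RNGs mentioned here, the following is a minimal sketch, not Clojush's actual code. It is written in Python purely for illustration; on the JVM the analogous pieces would be java.lang.ThreadLocal and java.util.Random constructed with a seed. BASE_SEED, thread_rng, worker and run are all hypothetical names.

```python
import random
import threading

_local = threading.local()
BASE_SEED = 42  # hypothetical; any fixed integer makes runs repeatable


def thread_rng(worker_id: int) -> random.Random:
    """Return this thread's private generator, creating it on first use.
    Each thread gets its own Random, so no generator state is shared."""
    if not hasattr(_local, "rng"):
        _local.rng = random.Random(BASE_SEED + worker_id)
    return _local.rng


def worker(worker_id: int, out: dict) -> None:
    rng = thread_rng(worker_id)
    out[worker_id] = [rng.randint(0, 99) for _ in range(3)]


def run(n_workers: int = 4) -> dict:
    """Run the workers in separate threads and collect their draws."""
    out = {}
    threads = [threading.Thread(target=worker, args=(i, out))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    # Same seeds give the same per-worker sequences, regardless of
    # how the threads happen to be scheduled.
    print(run() == run())
```

Because each worker's sequence depends only on its own seed, a benchmark run becomes repeatable as long as the assignment of work to workers is also fixed.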
> >
> > -Lee
> 
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> 
> 

