Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-20 Thread David Iba
Andy: Heh, glad to hear that I'm not the only one facing this issue, and I appreciate the encouragement since it's been kicking my ass the past week :) On the bright side, as someone coming from more of a math background, this has forced me to learn a lot about how cpus/threads/memory/etc.

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Herwig Hochleitner
This reminds me of another thread, where performance issues related to concurrent allocation were explored in depth: https://groups.google.com/d/topic/clojure/48W2eff3caU/discussion The main takeaway for me was that HotSpot will slow down pretty dramatically as soon as there are two threads

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Andy Fingerhut
David: No new suggestions to add right now. Herwig's suggestion that it could be the Java allocator has some evidence for it given your results. I'm not sure whether this StackOverflow Q on TLAB is fully accurate, but it may provide some useful info:

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Fluid Dynamics
On Thursday, November 19, 2015 at 1:36:59 AM UTC-5, David Iba wrote: > OK, have a few updates to report: > - Oracle vs OpenJDK did not make a difference > - Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread David Iba
Yeah, I actually tried using aset as well, and was still seeing these "rogue" threads taking much longer (although the ones that did finish in a normal amount of time had very similar completion times to those running in their own process.) Herwig: I will try those suggestions when I get a

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
Timothy: Each thread (call of f2) creates its own "local" atom, so I don't think there should be any swap retries. Gianluca: Good idea! I've only tried OpenJDK, but I will look into trying Oracle and report back. Andy: jvisualvm was showing pretty much all of the memory allocated in the

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
No worries. Thanks, I'll give that a try as well! On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote: > > Oh, then I completely mis-understood the problem at hand here. If that's > the case then do the following: > > Change "atom" to "volatile!" and "swap!" to "vswap!". See if that

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread gianluca torta
by the way, have you tried both Oracle and Open JDK with the same results? Gianluca On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote: > > David, you say "Based on jvisualvm monitoring, doesn't seem to be > GC-related". > > What is jvisualvm showing you related to GC and/or

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread Timothy Baldridge
This sort of code is close to the worst-case situation for atoms (or really for CAS). Clojure's swap! is built on the "compare-and-swap" (CAS) operation that most x86 CPUs provide as an instruction. If we expand swap!, it looks something like this: (loop [old-val @x*] (let [new-val (assoc old-val
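A minimal sketch of the retry loop described above, for readers who want to see it in full. It is not clojure.core's actual implementation of swap!; the helper name cas-swap! and the counter atom are hypothetical, and only deref and compare-and-set! from clojure.core are used.

```clojure
;; Hypothetical helper (not clojure.core's code): a hand-written
;; compare-and-swap retry loop in the spirit of swap!.
(defn cas-swap!
  [a f & args]
  (loop []
    (let [old-val @a                         ; read the current value
          new-val (apply f old-val args)]    ; compute the candidate value
      (if (compare-and-set! a old-val new-val)
        new-val                              ; nobody else changed it: done
        (recur)))))                          ; another thread won: retry

;; Hypothetical example atom.
(def counter (atom {:n 0}))
(cas-swap! counter update :n inc)
;; => {:n 1}
```

Retries only happen when several threads update the same atom; if each thread owns its own atom, the loop body should run once per call.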

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread Timothy Baldridge
Oh, then I completely misunderstood the problem at hand here. If that's the case, then do the following: change "atom" to "volatile!" and "swap!" to "vswap!", and see if that changes anything. Timothy On Wed, Nov 18, 2015 at 9:00 AM, David Iba wrote: > Timothy: Each thread
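A rough sketch of the substitution suggested above, assuming a thread-local accumulator. The function names count-with-atom and count-with-volatile are hypothetical stand-ins, not the functions from the original post.

```clojure
;; Hypothetical stand-in for the original function: a "local" atom
;; updated with swap! (compare-and-swap on every update).
(defn count-with-atom [n]
  (let [acc (atom 0)]
    (dotimes [_ n] (swap! acc inc))
    @acc))

;; Same logic with volatile!/vswap!. vswap! is not atomic, so this is
;; only safe when a single thread writes to the volatile.
(defn count-with-volatile [n]
  (let [acc (volatile! 0)]
    (dotimes [_ n] (vswap! acc inc))
    @acc))
```

Both versions return the same result; the difference is only in how each update is published to memory.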

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
OK, have a few updates to report: - Oracle vs OpenJDK did not make a difference - Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread Niels van Klaveren
Could you also show how you are running these functions in parallel and time them ? The way you start the functions can have as much impact as the functions themselves. Regards, Niels On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote: > > I have functions f1 and f2 below, and

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread David Iba
Andy: Interesting. Thanks for educating me on the fact that atom swaps don't use the STM. Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, not sure how illuminating the results will be. Niels: along the lines of this (so that each

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread David Iba
Correction: that "do" should be a "doall". (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.) On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote: > > Andy:
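For context on why doall matters here, a minimal sketch of one way to launch N runs in parallel. This is an assumption about the general shape, not the actual test code from this thread; run-parallel and work are hypothetical names.

```clojure
;; Hypothetical launcher: start n futures, then wait for all of them.
;; `for` is lazy, so doall forces every future to be created (and
;; started) before the first deref blocks.
(defn run-parallel
  [n work]
  (let [futures (doall (for [i (range n)]
                         (future (work i))))]
    (mapv deref futures)))

;; Example with a trivial, hypothetical workload.
(run-parallel 4 (fn [i] (* i i)))
;; => [0 1 4 9]
```

Futures run on Clojure's agent thread pool, so a standalone script may want to call (shutdown-agents) once all work is done.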

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread Andy Fingerhut
David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related". What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process? Even memory allocation could become a point of contention, depending upon

Poor parallelization performance across 18 cores (but not 4)

2015-11-16 Thread David Iba
I have functions f1 and f2 below, and let's say they run in T1 and T2 amount of time when running a single instance/thread. The issue I'm facing is that parallelizing f2 across 18 cores takes anywhere from 2-5x T2, and for more complex funcs takes absurdly long. (defn f1 []

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-16 Thread Andy Fingerhut
There is no STM involved if you only have atoms, and no refs, so it can't be STM-related. I have a conjecture, but don't yet have a suggestion for an experiment that would prove or disprove it. The JVM memory model requires that changes to values that should be visible to all threads, like swap!