No worries.  Thanks, I'll give that a try as well!

On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>
> Oh, then I completely misunderstood the problem at hand here. If that's 
> the case, then do the following:
>
> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes 
> anything. 
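>
> Something like this, applied to the f2 from your original post (just a 
> sketch of the suggested change; volatile! has no CAS loop, so there is 
> nothing to retry):
>
> (defn f2 []
>   (let [x* (volatile! {})]       ; was (atom {})
>     (loop [i 1e9]
>       (when-not (zero? i)
>         (vswap! x* assoc :k i)   ; was (swap! x* assoc :k i)
>         (recur (dec i))))))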
>
> Timothy
>
>
> On Wed, Nov 18, 2015 at 9:00 AM, David Iba <davi...@gmail.com> wrote:
>
>> Timothy:  Each thread (call of f2) creates its own "local" atom, so I 
>> don't think there should be any swap retries.
>>
>> Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into 
>> trying Oracle and report back.
>>
>> Andy:  jvisualvm was showing pretty much all of the memory allocated in 
>> the eden space and a little in the first survivor (no major/full GCs), and 
>> total GC time was very minimal.
>>
>> I'm in the middle of running some more tests and will report back when I 
>> get a chance today or tomorrow.  Thanks for all the feedback on this!
>>
>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>
>>> This sort of code is pretty much the worst-case situation for atoms (or 
>>> really for CAS). Clojure's swap! is based on the "compare-and-swap" or CAS 
>>> operation that most x86 CPUs have as an instruction. If we expand swap! it 
>>> looks something like this:
>>>
>>> (loop [old-val @x*]
>>>   (let [new-val (assoc old-val :k i)]
>>>     (if (compare-and-swap x* old-val new-val)
>>>       new-val
>>>       (recur @x*))))
>>>
>>> Compare-and-swap can be defined as "updates the content of the reference 
>>> to new-val only if the current value of the reference is equal to the 
>>> old-val". 
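>>>
>>> Clojure exposes this primitive for atoms directly as compare-and-set!, 
>>> which makes the semantics easy to poke at in a REPL:
>>>
>>> (def a (atom 0))
>>> (compare-and-set! a 0 1) ;=> true,  a is now 1
>>> (compare-and-set! a 0 2) ;=> false, a is still 1 (its value is no longer 0)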
>>>
>>> So in essence, only one core can be modifying the contents of an atom at 
>>> a time.  If the atom is modified during the execution of the swap! call, 
>>> then swap! will continue to re-run your function until it's able to update 
>>> the atom without it being modified during the function's execution. 
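>>>
>>> You can watch the retries happen by counting how many times the update 
>>> function actually runs versus how many swap! calls you made (slow-inc and 
>>> runs are made-up names for this sketch, but the behaviour is real):
>>>
>>> (def runs (java.util.concurrent.atomic.AtomicLong.))
>>> (def a (atom 0))
>>>
>>> (defn slow-inc [x]
>>>   (.incrementAndGet runs)  ; counts every invocation, retries included
>>>   (Thread/sleep 1)         ; widen the race window
>>>   (inc x))
>>>
>>> (doall (pmap (fn [_] (swap! a slow-inc)) (range 16)))
>>> @a           ;=> 16
>>> (.get runs)  ;=> usually > 16 on a multicore box; the extras are retries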
>>>
>>> So let's say you have some super long task whose result you need to 
>>> integrate into an atom. Here's one way to do it, but probably not the best:
>>>
>>> (let [a (atom 0)]
>>>   (dotimes [x 18]
>>>     (future
>>>       (swap! a long-operation-on-score some-param))))
>>>
>>>
>>> In this case long-operation-on-score will need to be re-run every time 
>>> another thread modifies the atom during the call. However, if our function 
>>> only needs the current value of the atom in order to add to it, then we 
>>> can do something like this instead:
>>>
>>> (let [a (atom 0)]
>>>   (dotimes [x 18]
>>>     (future
>>>       (let [score (long-operation-on-score some-param)]
>>>         (swap! a + score)))))
>>>
>>> Now we only have a simple addition inside the swap!, and we will have 
>>> less contention between the CPUs because they will most likely be spending 
>>> more time inside 'long-operation-on-score' than inside the swap.
>>>
>>> *TL;DR*: do as little work as possible inside swap!.  The more you do 
>>> inside swap!, the higher the chance of throwing away work due to swap! 
>>> retries. 
>>>
>>> Timothy
>>>
>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta <giat...@gmail.com> 
>>> wrote:
>>>
>>>> by the way, have you tried both the Oracle JDK and OpenJDK, with the 
>>>> same results?
>>>> Gianluca
>>>>
>>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>>>>>
>>>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be 
>>>>> GC-related".
>>>>>
>>>>> What is jvisualvm showing you related to GC and/or memory allocation 
>>>>> when you tried the 18-core version with 18 threads in the same process?
>>>>>
>>>>> Even memory allocation could become a point of contention, depending 
>>>>> upon how the memory allocator handles many threads.  E.g. it depends on 
>>>>> whether a thread takes a global lock once to get a large chunk of memory, 
>>>>> which it then locally carves up into the small pieces it needs for each 
>>>>> individual Java 'new' allocation, or takes a global lock for every 'new'. 
>>>>> The latter would give terrible performance as the number of cores 
>>>>> increases, but I don't know how to tell which is the case, except by 
>>>>> knowing more about how the memory allocator is implemented in your JVM. 
>>>>> Maybe digging through OpenJDK source code in the right place would tell?
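>>>>>
>>>>> For what it's worth, HotSpot does the former by default via thread-local 
>>>>> allocation buffers (TLABs), so one sanity check would be to re-run the 
>>>>> 18-thread test with -XX:-UseTLAB and see whether things get dramatically 
>>>>> worse.  You can also confirm the setting from the REPL on a HotSpot JVM:
>>>>>
>>>>> (-> (java.lang.management.ManagementFactory/getPlatformMXBean
>>>>>       com.sun.management.HotSpotDiagnosticMXBean)
>>>>>     (.getVMOption "UseTLAB")
>>>>>     (.getValue))
>>>>> ;=> "true" on a stock HotSpot JVM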
>>>>>
>>>>> Andy
>>>>>
>>>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba <davi...@gmail.com> wrote:
>>>>>
>>>>>> correction: that "do" should be a "doall".  (My actual test code was 
>>>>>> a bit different, but each run printed some info when it started, so it 
>>>>>> doesn't have to do with delayed evaluation of lazy seqs or anything.)
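>>>>>>
>>>>>> With the correction applied, the snippet quoted below becomes:
>>>>>>
>>>>>> (time
>>>>>>   (let [f f1
>>>>>>         n-runs 18
>>>>>>         futs (doall (for [i (range n-runs)]
>>>>>>                       (future (time (f)))))]
>>>>>>     (doseq [fut futs]
>>>>>>       @fut)))
>>>>>>
>>>>>> The doall forces the lazy for, so all 18 futures start before any of 
>>>>>> them is dereferenced.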
>>>>>>
>>>>>>
>>>>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>>>>>
>>>>>>> Andy:  Interesting.  Thanks for educating me on the fact that atom 
>>>>>>> swaps don't use the STM.  Your theory seems plausible... I will try those 
>>>>>>> tests next time I launch the 18-core instance, but yeah, I'm not sure how 
>>>>>>> illuminating the results will be.
>>>>>>>
>>>>>>> Niels: along the lines of this (so that each thread prints its time 
>>>>>>> as well as printing the overall time):
>>>>>>>
>>>>>>> (time
>>>>>>>   (let [f f1
>>>>>>>         n-runs 18
>>>>>>>         futs (do (for [i (range n-runs)]
>>>>>>>                    (future (time (f)))))]
>>>>>>>     (doseq [fut futs]
>>>>>>>       @fut)))
>>>>>>>
>>>>>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van 
>>>>>>> Klaveren wrote:
>>>>>>>>
>>>>>>>> Could you also show how you are running these functions in parallel, 
>>>>>>>> and how you time them?  The way you start the functions can have as 
>>>>>>>> much impact as the functions themselves.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Niels
>>>>>>>>
>>>>>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>>>>>>>
>>>>>>>>> I have functions f1 and f2 below; say they take T1 and T2 amounts of 
>>>>>>>>> time when run in a single instance/thread.  The issue I'm facing is 
>>>>>>>>> that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and 
>>>>>>>>> for more complex funcs takes absurdly long.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (defn f1 []
>>>>>>>>>   (apply + (range 2e9)))
>>>>>>>>>
>>>>>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>>>>>>>> ;; should never retry.
>>>>>>>>> (defn f2 []
>>>>>>>>>   (let [x* (atom {})]
>>>>>>>>>     (loop [i 1e9]
>>>>>>>>>       (when-not (zero? i)
>>>>>>>>>         (swap! x* assoc :k i)
>>>>>>>>>         (recur (dec i))))))
>>>>>>>>>
>>>>>>>>> Of note:
>>>>>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 
>>>>>>>>> and T2 for 4 runs in parallel)
>>>>>>>>> - running 18 f1's in parallel on the 18-core machine also 
>>>>>>>>> parallelizes well.
>>>>>>>>> - Disabling hyperthreading doesn't help.
>>>>>>>>> - Based on jvisualvm monitoring, doesn't seem to be GC-related
>>>>>>>>> - also tried on dedicated 18-core ec2 instance with same issues, 
>>>>>>>>> so not shared-tenancy-related
>>>>>>>>> - if I make a jar that runs a single f2 and launch 18 in parallel, 
>>>>>>>>> it parallelizes well (so I don't think it's machine/aws-related)
>>>>>>>>>
>>>>>>>>> Could it be that running 18 f2's in parallel on a single JVM instance 
>>>>>>>>> is overworking the STM with all the swaps?  Any other theories?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>
>>>
>>>
>
>
>
> -- 
> “One of the main causes of the fall of the Roman Empire was that–lacking 
> zero–they had no way to indicate successful termination of their C 
> programs.”
> (Robert Firth) 
>
