Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Timothy Baldridge
You should also specify how many cores you plan on devoting to your application. Notice that most of this discussion has been about JVM apps running on machines with >32 cores. Systems like this aren't exactly common in my line of work (where we tend to run greater numbers of smaller servers using

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Michael Klishin
2013/11/6 Dave Tenny > (To contrast the lengthy discussion and analysis of this topic that is > *hopefully* the exception and not the rule) Some of the comments reveal that part of the problem lies with the JVM memory allocator, which has its throughput limits. There are known large commercia

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread László Török
Hi, I believe Clojure's original mission has been giving you tools for handling concurrency[1] in your programs in a sane way. However, with the advent of Reducers[2], the landscape is changing quite a bit. If you're interested in the concurrency vs. parallelism terminology and what language const

Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Dave Tenny
As a person who has recently been dabbling with clojure for evaluation purposes I wondered if anybody wanted to post some links about parallel clojure apps that have been clear and easy parallelism wins for the types of applications that clojure was designed for. (To contrast the lengthy discu

Re: abysmal multicore performance, especially on AMD processors

2013-11-05 Thread Wm. Josiah Erikson
Neat, thanks for that. I skimmed it and don't know enough about Java to be able to tell quickly how easily we can use this to our advantage, but perhaps somebody else on the list will know. The disruptor project from LMAX has wrestled with these sort of issues at length and achieved astounding leve

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Neale Swinnerton
The disruptor project from LMAX has wrestled with these sorts of issues at length and achieved astounding levels of performance on the JVM. Martin Thompson, the original author of the disruptor, is a leading light in the JVM performance space; his Mechanical Sympathy blog is a goldmine of informatio

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Wm. Josiah Erikson
Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles (an order of magnitude more, last I knew) than processor-to-cache. I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks

Re: abysmal multicore performance, especially on AMD processors

2013-09-26 Thread Andy Fingerhut
Adding to this thread from almost a year ago. I don't have conclusive proof with experiments to show right now, but I do have some experiments that have led me to what I think is a plausible cause of not just Clojure programs running more slowly when multi-threaded than when single-threaded, but a

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Lee Spector
On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote: >> >> Then Wm. Josiah posted a full-application benchmark, which appears to >> have entirely different performance problems from the synthetic `burn` >> benchmark. I’d rejected GC as the cause for the slowdown there too, but >> ATM can’t recall w

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote: > Chas Emerick writes: > >> The nature of the `burn` program is such that I'm skeptical of the >> ability of any garbage-collected runtime (lispy or not) to scale its >> operation across multiple threads. > > Bringing you up to sp

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Marshall Bockrath-Vandegrift
Chas Emerick writes: > Keeping the discussion here would make sense, esp. in light of > meetup.com's horrible "discussion board". Excellent. Solves the problem of deciding the etiquette of jumping on the meetup board for a meetup one has never been involved in. :-) > The nature of the `burn`

multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
Keeping the discussion here would make sense, esp. in light of meetup.com's horrible "discussion board". I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I wrote in that meetup thread, but Lee's challenge(s) were too hard to resist: > "Would your conclusion be somethi

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Andy Fingerhut
Josiah mentioned requesting a free trial of the Zing JVM. Did you ever get access to it, and were you able to try running your code on it? Again, I have no direct experience with their product to guarantee you better results -- just that I've heard good things about their ability to handle con

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Lee Spector
FYI we had a bit of a discussion about this at a meetup in Amherst MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little followup since the discussion, including results of some different experim

Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Marshall Bockrath-Vandegrift
"Wm. Josiah Erikson" writes: > Am I reading this right that this is actually a Java problem, and not > clojure-specific? Wouldn't the rest of the Java community have noticed > this? Or maybe massive parallelism in this particular way isn't > something commonly done with Java in the industry? > >

Re: abysmal multicore performance, especially on AMD processors

2013-01-10 Thread Wm. Josiah Erikson
Am I reading this right that this is actually a Java problem, and not clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the industry? Thanks for the patches though - it's nice

Re: abysmal multicore performance, especially on AMD processors

2012-12-30 Thread cameron
I've posted a patch with some changes here (https://gist.github.com/4416803), it includes the record change here and a small change to interpret-instruction, the benchmark runs > 2x the default as it did for Marshall. The patch also modifies the main loop to use a thread pool instead of agents

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
No, it's not the context switching, changing isArray (a native method) to getAnnotations (a normal jvm method) gives the same time for both the parallel and serial version. Cameron. On Saturday, December 29, 2012 10:34:42 AM UTC+11, Leonardo Borges wrote: > > In that case isn't context switchin

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread Leonardo Borges
In that case isn't context switching dominating your test? .isArray isn't expensive enough to warrant the use of pmap Leonardo Borges www.leonardoborges.com On Dec 29, 2012 10:29 AM, "cameron" wrote: > Hi Lee, > I've done some more digging and seem to have found the root of the > problem, > i

Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
Hi Lee, I've done some more digging and seem to have found the root of the problem, it seems that java native methods are much slower when called in parallel. The following code illustrates the problem: (letfn [(time-native [f] (let [c (class [])] (time (dorun (

Re: abysmal multicore performance, especially on AMD processors

2012-12-24 Thread cameron
I've been moving house for the last week or so but I'll also give the benchmark another look. My initial profiling seemed to show that the parallel version was spending a significant amount of time in java.lang.isArray, clojush.pushstate/stack-ref is calling nth on the result of cons, since it i

Re: abysmal multicore performance, especially on AMD processors

2012-12-22 Thread Lee Spector
On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote: > >> Is there a much simpler way that I overlooked? > > I'm not sure it's simpler, but it's more straight-forward, I'd say. > Thanks Marshall and Meikel for the struct->record conversion code. I'll definitely make a change along those lines.

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Meikel Brandmeyer
Hi, Am 22.12.12 00:37, schrieb Lee Spector: > ;; this is defined elsewhere, and I want push-states to have fields for each > push-type that's defined here > (def push-types '(:exec :integer :float :code :boolean :string :zip > :tag :auxiliary :return :environment) > > (d

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
Lee Spector writes: > FWIW I used records for push-states at one point but did not observe a > speedup and it required much messier code, so I reverted to > struct-maps. But maybe I wasn't doing the right timings. I'm curious > about how you changed to records without the messiness. I'll include

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Lee Spector
On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote: > Not to the bottom of things yet, but found some low-hanging fruit – > switching the `push-state` from a struct-map to a record gives a flat > ~2x speedup in all configurations I tested. So, that’s good? I really appreciate your a

Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
"Wm. Josiah Erikson" writes: > I hope this helps people get to the bottom of things. Not to the bottom of things yet, but found some low-hanging fruit – switching the `push-state` from a struct-map to a record gives a flat ~2x speedup in all configurations I tested. So, that’s good? I have how
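The struct-map to record switch mentioned above can be sketched roughly as follows. This is only an illustration, not the patch from the thread: the names `push-state-struct` and `PushStateRecord` and the three fields are assumptions, and the real `push-state` has one field per push type.

```clojure
;; Hypothetical, cut-down field list; Clojush defines many more push types.
(defstruct push-state-struct :exec :integer :code)
(defrecord PushStateRecord [exec integer code])

;; Build one state of each kind with the same contents.
(def s (struct-map push-state-struct :exec '() :integer '(1 2) :code '()))
(def r (map->PushStateRecord {:exec '() :integer '(1 2) :code '()}))

;; Both kinds support keyword lookup and assoc, so most calling code
;; can stay unchanged after the switch.
(def s-ints (:integer s))
(def r-ints (:integer r))
(def r-pushed (assoc r :integer (cons 3 (:integer r))))
```

Because records answer to the same map operations as struct-maps, the change can be confined to the place where states are constructed; whether it yields the ~2x speedup reported here depends on the workload.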

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Tassilo Horn
"Wm. Josiah Erikson" writes: > Then run, for instance: /usr/bin/time -f %E lein run > clojush.examples.benchmark-bowling > > and then, when that has finished, edit > src/clojush/examples/benchmark_bowling.clj and uncomment > ":use-single-thread true" and run it again. I think this is a > succinct

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
I tried redefining the few places in the code (string_reverse, I think) that used reverse to use the same version of reverse that I got such great speedups with in your code, and it made no difference. There are not any explicit calls to conj in the code that I could find. On Wed, Dec 19, 2012 at

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Lee Spector
On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote: > I think this is a succinct, deterministic benchmark that clearly > demonstrates the problem and also doesn't use conj or reverse. Clarification: it's not just a tight loop involving reverse/conj, as our previous benchmark was. It's our

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
Whoops, sorry about the link. It should be able to be found here: http://gibson.hampshire.edu/~josiah/clojush/ On Wed, Dec 19, 2012 at 11:57 AM, Wm. Josiah Erikson wrote: > So here's what we came up with that clearly demonstrates the problem. Lee > provided the code and I tweaked it until I belie

Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly. I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.ta

Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector
On Dec 14, 2012, at 10:41 PM, cameron wrote: > Until Lee has a representative benchmark for his application it's difficult > to tell if he's > experiencing the same problem but there would seem to be a case for changing > the PersistentList > implementation in clojure.lang. We put together a ve

Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector
On Dec 15, 2012, at 1:14 AM, cameron wrote: > > Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java > for my GP work but for the last few years it's been GEVA with a clojure > wrapper I wrote (https://github.com/cdorrat/geva-clj). Ah yes -- I've actually downloaded and

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron
> > I'd be interested in seeing your GP system. The one we're using evolves > "Push" programs and I suspect that whatever's triggering this problem with > multicore utilization is stemming from something in the inner loop of my > Push interpreter (https://github.com/lspector/Clojush)... but I

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron
Thanks Herwig, I used your plugin with the following 2 burn variants: (defn burn-slow [& _] (count (last (take 1000 (iterate #(reduce conj '() %) (range 1)) (defn burn-fast [& _] (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 1)) Where conj* is just a

Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread Herwig Hochleitner
I've created a test harness for this as a leiningen plugin: https://github.com/bendlas/lein-partest You can just put :plugins [[net.bendlas/lein-partest "0.1.0"]] into your project and run lein partest your.ns/testfn 6 to run 6 threads/processes in parallel The plugin then runs the fu

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Lee Spector
On Dec 13, 2012, at 4:21 PM, cameron wrote: > > Have you made any progress on a small deterministic benchmark that reflects > your application's behaviour (i.e. the RNG seed work you were discussing)? I'm > keen to help, but I don't have time to look at benchmarks that take hours to > run. > >

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread cameron
On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote: > > Does this help? Should I do something else as well? I'm curious to try > running like, say 16 concurrent copies on the 48-way node > > Have you made any progress on a small deterministic benchmark that reflects

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Cool. I've requested a free trial. On Thu, Dec 13, 2012 at 1:53 PM, Andy Fingerhut wrote: > I'm not saying that I know this will help, but if you are open to trying a > different JVM that has had a lot of work done on it to optimize it for high > concurrency, Azul's Zing JVM may be worth a try, t

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Andy Fingerhut
I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM may be worth a try, to see if it increases parallelism for a single Clojure instance in a single JVM, with lots of t

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Ah. We'll look into running several clojures in one JVM too. Thanks. On Thu, Dec 13, 2012 at 1:41 PM, Wm. Josiah Erikson wrote: > OK, I did something a little bit different, but I think it proves the same > thing we were shooting for. > > On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom

Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of RAM. This is Tom's "Bowling" benchmark: 1. multithreaded. Average of 10 runs: 14:00.9 2. singlethreaded. Average of 10 runs: 23:35.3 3. singlethreaded,

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
See https://github.com/flatland/classlojure for a, nearly, ready-made solution to running several Clojures in one JVM. On Wed, Dec 12, 2012 at 5:20 PM, Lee Spector wrote: > > On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: > > Lee, while you are at benchmarking, would you mind running sev

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron
On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall Bockrath-Vandegrift wrote: > > cameron > writes: > > > the megamorphic call site hypothesis does sound plausible but I'm > > not sure where the following test fits in. > > ... > > > I was toying with the idea of replacing the Empt

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector
On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: > Lee, while you are at benchmarking, would you mind running several threads in > one JVM with one clojure instance per thread? Thus each thread should get > JITted independently. I'm not actually sure how to do that. We're starting runs wit

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. Christophe On Wed, Dec 12, 2012 at 4:11 PM, Lee Spector wrote: > > On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote: >

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector
On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote: > > Have you tried running your real application in a single thread in a JVM, and > then run multiple JVMs in parallel, to see if there is any speedup? If so, > that would again help determine whether it is multiple threads in a single > JVM

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Andy Fingerhut
Lee: I believe you said that your benchmarking code achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall. I confirmed that on my own machine as well. Have you tried runnin

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
cameron writes: > the megamorphic call site hypothesis does sound plausible but I'm > not sure where the following test fits in. ... > I was toying with the idea of replacing the EmptyList class with a > PersistentList instance to mitigate the problem > in at least one common case, however i

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut writes: > I'm not practiced in recognizing megamorphic call sites, so I could be > missing some in the example code below, modified from Lee's original > code. It doesn't use reverse or conj, and as far as I can tell > doesn't use PersistentList, either, only Cons. ... > Can you

Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron
Hi Marshall, the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. If I understand correctly we believe that it's the fact that the base case (a PersistentList$EmptyList instance) and the normal case (a PersistentList instance) have dif

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote: > So I think if you replace your calls to `reverse` and any `conj` loops > you have in your own code, you should see a perfectly reasonable > speedup. Tantalizing, but on investigation I see that our real application actually does

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
Hm. Interesting. For the record, the exact code I'm running right now that I'm seeing great parallelism with is this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur more (cons r acc)) acc))) (defn burn ([] (loop [i 0
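A small harness along the following lines can be used to compare serial and parallel runs of a cons-only reverse. It is a sketch under stated assumptions: `elapsed-ms` and `compare-map-pmap` are hypothetical helpers, not code from the thread, and the list size and task count are arbitrary.

```clojure
(defn reverse-recursively
  "Cons-only reverse, as quoted in the message above."
  [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))

(defn elapsed-ms
  "Hypothetical helper: wall-clock milliseconds taken to run thunk once."
  [thunk]
  (let [start (System/nanoTime)]
    (thunk)
    (/ (- (System/nanoTime) start) 1e6)))

(defn compare-map-pmap
  "Hypothetical helper: runs work-fn over n-tasks inputs with map and
  with pmap, and returns both timings plus their ratio."
  [work-fn n-tasks]
  (let [serial   (elapsed-ms #(dorun (map work-fn (range n-tasks))))
        parallel (elapsed-ms #(dorun (pmap work-fn (range n-tasks))))]
    {:map-ms serial :pmap-ms parallel :speedup (/ serial parallel)}))

;; Arbitrary sizes: 8 independent tasks, each reversing a 100000-element range.
(def result
  (compare-map-pmap (fn [_] (count (reverse-recursively (range 100000)))) 8))
```

A single un-warmed run like this is noisy; the thread's own numbers came from longer runs and, in places, from criterium, so treat the `:speedup` value only as a rough indicator.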

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
...and, suddenly, the high-core-count Opterons show us what we wanted and hoped for. If I increase that range statement to 100 and run it on the 48-core node, it takes 50 seconds (before it took 50 minutes), while the FX-8350 takes 3:31.89 and the 3770K takes 3:48.95. Thanks Marshall! I think you m

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Andy Fingerhut
Marshall: I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use reverse or conj, and as far as I can tell doesn't use PersistentList, either, only Cons. (defn burn-cons [size] (let [si

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7 3770K, when before it was very very much not so. So for some reason, this bug was tickled more dramatically on AMD multicore processors than on Intel ones. On Tue, Dec 11, 2012 at 2:54 PM, Wm. Josiah Erikson wrote: > OK W

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
OK WOW. You hit the nail on the head. It's "reverse" being called in a pmap that does it. When I redefine my own version of reverse (I totally cheated and just stole this) like this: (defn reverse-recursively [coll] (loop [[r & more :as all] (seq coll) acc '()] (if all (recur

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector writes: > If the application does lots of "list processing" but does so with a > mix of Clojure list and sequence manipulation functions, then one > would have to write private, list/cons-only versions of all of these > things? That is -- overstating it a bit, to be sure, but perhaps

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote: > >> Or have I missed a currently-available work-around among the many >> suggestions? > > You can specialize your application to avoid megamodal call sites in > tight loops. If you are working with `Cons`-order sequences, just u

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector writes: > Is the following a fair characterization pending further developments? > > If you have a cons-intensive task then even if it can be divided into > completely independent, long-running subtasks, there is currently no > known way to get significant speedups by running the subt

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Gary Johnson
Lee, My reading of this thread is not quite as pessimistic as yours. Here is my synthesis for the practical application developer in Clojure from reading and re-reading all of the posts above. Marshall and Cameron, please feel free to correct me if I screw anything up here royally. ;-) When

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote: > I’m not sure what the next steps are. Open a bug on the JVM? This is > something one can attempt to circumvent on a case-by-case basis, but > IMHO has significant negative implications for Clojure’s concurrency > story. I've gott

Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
"nicolas.o...@gmail.com" writes: > What happens if you run it a third time at the end? (The question > is related to the fact that there appear to be transition states > between monomorphic and megamorphic call sites, which might lead to > an explanation.) Same results, but your comment jog

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Interesting. I tried the following: :jvm-opts ["-Xmx10g" "-Xms10g" "-XX:+AggressiveOpts" "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" "-XX:TLABSize=1G" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+UseParNewGC" "-XX:+ResizeTLAB" "-XX:+UseTLAB"] I got a slight slowdown

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
"Wm. Josiah Erikson" writes: > Aha. Not only do I get a lot of "made not entrant", I get a lot of > "made zombie". However, I get this for both runs with map and with > pmap (and with pmapall as well) I’m not sure this is all that enlightening. From what I can gather, “made not entrant” just me

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
I tried some more performance tuning options in Java, just for kicks, and didn't get any advantages from them: "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" Also, in case it's informative: [josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt | wc -l 17

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well) For instance, from a pmapall run: 33752 159 clojure.lang.Cons::next (10 bytes) made zombie 33752 164

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread meteorfox
> > - Parallel allocation of `Cons` and `PersistentList` instances through > a Clojure `conj` function remains fast as long as the function only > ever returns objects of a single concrete type A possible explanation for this could be JIT Deoptimization. Deoptimization happens when

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
cameron writes: > There does seem to be something unusual about conj and > clojure.lang.PersistentList in this parallel test case and I don't > think it's related to the JVMs memory allocation. I’ve got a few more data-points, but still no handle on what exactly is going on. My last benchmark s

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marko Topolnik
The main GC feature here is the Thread-Local Allocation Buffers. They are on by default and are "automatically sized according to allocation patterns". The size can also be fine-tuned with the -XX:TLABSize=n configuration option. You may consider tweaking this setting to optimize runtime. Basic

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread cameron
Hi Marshall, I think we're definitely on the right track. If I replace the reverse call with the following function I get a parallel speedup of ~7.3 on an 8 core machine. (defn copy-to-java-list [coll] (let [lst (java.util.LinkedList.)] (doseq [x coll] (.addFirst lst x)) lst))

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
There's no magic here, everyone tuning their app hit this wall eventually, tweaking the JVM memory options :) Luc > > On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: > > > If the number of object allocation mentioned earlier in this thread are > > real, > > yes vm heap management can be a bott

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: > If the number of object allocation mentioned earlier in this thread are real, > yes vm heap management can be a bottleneck. There has to be some > locking done somewhere otherwise the heap would corrupt :) > > The other bottleneck can come from ga

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut writes: > My current best guess is the JVM's memory allocator, not Clojure code. I didn’t mean to imply the problem was in Clojure itself, but I don’t believe the issue is in the memory allocator either. I now believe the problem is in a class of JIT optimization HotSpot is perfo

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote: > > It’s like there’s a lock of some sort sneaking in on the `conj` path. > Any thoughts on what that could be? My current best guess is the JVM's memory allocator, not Clojure code. Andy -- You received this message because you

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut
On Dec 8, 2012, at 9:37 PM, Lee Spector wrote: > > On Dec 8, 2012, at 10:19 PM, meteorfox wrote: >> >> Now if you run vmstat 1 while running your benchmark you'll notice that the >> run queue will be most of the time at 8, meaning that 8 "processes" are >> waiting for CPU, and this is due to m

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
If the numbers of object allocations mentioned earlier in this thread are real, then yes, VM heap management can be a bottleneck. There has to be some locking done somewhere, otherwise the heap would be corrupted :) The other bottleneck can come from garbage collection, which has to freeze object allocation com

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
cameron writes: > Interesting problem, the slowdown seems to be caused by the reverse > call (actually the calls to conj with a list argument). Excellent analysis, sir! I think this points things in the right direction. > fast-reverse    : map-ms: 3.3, pmap-ms 0.7, speedup 4.97 > list-cons 

Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Jim - FooBar();
Hi Lee, Would it be difficult to try the following version of 'pmap'? It doesn't use futures but executors instead so at least this could help narrow the problem down... If the problem is due to the high number of futures spawned by pmap then this should fix it... (defn- with-thread-pool* [
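The idea of mapping over an executor instead of futures can be sketched as below. This is not Jim's `with-thread-pool*` code, which is cut off in the snippet above; `pool-map` is a hypothetical name, and the sketch assumes a fixed-size pool from `java.util.concurrent.Executors`.

```clojure
(import '(java.util.concurrent Executors ExecutorService Callable Future))

(defn pool-map
  "Hypothetical helper: applies f to every item of coll on a fixed-size
  thread pool and returns the results as a vector, in input order."
  [n-threads f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      (let [;; Clojure functions implement Callable, so each task is a thunk.
            tasks   (mapv (fn [x] ^Callable (fn [] (f x))) coll)
            ;; invokeAll blocks until every task has completed.
            futures (.invokeAll pool tasks)]
        (mapv (fn [^Future fut] (.get fut)) futures))
      (finally
        (.shutdown pool)))))

;; Usage: four worker threads, one task per input item.
(def squares (pool-map 4 (fn [x] (* x x)) (range 8)))
```

Unlike `pmap`, this is eager and uses exactly `n-threads` workers, which makes it easier to rule the number of spawned futures in or out as the cause of a slowdown.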

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron
I forgot to mention, I cut the number of reverse iterations down to 1000 (not 1) so I wouldn't have to wait too long for criterium, the speedup numbers are representative of the full test though. Cameron. On Sunday, December 9, 2012 6:26:16 PM UTC+11, cameron wrote: > > > Interesting probl

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron
Interesting problem, the slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument). Calling conj in a multi-threaded environment seems to have a significant performance impact when using lists. I created some alternate reverse implementations (the fastes

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 10:19 PM, meteorfox wrote: > > Now if you run vmstat 1 while running your benchmark you'll notice that the > run queue will be most of the time at 8, meaning that 8 "processes" are > waiting for CPU, and this is due to memory accesses (in this case, since this > is not true

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote: > > Yep, reducers don't use lazy seqs. But they just return something like > transformed functions that will be applied when building the collection. So > you can use them like this: > > (into [] (r/map burn (doall (range 4) > > See > http:
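A minimal sketch of the point quoted above, assuming Clojure 1.5 or later with `clojure.core.reducers` aliased as `r`. `slow-square` is a hypothetical stand-in for `burn`, not the function from the thread. It illustrates why timing a bare `r/map` call measures almost nothing: the mapping function only runs once the result is reduced, for example by `into`.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical stand-in for `burn`: counts how often it is called.
(def calls (atom 0))

(defn slow-square [x]
  (swap! calls inc)
  (* x x))

;; r/map returns a reducible wrapper; no element has been processed yet.
(def pending (r/map slow-square [1 2 3 4]))
(println "calls after r/map:" @calls)

;; Reducing the wrapper (here via into) is what actually runs slow-square.
(def result (into [] pending))
(println "result:" result "calls after into:" @calls)
```

A fair comparison with `(doall (map ...))` therefore has to time the `into` (or `r/fold`) step, not the `r/map` call alone.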

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Correction regarding the run queue: what I said is not completely correct :S. But the points about stalled cycles and memory accesses still hold. Sorry for the misinformation. On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: > > > I've been running compute intensive (multi-day), highly parallelizable >

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee: I ran Linux perf and also watched the run queue (with vmstat), and your bottleneck is basically memory access. The CPUs are idle 80% of the time because of stalled cycles. Here's what I got on my machine: Intel Core i7, 4 cores with Hyper-Threading (8 virtual processors); 16 GiB of memory; Oracle JVM an

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee: So I ran On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: > > > I've been running compute intensive (multi-day), highly parallelizable > Clojure processes on high-core-count machines and blithely assuming that > since I saw near maximal CPU utilization in "top" and the like that I

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marek Šrank
> Just tried, my first foray into reducers, but I must not be understanding > something correctly: > > (time (r/map burn (doall (range 4 > > returns in less than a second on my macbook pro, whereas > > (time (doall (map burn (range 4 > > takes nearly a minute. > > This feels lik

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
I'm glad somebody else can duplicate our findings! I get results similar to this on Intel hardware. On AMD hardware, the disparity is bigger, and multiple threads of a single JVM invocation on AMD hardware consistently give me slowdowns compared to a single thread. Also, your results are on

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
One more possibility to consider: Single-threaded versions are more likely to keep the working set in the processor's largest cache, whereas parallel versions that use N times the working set for N times the parallelism can cause that same cache to thrash to main memory. Andy

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote: > > I'm hoping you realize that (take 1 (iterate reverse value)) is reversing > a linked list 1 times, each time allocating 1 cons cells (or > Clojure's equivalent of a cons cell)? For a total of around 100,000,000 > memory allocat

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: > The test: I wrote a time-consuming function that just does a bunch of math > and list manipulation (which is what takes a lot of time in my real > applications): > > (defn burn > ([] (loop [i 0 > value '()] >(if (>= i 1

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
I haven't analyzed your results in detail, but here are some results I had on my 2GHz 4-core Intel Core i7 MacBook Pro vintage 2011. When running multiple threads within a single JVM invocation, I never got a speedup of even 2. The highest speedup I measured was 1.82, when I ran 8 threa

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Andy: The short answer is yes, and we saw huge speedups. My latest post, as well as Lee's, has details. On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote: > > > On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: > > > > > Another strange observation is that we can run multiple inst

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved r

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 1:28 PM, Paul deGrandis wrote: > My experiences in the past are similar to the numbers that Jim is reporting. > > I have recently been centering most of my crunching code around reducers. > Is it possible for you to cook up a small representative test using > reducers+fork/joi

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Paul deGrandis
My experiences in the past are similar to the numbers that Jim is reporting. I have recently been centering most of my crunching code around reducers. Is it possible for you to cook up a small representative test using reducers+fork/join (and potentially primitives in the intermediate steps)? Pe

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote: > > Although it doesn’t impact your benchmark, `pmap` may be further > adversely affecting the performance of your actual program. There’s an > open bug regarding `pmap` and chunked seqs: > >http://dev.clojure.org/jira/browse/CL

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote: > > > When you say "we can run multiple instances of the test on the same machine", > do you mean that, for example, on an 8 core machine you run 8 different JVMs > in parallel, each doing a single-threaded 'map' in your Clojure code and not >

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marshall Bockrath-Vandegrift
Lee Spector writes: > I'm also aware that the test that produced the data I give below, > insofar as it uses pmap to do the distribution, may leave cores idle > for a bit if some tasks take a lot longer than others, because of the > way that pmap allocates cores to threads. Although it doesn’t i

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Jim - FooBar();
Even though this is very surprising (and sad) to hear, I'm afraid my experience has been different... My reducer-based parallel minimax is about 3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit faster on my girlfriend's Intel i5 (2 physical cores + 2 virtual). I'm suspecti
