Re: abysmal multicore performance, especially on AMD processors
As a person who has recently been dabbling with Clojure for evaluation purposes, I wondered if anybody wanted to post some links about parallel Clojure apps that have been clear and easy parallelism wins for the types of applications that Clojure was designed for. (To contrast the lengthy discussion and analysis of this topic that is *hopefully* the exception and not the rule.) Any good links here? Any high-profile stuff?

--
You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com. Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: abysmal multicore performance, especially on AMD processors
Hi, I believe Clojure's original mission has been to give you tools for handling concurrency[1] in your programs in a sane way. However, with the advent of Reducers[2], the landscape is changing quite a bit. If you're interested in the concurrency vs. parallelism terminology and what language constructs supporting them are available, [3] should give you a very good overview. I understand I haven't actually answered your question, but I wasn't sure whether you were referring to concurrency or parallelism. :)

Las

[1] http://clojure.org/concurrent_programming
[2] http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
[3] http://clojure-doc.org/articles/language/concurrency_and_parallelism.html

2013/11/6 Dave Tenny dave.te...@gmail.com: [...]
--
László Török
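Reducers[2] build parallel collection processing on top of the JVM's fork/join framework. For readers coming from Java, here is a rough sketch of the same fold-and-combine shape, using Java parallel streams rather than the reducers API itself (this is my own illustration, not code from the thread; the class and method names are made up):

```java
import java.util.stream.LongStream;

public class ParallelFold {
    // Sum of squares computed as a parallel reduce. The stream is split into
    // chunks, each chunk is reduced independently, and the partial results are
    // merged with the combiner (Long::sum) - the same shape as reducers' fold.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                         .parallel()
                         .map(x -> x * x)
                         .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        // n(n+1)(2n+1)/6 for n = 1000
        System.out.println(sumOfSquares(1000));
    }
}
```

As with reducers' `fold`, the reduction function must be associative so that chunk results can be combined in any order.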
Re: abysmal multicore performance, especially on AMD processors
2013/11/6 Dave Tenny dave.te...@gmail.com:

(To contrast the lengthy discussion and analysis of this topic that is *hopefully* the exception and not the rule)

Some of the comments reveal that part of the problem lies with the JVM memory allocator, which has its throughput limits. There are known large commercial systems written entirely in Clojure that demonstrate excellent latency (and, likely, throughput). I believe Factual, Runa/Staples Labs and Prismatic posted various numbers here, but I don't have the links handy. Feel free to start a separate thread to collect some feedback.

--
MK

http://github.com/michaelklishin
http://twitter.com/michaelklishin
Re: abysmal multicore performance, especially on AMD processors
You should also specify how many cores you plan on devoting to your application. Notice that most of this discussion has been about JVM apps running on machines with 32 cores. Systems like this aren't exactly common in my line of work (where we tend to run greater numbers of smaller servers using virtualization), but perhaps they are in yours.

Timothy Baldridge

On Wed, Nov 6, 2013 at 8:45 AM, Michael Klishin michael.s.klis...@gmail.com wrote: [...]
--
“One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.” (Robert Firth)
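On the question of how many cores an application should plan for: the JVM can report how many logical processors it sees, which is the usual starting point for sizing thread pools. A minimal snippet (my own illustration, not from this thread):

```java
public class CoreCount {
    public static void main(String[] args) {
        // Number of logical processors visible to the JVM. Note this counts
        // hardware threads, not physical cores, and may differ from what
        // /proc/cpuinfo groups as "cpu cores" on multi-socket machines.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("logical cores: " + cores);
    }
}
```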
Re: abysmal multicore performance, especially on AMD processors
Interesting! If that is true of Java (I don't know Java at all), then your argument seems plausible. Cache-to-main-memory writes still take many more CPU cycles (an order of magnitude more, last I knew) than processor-to-cache. I don't think it's so much a bandwidth issue as latency, AFAIK. Thanks for thinking about this more, so long after the fact. We still see the issue.

On Sep 26, 2013 11:43 PM, Andy Fingerhut andy.finger...@gmail.com wrote: [...]
Re: abysmal multicore performance, especially on AMD processors
The Disruptor project from LMAX has wrestled with these sorts of issues at length and achieved astounding levels of performance on the JVM. Martin Thompson, the original author of the Disruptor, is a leading light in the JVM performance space; his Mechanical Sympathy blog is a goldmine of information and a must-read for anyone wanting to understand the JVM/hardware interface.

Find more info at:

http://mechanical-sympathy.blogspot.co.uk/
http://lmax-exchange.github.io/disruptor/

*Neale Swinnerton*
{t: @sw1nn https://twitter.com/#!/sw1nn, w: sw1nn.com }

On 27 September 2013 12:29, Wm. Josiah Erikson wmjos...@gmail.com wrote: [...]
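One recurring theme from the mechanical-sympathy material is that contended or falsely-shared cache lines destroy multicore scaling. A small self-contained illustration (my own sketch, not Disruptor code): two hot counters on the same cache line ping-pong between cores, while padded counters each get their own line.

```java
public class FalseSharing {
    // Two hot fields that will typically land on the same cache line.
    static class Shared {
        volatile long a;
        volatile long b;
    }

    // The same fields separated by padding so each gets its own 64-byte
    // line. (The JVM may reorder fields; @jdk.internal.vm.annotation.Contended
    // is the robust way, but plain padding usually works without JVM flags.)
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }

    // Run two tasks on two threads and return elapsed wall time in ms.
    static long time(Runnable r1, Runnable r2) throws InterruptedException {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 10_000_000;
        Shared s = new Shared();
        Padded p = new Padded();
        long shared = time(() -> { for (int i = 0; i < N; i++) s.a++; },
                           () -> { for (int i = 0; i < N; i++) s.b++; });
        long padded = time(() -> { for (int i = 0; i < N; i++) p.a++; },
                           () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.println("same line: " + shared + " ms, padded: " + padded + " ms");
    }
}
```

On most multi-core hardware the padded version is markedly faster, though the exact ratio varies by CPU and JVM.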
Re: abysmal multicore performance, especially on AMD processors
Adding to this thread from almost a year ago. I don't have conclusive proof with experiments to show right now, but I do have some experiments that have led me to what I think is a plausible cause of not just Clojure programs running more slowly when multi-threaded than when single-threaded, but any programs running under the JVM's memory model doing so.

Qualification: My explanation would be true only for multi-threaded programs running on the JVM that store significant amounts of data in memory, even if the data written by each thread is only read by that thread, and there is no locking or other inter-thread communication, i.e. for embarrassingly parallel problems.

Outline of the argument: When you start a thread, and then wait for the thread to complete, the JVM memory model requires all loads and stores to satisfy certain restrictions. One of these is that any store done before the thread is created should 'happen before' the thread start, and thus the updated stored values must be visible to the new thread. 'Visible' here means that the thread doing the store must cause the CPU it is running on to update main memory from whatever locally modified values it has written into its local cache. That rule isn't so relevant to my argument. The one that is relevant is that any store performed by the thread is considered to 'happen before' a join operation on the thread. Thus any store done by a thread must be written back to main memory, *even if the store is to a JVM object that later becomes garbage*.

So imagine a single-threaded program that creates X bytes of garbage while it runs. Those X bytes will definitely be written to the CPU's local cache, but they will only be written to main memory if the cache space runs out before the garbage collector does its work and allows that memory to be reused for allocations. The CPU-to-local-cache bandwidth in many modern systems is significantly higher than local-cache-to-main-memory bandwidth.

Now take that same program and spread its work across 2 or more threads, with a join at the end of each one. For the sake of example, say that each thread will write X/N bytes of data while it runs. Even if the only data needed later in the rest of the program is a single Long object, for example, all of those X/N bytes of data will be copied from the local cache to main memory (if that did not already happen before the thread terminated). If the number of threads is large enough, the amount of data written from all local caches to main memory can be higher in the multi-threaded case than in the single-threaded case.

Anyway, that is my hypothesis about what could be happening here. It isn't Clojure-specific, but it can be exacerbated by the common behavior of a lot of Clojure code to allocate significant amounts of memory that becomes garbage.

Andy

On Wed, Jan 30, 2013 at 6:20 PM, Lee Spector lspec...@hampshire.edu wrote:

FYI we had a bit of a discussion about this at a meetup in Amherst MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little followup since the discussion, including results of some different experiments by Chas Emerick, at: http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382 -Lee

On Jan 30, 2013, at 8:39 PM, Marshall Bockrath-Vandegrift wrote:

Apologies for my very slow reply here. I keep thinking that I’ll have more time to look into this issue, and keep having other things requiring my attention. And on top of that, I’ve temporarily lost the many-way AMD system I was using as a test-bed. I very much want to see if I can get my hands on an Intel system to compare to.

My AMD system is in theory 32-way – two physical CPUs, each with 16 cores. However, Linux reports (via /proc/cpuinfo) the cores in groups of 8 (“cpu cores : 8” etc). And something very strange happens when extending parallelism beyond 8-way... I ran several experiments using a version of your whole-application benchmark I modified to control the level of parallelism. At parallelism 9+, the real time it takes to complete the benchmark hardly budges, but the user/CPU time increases linearly with the level of parallelism!

As far as I can tell, multi-processor AMD *is* a NUMA architecture, which might potentially explain things. But enabling the JVM NUMA options doesn’t seem to affect the benchmark. I think next steps are two-fold: (1) examine parallelism vs. real CPU time on an Intel system, and (2) attempt to reproduce the observed behavior in pure Java. I’m keeping my fingers crossed that I’ll have some time to look at this more soon, but I’m honestly not very hopeful. In the mean time, I hope you’ve managed to exploit multi-process parallelism to run more efficiently?

-Marshall
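The shape of the workload Andy describes can be made concrete with a minimal Java sketch (my own illustration, not code from this thread; the class name and allocation counts are made up): each worker allocates a large amount of short-lived boxed garbage, returns only a single long, and is joined by the main thread. Under the JMM, the join is the happens-before edge that forces the worker's stores to become visible.

```java
// Minimal sketch of the embarrassingly parallel, garbage-heavy workload
// discussed above. Each thread allocates many short-lived boxed values,
// keeps only a running sum, and is joined at the end.
public class GarbageBurn {
    static long burnOnce(int allocations) {
        Long sum = 0L;                 // boxed on purpose: every update allocates
        for (int i = 0; i < allocations; i++) {
            Long boxed = Long.valueOf(i % 128) + 1;  // fresh short-lived objects
            sum += boxed;
        }
        return sum;                    // only a single long survives the loop
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        final int perThread = 1_000_000;
        long[] results = new long[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int idx = t;
            workers[t] = new Thread(() -> results[idx] = burnOnce(perThread));
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();  // happens-before edge: all of the worker's stores
        }              // must be visible to this thread here, garbage included
        long total = 0;
        for (long r : results) total += r;
        System.out.println(nThreads + " threads, total=" + total);
    }
}
```

Per Andy's hypothesis, comparing wall-clock time at 1, 2, and N threads on such a skeleton (with per-thread work scaled so total allocation is constant) is one way to look for the cache-writeback effect, though real measurements would need to control for GC configuration and heap size.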
multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)
Keeping the discussion here would make sense, esp. in light of meetup.com's horrible discussion board. I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I wrote in that meetup thread, but Lee's challenge(s) were too hard to resist:

Would your conclusion be something like 'Intensive list processing can't currently be done in Java (or Clojure/JVM) in a way that takes reasonable advantage of multiple cores.'? I see that a rewrite without lists might be the only way out, but that'd be a bummer. Clojure is a multicore Lisp; if you can't use it for multicore list processing then... sigh.

The nature of the `burn` program is such that I'm skeptical of the ability of any garbage-collected runtime (lispy or not) to scale its operation across multiple threads. It generates a huge amount of garbage that is held for a very long time (in GC terms), thus producing a great deal of contention on the heap between the active job threads and the GC implementation constantly having to clean up after them, compact the results, etc. The workload is not CPU-bound, so I don't see it being parallelized effectively. I'd be very interested in seeing an implementation for a runtime that proves me wrong, i.e. can attain significant performance benefits by running the e.g. 8 `burn` jobs across multiple threads.

In an attempt to do just that (prove myself wrong), I thought I'd golf the `burn` program again (onlookers can go look at the Java implementation I knocked out a few days ago here: https://gist.github.com/4682554) on Racket, presumably as well-suited to multicore list processing as any other. The results: https://gist.github.com/4682453

Please forgive my perhaps pitiful Racket-fu. I'm sure there are threadpool and CAS libraries available, but it was entertaining to hack away with the threading and semaphore primitives. If I've goofed something up badly (it's been some time since I've schemed), please speak up.
The Racket and Java implementations are effectively equivalent:

`java -server -Xmx2g cemerick.burn $THREAD_COUNT` (oracle java 1.7.0_09)
1 thread: 61s
2 threads: 126s

`java -server -Xmx2g cemerick.burn $THREAD_COUNT` (oracle java 1.7.0_09)
1 thread: 47s
2 threads: 92s

`time ./racket -t ~/burn.scm -m $THREAD_COUNT` (racket 5.3.1)
1 thread: 45s
2 threads: 126s

The above was run on OS X 10.7.5 with 4GB of physical memory. If someone knows of ways to get better perf on Racket, let me know. I tried running from the results of `raco make` and such, but saw no improvements; I guess it is strictly producing bytecode that is later JIT'ed, and not doing any kind of additional AOT optimizations...

It'd be interesting to see timings from different operating systems, and very interesting to see timings from running burn.scm on other Schemes, or on a Common Lisp implementation.

Cheers,

- Chas

On Jan 30, 2013, at 9:20 PM, Lee Spector wrote: [...]
Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)
Chas Emerick c...@cemerick.com writes:

Keeping the discussion here would make sense, esp. in light of meetup.com's horrible discussion board.

Excellent. Solves the problem of deciding the etiquette of jumping on the meetup board for a meetup one has never been involved in. :-)

The nature of the `burn` program is such that I'm skeptical of the ability of any garbage-collected runtime (lispy or not) to scale its operation across multiple threads.

Bringing you up to speed on this very long thread, I don’t believe the original `burn` benchmark is a GC issue, due to results cameron first reported here: https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ

I narrowed that down to what I still believe to be the explanation here: https://groups.google.com/d/msg/clojure/48W2eff3caU/jd8dpmzEtEYJ

And have more compact demonstration benchmark results here: https://groups.google.com/d/msg/clojure/48W2eff3caU/tCCkjXxTUMEJ

I haven’t been able to reproduce those results in a minimal Java-only test case, though.

Then Wm. Josiah posted a full-application benchmark, which appears to have entirely different performance problems from the synthetic `burn` benchmark. I’d rejected GC as the cause for the slowdown there too, but ATM can’t recall why or what I tested, so GC may definitely be a candidate to re-examine: https://groups.google.com/d/msg/clojure/48W2eff3caU/K224Aqwkn5YJ

Quite looking forward to additional insight...

-Marshall
Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote: [...]

Yeah, the thread is far too large for me to reasonably digest (and there's inevitably a lot of noise in rounds of microbenchmarking golf ;-)

BTW, I just realized I copy/pasted the wrong command for the faster of the Java implementation benchmarks in the previous mail; it used -XX:-UseParallelGC: `java -server -Xmx2g -XX:-UseParallelGC cemerick.burn $THREAD_COUNT`

I've not looked at clojush at all; it is AFAICT a complex application in its own right, and I wouldn't know where to start. My understanding is that the `burn` function is considered representative of the bulk of operations in clojush, but I'm happy to assume that that's not the case.
There's likely a number of things that go into the results of the `burn` benchmark, never mind the apparent performance of clojush. Here's what I know:

* Two separate implementations (Java and Racket) exhibit very similar multicore performance characteristics. In particular, the Java implementation would not suffer from any particular megamorphic inefficiencies, since no such calls are made in that program. Racket is far more of a black box to me than the JVM, but it does not provide much of any polymorphism at all, so JIT-related issues are presumably not in scope there.

* Broadly speaking, heap allocation and its evil twin, GC, sabotage parallelism and performance in general. Functional programming with immutable values has been made possible in large part by the gains registered by persistent data structures, structural sharing, lazy evaluation, and so on; without them, we're back to copy-on-write, which is roughly what `burn` represents in grand style.

Now, all this digital ink spilled, and just for kicks, I go back to run the Clojure `burn` again (https://gist.github.com/4683413 in a 1.5.0-RC2 REPL started with JVM args of `-Xmx2g -XX:-UseParallelGC`):

1 thread: 38s
2 threads: 32s
4 threads: 23s

Hardly a 2x or 4x improvement, but I had previously obtained badly degrading timings corresponding to those in my previous mail. Microbenchmarking sucks. At least I got to twiddle with Racket again. :-P

Cheers,

- Chas
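The structural-sharing vs. copy-on-write contrast above can be illustrated concretely. With an immutable linked list, prepending shares the entire existing tail; copy-on-write duplicates everything. A tiny Java sketch (my own illustration with made-up names, not from the thread; requires a JDK with records):

```java
import java.util.Arrays;

public class SharingDemo {
    // Minimal immutable cons cell: prepending shares the whole tail.
    record Cons(int head, Cons tail) {}

    static Cons prepend(int x, Cons list) {
        return new Cons(x, list);             // O(1): no copying, tail is shared
    }

    static int[] prependCopyOnWrite(int x, int[] arr) {
        int[] copy = new int[arr.length + 1]; // O(n): the whole array is duplicated
        copy[0] = x;
        System.arraycopy(arr, 0, copy, 1, arr.length);
        return copy;
    }

    public static void main(String[] args) {
        Cons xs = prepend(1, prepend(2, null));
        Cons ys = prepend(0, xs);
        System.out.println(ys.tail() == xs);  // true: structural sharing, zero garbage
        System.out.println(Arrays.toString(prependCopyOnWrite(0, new int[]{1, 2})));
    }
}
```

The copy-on-write path allocates a fresh O(n) object on every "update", which is the allocation pattern that makes `burn`-style workloads so hard on the GC.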
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)
On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote:

Then Wm. Josiah posted a full-application benchmark, which appears to have entirely different performance problems from the synthetic `burn` benchmark. I’d rejected GC as the cause for the slowdown there too, but ATM can’t recall why or what I tested, so GC may definitely be a candidate to re-examine: I've not looked at clojush at all; it is AFAICT a complex application in its own right, and I wouldn't know where to start. My understanding is that the `burn` function is considered representative of the bulk of operations in clojush, but I'm happy to assume that that's not the case. There are likely a number of things that go into the results of the `burn` benchmark, never mind the apparent performance of clojush.

FWIW I wrote the burn benchmark because lots of intensive list manipulation is at the heart of our real applications, and that seemed like a nicely minimal way to see how that could scale across cores. Our real applications may indeed raise other issues too, but getting lots of intensive list manipulation to scale well is bound to be good.

FWIW the full-application benchmark that Josiah posted was made deterministic in a way that means that big chunks of our code won't actually be executing. The system generates and interprets lots of Push programs, and if I recall correctly the benchmark hardcodes all of the Push programs to be the same constant program, meaning that not all of the Push instructions will run, etc. It'd be trickier to get the real mix of instructions in a way that makes for an easily reproducible benchmark, and this benchmark is at least testing the basic interpreter and population-level infrastructure. If we can get this to scale reasonably then we could try to see how much that helps under more-completely real conditions.

-Lee
Re: abysmal multicore performance, especially on AMD processors
Wm. Josiah Erikson wmjos...@gmail.com writes:

Am I reading this right that this is actually a Java problem, and not clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the industry? Thanks for the patches though - it's nice to see some improvement... I'll be fascinated to see how this turns out in the end. Have we found a large Java bug?

Apologies for my very slow reply here. I keep thinking that I’ll have more time to look into this issue, and keep having other things requiring my attention. And on top of that, I’ve temporarily lost the many-way AMD system I was using as a test-bed. I very much want to see if I can get my hands on an Intel system to compare to.

My AMD system is in theory 32-way – two physical CPUs, each with 16 cores. However, Linux reports (via /proc/cpuinfo) the cores in groups of 8 (“cpu cores : 8” etc). And something very strange happens when extending parallelism beyond 8-way... I ran several experiments using a version of your whole-application benchmark I modified to control the level of parallelism. At parallelism 9+, the real time it takes to complete the benchmark hardly budges, but the user/CPU time increases linearly with the level of parallelism!

As far as I can tell, multi-processor AMD *is* a NUMA architecture, which might potentially explain things. But enabling the JVM NUMA options doesn’t seem to affect the benchmark.

I think the next steps are two-fold: (1) examine parallelism vs. real CPU time on an Intel system, and (2) attempt to reproduce the observed behavior in pure Java. I’m keeping my fingers crossed that I’ll have some time to look at this more soon, but I’m honestly not very hopeful. In the mean time, I hope you’ve managed to exploit multi-process parallelism to run more efficiently?

-Marshall
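[Editor's note: Marshall mentions that "enabling the JVM NUMA options doesn’t seem to affect the benchmark" without showing the invocation. For readers wanting to repeat the experiment, a hypothetical command might look like the following; the flag names are real HotSpot/Linux options, but the heap size and jar name are placeholders, not from this thread:]

```shell
# -XX:+UseNUMA enables NUMA-aware heap allocation (effective with the
# parallel collector); numactl pins the JVM to a single NUMA node so
# cross-node memory traffic can be ruled out as a variable.
numactl --cpunodebind=0 --membind=0 \
  java -server -Xmx2g -XX:+UseParallelGC -XX:+UseNUMA -jar benchmark.jar
```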
Re: abysmal multicore performance, especially on AMD processors
FYI we had a bit of a discussion about this at a meetup in Amherst MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little followup since the discussion, including results of some different experiments by Chas Emerick, at: http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382

-Lee

On Jan 30, 2013, at 8:39 PM, Marshall Bockrath-Vandegrift wrote:

Apologies for my very slow reply here. I keep thinking that I’ll have more time to look into this issue, and keep having other things requiring my attention. And on top of that, I’ve temporarily lost the many-way AMD system I was using as a test-bed. I very much want to see if I can get my hands on an Intel system to compare to.

My AMD system is in theory 32-way – two physical CPUs, each with 16 cores. However, Linux reports (via /proc/cpuinfo) the cores in groups of 8 (“cpu cores : 8” etc). And something very strange happens when extending parallelism beyond 8-way... I ran several experiments using a version of your whole-application benchmark I modified to control the level of parallelism. At parallelism 9+, the real time it takes to complete the benchmark hardly budges, but the user/CPU time increases linearly with the level of parallelism!

As far as I can tell, multi-processor AMD *is* a NUMA architecture, which might potentially explain things. But enabling the JVM NUMA options doesn’t seem to affect the benchmark.

I think the next steps are two-fold: (1) examine parallelism vs. real CPU time on an Intel system, and (2) attempt to reproduce the observed behavior in pure Java. I’m keeping my fingers crossed that I’ll have some time to look at this more soon, but I’m honestly not very hopeful. In the mean time, I hope you’ve managed to exploit multi-process parallelism to run more efficiently?
-Marshall
Re: abysmal multicore performance, especially on AMD processors
Josiah mentioned requesting a free trial of the Zing JVM. Did you ever get access to that, and were you able to try running your code on it? Again, I have no direct experience with their product to guarantee you better results -- just that I've heard good things about their ability to handle concurrent workloads with a different GC than most JVMs.

Andy

On Jan 30, 2013, at 6:20 PM, Lee Spector wrote:

FYI we had a bit of a discussion about this at a meetup in Amherst MA yesterday, and while I'm not sufficiently on top of the JVM or system issues to have briefed everyone on all of the details, there has been a little followup since the discussion, including results of some different experiments by Chas Emerick, at: http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382

-Lee

On Jan 30, 2013, at 8:39 PM, Marshall Bockrath-Vandegrift wrote:

Apologies for my very slow reply here. I keep thinking that I’ll have more time to look into this issue, and keep having other things requiring my attention. And on top of that, I’ve temporarily lost the many-way AMD system I was using as a test-bed. I very much want to see if I can get my hands on an Intel system to compare to.

My AMD system is in theory 32-way – two physical CPUs, each with 16 cores. However, Linux reports (via /proc/cpuinfo) the cores in groups of 8 (“cpu cores : 8” etc). And something very strange happens when extending parallelism beyond 8-way... I ran several experiments using a version of your whole-application benchmark I modified to control the level of parallelism. At parallelism 9+, the real time it takes to complete the benchmark hardly budges, but the user/CPU time increases linearly with the level of parallelism!

As far as I can tell, multi-processor AMD *is* a NUMA architecture, which might potentially explain things. But enabling the JVM NUMA options doesn’t seem to affect the benchmark.

I think the next steps are two-fold: (1) examine parallelism vs. real CPU time on an Intel system, and (2) attempt to reproduce the observed behavior in pure Java. I’m keeping my fingers crossed that I’ll have some time to look at this more soon, but I’m honestly not very hopeful. In the mean time, I hope you’ve managed to exploit multi-process parallelism to run more efficiently?

-Marshall
Re: abysmal multicore performance, especially on AMD processors
Am I reading this right that this is actually a Java problem, and not clojure-specific? Wouldn't the rest of the Java community have noticed this? Or maybe massive parallelism in this particular way isn't something commonly done with Java in the industry? Thanks for the patches though - it's nice to see some improvement... I'll be fascinated to see how this turns out in the end. Have we found a large Java bug?

On Sun, Dec 30, 2012 at 9:28 PM, cameron cdor...@gmail.com wrote:

I've posted a patch with some changes here (https://gist.github.com/4416803); it includes the record change and a small change to interpret-instruction, so the benchmark runs at 2x the default, as it did for Marshall. The patch also modifies the main loop to use a thread pool instead of agents and allows you to set the number of threads; this might help in diagnosing the parallel performance issue. On the modified benchmark I'm seeing a ~4x speedup with the parallel version on an 8-core machine, and the profiler reports that the parallel version is using twice as much CPU time.

I also had another look at the native-calls issue and modified the Clojure runtime to avoid most of the calls the profiler said were taking significantly more time in the parallel version. It did speed things up, but only by ~6%, not the large margin the profiling results had led me to believe was possible; it looks like the profiler overstates these methods' times. The modified Clojure 1.5 is available here if anyone's interested: https://github.com/cdorrat/clojure/commit/dfb5f99eb5d0a45165978e079284bab1f25bd79f

YourKit is reporting that a number of clojure.core functions are taking longer in the parallel version than the serial one, and they all seem to be methods that have one or more instanceof or instance? calls, but given the results above I'm not sure how much weight to give this. It seems the elephant is still in the room, responsible for ~50% of the CPU time :)

Cameron.
Re: abysmal multicore performance, especially on AMD processors
I've posted a patch with some changes here (https://gist.github.com/4416803); it includes the record change and a small change to interpret-instruction, so the benchmark runs at 2x the default, as it did for Marshall. The patch also modifies the main loop to use a thread pool instead of agents and allows you to set the number of threads; this might help in diagnosing the parallel performance issue. On the modified benchmark I'm seeing a ~4x speedup with the parallel version on an 8-core machine, and the profiler reports that the parallel version is using twice as much CPU time.

I also had another look at the native-calls issue and modified the Clojure runtime to avoid most of the calls the profiler said were taking significantly more time in the parallel version. It did speed things up, but only by ~6%, not the large margin the profiling results had led me to believe was possible; it looks like the profiler overstates these methods' times. The modified Clojure 1.5 is available here if anyone's interested: https://github.com/cdorrat/clojure/commit/dfb5f99eb5d0a45165978e079284bab1f25bd79f

YourKit is reporting that a number of clojure.core functions are taking longer in the parallel version than the serial one, and they all seem to be methods that have one or more instanceof or instance? calls, but given the results above I'm not sure how much weight to give this. It seems the elephant is still in the room, responsible for ~50% of the CPU time :)

Cameron.

On Saturday, December 22, 2012 10:57:28 AM UTC+11, Marshall Bockrath-Vandegrift wrote:

Lee Spector lspe...@hampshire.edu writes:

FWIW I used records for push-states at one point but did not observe a speedup, and it required much messier code, so I reverted to struct-maps. But maybe I wasn't doing the right timings. I'm curious about how you changed to records without the messiness. I'll include below my sig the way that I had to do it... maybe you can show me what you did instead.
I just double-checked, and I definitely see a 2x speedup on Josiah’s benchmark. That may still be synthetic, of course. Here’s what I did:

(eval `(defrecord ~'PushState [~'trace ~@(map (comp symbol name) push-types)]))

(let [empty-state (map->PushState {})]
  (defn make-push-state
    "Returns an empty push state."
    []
    empty-state))

Still, I guess the gorilla in the room, which is eating the multicore performance, hasn't yet been found.

No, not yet... I’ve become obsessed with figuring it out though, so still slogging at it.

-Marshall
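[Editor's note: Cameron's change from agents to a thread pool with an explicit thread count is only described above, not shown. A minimal sketch of that idea, assuming nothing about the actual patch (`pmap-fixed` is a hypothetical name, not clojush code), might look like this:]

```clojure
(import '(java.util.concurrent Executors Callable))

;; Evaluate f over coll on a fixed-size pool: like pmap, but with an
;; explicit thread count; blocks until all results are available.
(defn pmap-fixed [n-threads f coll]
  (let [pool    (Executors/newFixedThreadPool n-threads)
        ;; Clojure fns implement Callable, so submit accepts them directly.
        futures (mapv (fn [x] (.submit pool ^Callable (fn [] (f x)))) coll)]
    (try
      (mapv (fn [fut] (.get fut)) futures)   ; preserves input order
      (finally (.shutdown pool)))))
```

This makes the level of parallelism a parameter of the benchmark run, which is exactly what Marshall's later parallelism-vs-CPU-time experiments require.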
Re: abysmal multicore performance, especially on AMD processors
Hi Lee,

I've done some more digging and seem to have found the root of the problem: it seems that Java native methods are much slower when called in parallel. The following code illustrates the problem:

(letfn [(time-native [f]
          (let [c (class [])]
            (time (dorun (f (fn [_] (.isArray c)) (range 1000))))))]
  (println "Sequential Test:")
  (time-native map)
  (println "Parallel Test:")
  (time-native pmap))

On a dual quad-core Xeon box I get the following results:

Sequential Test: Elapsed time: 1054.807725 msecs
Parallel Test: Elapsed time: 15302.605697 msecs

i.e. the code executed in parallel was 15 times slower than the sequential version. The same can be seen with the isInstance and isArray members of java.lang.Class. I'm not sure where to go from here; these functions are frequently used by clojure.core. Perhaps someone with more JVM implementation knowledge can help?

Cameron.
Re: abysmal multicore performance, especially on AMD processors
In that case isn't context switching dominating your test? .isArray isn't expensive enough to warrant the use of pmap.

Leonardo Borges
www.leonardoborges.com

On Dec 29, 2012 10:29 AM, cameron cdor...@gmail.com wrote:

Hi Lee,

I've done some more digging and seem to have found the root of the problem: it seems that Java native methods are much slower when called in parallel. The following code illustrates the problem:

(letfn [(time-native [f]
          (let [c (class [])]
            (time (dorun (f (fn [_] (.isArray c)) (range 1000))))))]
  (println "Sequential Test:")
  (time-native map)
  (println "Parallel Test:")
  (time-native pmap))

On a dual quad-core Xeon box I get the following results:

Sequential Test: Elapsed time: 1054.807725 msecs
Parallel Test: Elapsed time: 15302.605697 msecs

i.e. the code executed in parallel was 15 times slower than the sequential version. The same can be seen with the isInstance and isArray members of java.lang.Class. I'm not sure where to go from here; these functions are frequently used by clojure.core. Perhaps someone with more JVM implementation knowledge can help?

Cameron.
Re: abysmal multicore performance, especially on AMD processors
No, it's not the context switching; changing isArray (a native method) to getAnnotations (a normal JVM method) gives the same time for both the parallel and serial versions.

Cameron.

On Saturday, December 29, 2012 10:34:42 AM UTC+11, Leonardo Borges wrote:

In that case isn't context switching dominating your test? .isArray isn't expensive enough to warrant the use of pmap.

Leonardo Borges
www.leonardoborges.com

On Dec 29, 2012 10:29 AM, cameron cdo...@gmail.com wrote:

Hi Lee,

I've done some more digging and seem to have found the root of the problem: it seems that Java native methods are much slower when called in parallel. The following code illustrates the problem:

(letfn [(time-native [f]
          (let [c (class [])]
            (time (dorun (f (fn [_] (.isArray c)) (range 1000))))))]
  (println "Sequential Test:")
  (time-native map)
  (println "Parallel Test:")
  (time-native pmap))

On a dual quad-core Xeon box I get the following results:

Sequential Test: Elapsed time: 1054.807725 msecs
Parallel Test: Elapsed time: 15302.605697 msecs

i.e. the code executed in parallel was 15 times slower than the sequential version. The same can be seen with the isInstance and isArray members of java.lang.Class. I'm not sure where to go from here; these functions are frequently used by clojure.core. Perhaps someone with more JVM implementation knowledge can help?

Cameron.
Re: abysmal multicore performance, especially on AMD processors
I've been moving house for the last week or so, but I'll also give the benchmark another look.

My initial profiling seemed to show that the parallel version was spending a significant amount of time in java.lang.Class.isArray. clojush.pushstate/stack-ref is calling nth on the result of cons; since that result isn't an instance of clojure.lang.Indexed, nth resorts to a series of tests before returning the value (including isArray). I changed the clojush.pushstate/push-item implementation to:

(assoc state type (vec (cons value (type state)))) ;; vec added to ensure the result is indexable

This slowed down my single-threaded version a bit but improved my parallel speedup from 2x to 3x on an 8-physical-core machine. We could easily improve this by replacing the cons with conj and updating the code that pops the state, or by implementing a more efficient indexable stack with deftype.

After the change above, clojush.interpreter/execute_instruction is looking like a hotspot, with clojure.lang.ArraySeq creation seeming to spend more time in java.lang.Class.getComponentType() in the parallel version than the serial one.

Cameron.
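[Editor's note: Cameron's first suggestion — replace the cons with conj on a vector so nth never falls off the Indexed fast path — can be sketched as follows. The function names mirror clojush's, but these definitions are hypothetical simplifications, not the project's actual code:]

```clojure
(defn push-item
  "Hypothetical simplification: push onto a vector-backed stack with conj.
  Vectors implement clojure.lang.Indexed, so later nth calls take the fast
  path instead of the seq-walking path that triggers the isArray checks."
  [state type value]
  (assoc state type (conj (get state type []) value)))

(defn stack-ref
  "Hypothetical simplification: n counts down from the top of the stack
  (the conj end of the vector), matching stack semantics."
  [state type n]
  (let [stack (get state type [])]
    (nth stack (- (count stack) 1 n))))
```

The trade-off Cameron notes applies here too: the top of the stack moves to the vector's tail, so every caller that pops or peeks the state has to be updated to match.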
Re: abysmal multicore performance, especially on AMD processors
On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote:

Is there a much simpler way that I overlooked? I'm not sure it's simpler, but it's more straight-forward, I'd say.

Thanks, Marshall and Meikel, for the struct-to-record conversion code. I'll definitely make a change along those lines.

-Lee
Re: abysmal multicore performance, especially on AMD processors
Wm. Josiah Erikson wmjos...@gmail.com writes:

I hope this helps people get to the bottom of things.

Not to the bottom of things yet, but found some low-hanging fruit – switching the `push-state` from a struct-map to a record gives a flat ~2x speedup in all configurations I tested. So, that’s good?

I have, however, eliminated to my satisfaction the possibility that this is something orthogonal to your system. I do see some speedup improvements when I lower the level of concurrency to the number of actual physical cores on my system, but each call to the `error-function` still takes ~10x as long to complete when run in parallel vs. in serial.

For a while I had some hope that atom access in the main interpreter loop was the problem, due to causing extraneous fetches to main memory. But removing all the atoms from the system didn’t have any appreciable impact.

Anyway, still poking at it.

-Marshall
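[Editor's note: for readers unfamiliar with the struct-map-vs-record distinction behind the ~2x win, here is a minimal illustration. These are not clojush's actual definitions; records compile keyword access to a direct field read on a generated class, while struct-maps go through a map-style lookup:]

```clojure
;; struct-map version: fields live behind a map-like lookup
(defstruct push-state-struct :exec :integer)

;; record version: fields become fields on a generated Java class
(defrecord PushState [exec integer])

(def s (struct-map push-state-struct :exec '(a) :integer '(1)))
(def r (->PushState '(a) '(1)))
;; both support keyword access, so call sites can stay unchanged
```

Because both shapes answer `(:integer x)` the same way, the swap is mostly invisible to callers, which is why the conversion discussed below comes down to macro plumbing rather than rewriting the interpreter.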
Re: abysmal multicore performance, especially on AMD processors
On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote:

Not to the bottom of things yet, but found some low-hanging fruit – switching the `push-state` from a struct-map to a record gives a flat ~2x speedup in all configurations I tested. So, that’s good?

I really appreciate your attention to this, Marshall. FWIW I used records for push-states at one point but did not observe a speedup, and it required much messier code, so I reverted to struct-maps. But maybe I wasn't doing the right timings. I'm curious about how you changed to records without the messiness. I'll include below my sig the way that I had to do it... maybe you can show me what you did instead.

Still, I guess the gorilla in the room, which is eating the multicore performance, hasn't yet been found.

-Lee

--- tl;dr: the structs vs. records thing ---

Here's how I do what I need to do with struct-maps:

;; this is defined elsewhere, and I want push-states to have fields for each push-type that's defined here
(def push-types '(:exec :integer :float :code :boolean :string :zip :tag :auxiliary :return :environment))

;; so I use a macro to produce the struct-map definition
(defmacro define-push-state-structure []
  `(defstruct push-state ~@push-types))

;; call the macro to define the struct
(define-push-state-structure)

;; and then when I want a new push-state I can just call struct-map:
(defn make-push-state
  "Returns an empty push state."
  []
  (struct-map push-state))

In order to do the same thing with records I had to resort to this:

(defn keyword-symbol
  "Returns the symbol obtained by removing the : from a keyword."
  [kwd]
  (read-string (name kwd)))

(defmacro define-push-state-record-type
  "Defines the pushstate record type. The odd trick with read-string was a hack to avoid namespace qualification on the pushstate symbol."
  []
  `(defrecord ~(read-string "pushstate") [~@(map keyword-symbol push-types)]))

(define-push-state-record-type)

(defmacro make-push-state
  "Returns an empty push state."
  []
  `(pushstate. ~@(map (fn [_] nil) push-types)))

Is there a much simpler way that I overlooked?
Re: abysmal multicore performance, especially on AMD processors
Lee Spector lspec...@hampshire.edu writes:

FWIW I used records for push-states at one point but did not observe a speedup, and it required much messier code, so I reverted to struct-maps. But maybe I wasn't doing the right timings. I'm curious about how you changed to records without the messiness. I'll include below my sig the way that I had to do it... maybe you can show me what you did instead.

I just double-checked, and I definitely see a 2x speedup on Josiah’s benchmark. That may still be synthetic, of course. Here’s what I did:

(eval `(defrecord ~'PushState [~'trace ~@(map (comp symbol name) push-types)]))

(let [empty-state (map->PushState {})]
  (defn make-push-state
    "Returns an empty push state."
    []
    empty-state))

Still, I guess the gorilla in the room, which is eating the multicore performance, hasn't yet been found.

No, not yet... I’ve become obsessed with figuring it out though, so still slogging at it.

-Marshall
Re: abysmal multicore performance, especially on AMD processors
Hi,

On 22.12.12 00:37, Lee Spector wrote:

;; this is defined elsewhere, and I want push-states to have fields for each push-type that's defined here
(def push-types '(:exec :integer :float :code :boolean :string :zip :tag :auxiliary :return :environment))

(defn keyword-symbol
  "Returns the symbol obtained by removing the : from a keyword."
  [kwd]
  (read-string (name kwd)))

I'm afraid this one is still necessary, but I would use symbol instead of read-string.

(defmacro define-push-state-record-type
  "Defines the pushstate record type. The odd trick with read-string was a hack to avoid namespace qualification on the pushstate symbol."
  []
  `(defrecord ~(read-string "pushstate") [~@(map keyword-symbol push-types)]))

Use ~'pushstate instead of read-string.

(define-push-state-record-type)

(defmacro make-push-state
  "Returns an empty push state."
  []
  `(pushstate. ~@(map (fn [_] nil) push-types)))

Use this:

(defn make-push-state [] (map->pushstate {}))

Is there a much simpler way that I overlooked?

I'm not sure it's simpler, but it's more straight-forward, I'd say.

Kind regards

Meikel
Re: abysmal multicore performance, especially on AMD processors
So here's what we came up with that clearly demonstrates the problem. Lee provided the code and I tweaked it until I believe it shows the problem clearly and succinctly. I have put together a .tar.gz file that has everything needed to run it, except lein. Grab it here: clojush_bowling_benchmark.tar.gz

Then run, for instance:

/usr/bin/time -f %E lein run clojush.examples.benchmark-bowling

and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj and uncomment :use-single-thread true and run it again.

I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. We don't see slowdowns, but I cannot get any better than around a 2x speedup on any hardware with this benchmark. I hope this helps people get to the bottom of things.

-Josiah

On Sun, Dec 16, 2012 at 4:54 PM, Lee Spector lspec...@hampshire.edu wrote:

On Dec 14, 2012, at 10:41 PM, cameron wrote:

Until Lee has a representative benchmark for his application it's difficult to tell if he's experiencing the same problem, but there would seem to be a case for changing the PersistentList implementation in clojure.lang.

We put together a version of our application in which we just replaced all of the random calls with deterministic functions (returning either constants or deterministic functions of their arguments). What we saw was a maximum speedup of less than 2x on a 48-core machine, which was interesting in part because the determinism meant that only a tiny subset of our code was being executed (because all of the Push programs in the population were identical and used only a single Push instruction). So I think that this may indeed help us to home in on the problem. Before we share that code, however, we should make sure that the evaluations that are being done concurrently are sufficiently long-running, and maybe tweak a couple of other things.
I think we'll have a chance to do that early in the week and we'll share the results/code when we do. Thanks, -Lee
Re: abysmal multicore performance, especially on AMD processors
Whoops, sorry about the link. It should be able to be found here: http://gibson.hampshire.edu/~josiah/clojush/
Re: abysmal multicore performance, especially on AMD processors
On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote: I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. Clarification: it's not just a tight loop involving reverse/conj, as our previous benchmark was. It's our real application, but with deterministic versions of all of the random functions, and while the project includes some calls to reverse and conj I don't think they're playing a big role here. Almost all of the time here is spent evaluating a Push program that just does a lot of integer addition and consing. As far as I can tell all of the consing is done with cons explicitly, and not conj (although maybe I'm missing something, and I'm saying this only from looking at our code, not the Clojure libraries). -Lee
Re: abysmal multicore performance, especially on AMD processors
I tried redefining the few places in the code (string_reverse, I think) that used reverse to use the same version of reverse that I got such great speedups with in your code, and it made no difference. There are not any explicit calls to conj in the code that I could find.
Re: abysmal multicore performance, especially on AMD processors
Wm. Josiah Erikson wmjos...@gmail.com writes: Then run, for instance: /usr/bin/time -f %E lein run clojush.examples.benchmark-bowling and then, when that has finished, edit src/clojush/examples/benchmark_bowling.clj and uncomment :use-single-thread true and run it again. I think this is a succinct, deterministic benchmark that clearly demonstrates the problem and also doesn't use conj or reverse. We don't see slowdowns, but I cannot get any better than around 2x speedup on any hardware with this benchmark.

FWIW, I've just run it with these results:

- a dual-core intel notebook:
  + single-threaded: 4:00.09
  + multi-threaded: 2:27.35
- an AMD Opteron 8-core machine [*]:
  + single-threaded: 3:03.51
  + multi-threaded: 1:31.58

So indeed, I also get only a speedup of factor 2.

[*] I'm not exactly sure what kind of machine that is. /proc/cpuinfo reports 8 processors, each being a Quad-Core AMD Opteron(tm) Processor 8387. That would make 32 cores, but htop shows just 8. I think it's some virtualized machine, so maybe the host has 8 quad-cores, but the virtual machine gets only 8 of them...

Bye, Tassilo
Re: abysmal multicore performance, especially on AMD processors
On Dec 15, 2012, at 1:14 AM, cameron wrote: Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java for my GP work but for the last few years it's been GEVA with a clojure wrapper I wrote (https://github.com/cdorrat/geva-clj). Ah yes -- I've actually downloaded and played a bit with your GEVA via Clojure. I've thought it would be interesting to try using GEVA to evolve Push programs (which have a trivial grammar), but haven't had the time to try that. I've also used ECJ for some projects but haven't looked at its multicore performance. Sean Luke (ECJ's author and incidentally a former student of mine) even hacked together a way to use it to evolve Push programs but it was nasty (generating Push programs as strings) and would be low-performance for other reasons. -Lee
Re: abysmal multicore performance, especially on AMD processors
On Dec 14, 2012, at 10:41 PM, cameron wrote: Until Lee has a representative benchmark for his application it's difficult to tell if he's experiencing the same problem but there would seem to be a case for changing the PersistentList implementation in clojure.lang. We put together a version of our application in which we just replaced all of the random calls with deterministic functions (returning either constants or deterministic functions of their arguments). What we saw was a maximum speedup of less than x2 on a 48 core machine, which was interesting in part because the determinism meant that only a tiny subset of our code was being executed (because all of the Push programs in the population were identical and used only a single Push instruction). So I think that this may indeed help us to hone in on the problem. Before we share that code, however, we should make sure that the evaluations that are being done concurrently are sufficiently long-running, and maybe tweak a couple of other things. I think we'll have a chance to do that early in the week and we'll share the results/code when we do. Thanks, -Lee
Re: abysmal multicore performance, especially on AMD processors
I've created a test harness for this as a leiningen plugin: https://github.com/bendlas/lein-partest

You can just put :plugins [[net.bendlas/lein-partest 0.1.0]] into your project and run

lein partest your.ns/testfn 6

to run 6 threads/processes in parallel. The plugin then runs the function as follows:
- (a default of) 2 warmup runs
- 6 times serially
- 6 times in parallel with futures
- 6 times in parallel with classloaders
- 6 times in parallel with processes

A run with the original function (with an i of 1000 instead of 1) on my machine reveals that running in separate classloaders yields roughly the same times. This hints at the bottleneck being in the JVM instead of the clojure runtime, provided I have used the classloader feature of leiningen correctly (please verify). I'd be very interested in the performance on Azul's JVM.

lein partest example/burn 4

# Run serial
... warmup ...
# 0, elapsed: 187.57 msecs
# 1, elapsed: 184.83 msecs
# 2, elapsed: 183.99 msecs
# 3, elapsed: 184.33 msecs
# Run parallel, same classloader
... warmup ...
# 1, elapsed: 1063.02 msecs
# 2, elapsed: 1071.12 msecs
# 3, elapsed: 1091.50 msecs
# 0, elapsed: 1100.42 msecs
# Run parallel, distinct classloaders
... warmup ...
... warmup ...
... warmup ...
... warmup ...
# 2, elapsed: 1135.74 msecs
# 3, elapsed: 1118.14 msecs
# 1, elapsed: 1119.45 msecs
# 0, elapsed: 1103.45 msecs
# Run parallel, distinct processes
... warmup ...
... warmup ...
... warmup ...
... warmup ...
# 0, elapsed: 191.96 msecs
# 2, elapsed: 209.49 msecs
# 3, elapsed: 191.99 msecs
# 1, elapsed: 201.93 msecs
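The serial-vs-futures comparison the plugin performs can be sketched in a few lines (a simplification, not the plugin's actual code; `burn` here is a stand-in workload, not the example from the thread):

```clojure
;; Time one call of f, in milliseconds.
(defn time-ms [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

;; Stand-in workload: build up a persistent list and count it.
(defn burn []
  (count (reduce conj '() (range 100000))))

;; n timed runs back to back on the calling thread.
(defn run-serial [f n]
  (doall (repeatedly n #(time-ms f))))

;; n timed runs at once, each on its own future
;; (same JVM, same classloader -- the contended configuration).
(defn run-parallel [f n]
  (mapv deref (doall (repeatedly n #(future (time-ms f))))))
```

Comparing (run-serial burn 4) against (run-parallel burn 4) reproduces the shape of the transcript above: if per-run times balloon in the parallel case, the contention is visible even without the plugin.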
Re: abysmal multicore performance, especially on AMD processors
Thanks Herwig, I used your plugin with the following 2 burn variants:

(defn burn-slow [& _]
  (count (last (take 1000 (iterate #(reduce conj '() %) (range 1))))))

(defn burn-fast [& _]
  (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 1))))))

Where conj* is just a copy of clojure.core/conj. In the first case I see the same behaviour you saw: fast serially and in separate processes (~115 ms) but slow in parallel for both the same and different class loaders (708ms - 950ms). In the second case, where I avoid any references to clojure.lang.PersistentList$EmptyList and use a clean copy of conj, I see much better performance in all cases (~40ms - 50ms) and no slowdown in the parallel cases in the same JVM. Until Lee has a representative benchmark for his application it's difficult to tell if he's experiencing the same problem, but there would seem to be a case for changing the PersistentList implementation in clojure.lang. Cameron.
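Cameron doesn't show conj* itself; since he describes it as "just a copy of clojure.core/conj", a reconstruction under a new var (so call sites don't share the core var) would look roughly like this:

```clojure
;; Hypothetical reconstruction of the conj* used above: a copy of
;; clojure.core/conj, delegating to the same clojure.lang.RT/conj
;; primitive but living in its own var.
(defn conj*
  ([coll x] (clojure.lang.RT/conj coll x))
  ([coll x & xs]
   (if xs
     (recur (clojure.lang.RT/conj coll x) (first xs) (next xs))
     (clojure.lang.RT/conj coll x))))
```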
Re: abysmal multicore performance, especially on AMD processors
I'd be interested in seeing your GP system. The one we're using evolves Push programs and I suspect that whatever's triggering this problem with multicore utilization is stemming from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't know exactly what it is. Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java for my GP work but for the last few years it's been GEVA with a clojure wrapper I wrote (https://github.com/cdorrat/geva-clj). I realize that I could go back to this sort of thing, or something more modern and reasonable like hadoop, but of course it'd be a lot nicer if parallelization via agents (or some other mechanism central to the language) just didn't have whatever pathology we've uncovered. Agreed, I'd like clojure to be my preferred language when I have a 100 core machine, it's certainly something I'd like to get to the bottom of. Cameron.
Re: abysmal multicore performance, especially on AMD processors
OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of RAM, this is Tom's Bowling benchmark:

1. multithreaded. Average of 10 runs: 14:00.9
2. singlethreaded. Average of 10 runs: 23:35.3
3. singlethreaded, 8 simultaneous copies. Average run time of said concurrently running copies: 23:31.5

So we see a speedup of less than 2x running multithreaded in a single JVM instance. By contrast, running 8 simultaneous copies in 8 separate JVM's gives us a perfect 8x speedup over running a single instance of the same singlethreaded benchmark. This proves pretty conclusively that it's not a hardware limitation, it seems to me, unless the problem is that it's trying to spawn 48 threads and that creates contention. I don't think so though, because on an 8-way FX-8120 with 16GB of RAM, we see a very similar lack of speedup going from singlethreaded to multithreaded (and it will only be trying to use 8 threads, right?), and then we see a much better speedup (around 4x - we're doing 8 times the work in twice the amount of time) going to 8 concurrent copies of the same thing in separate JVM's (even though I had to decrease RAM usage on the 8 concurrent copies to avoid swapping, thereby possibly slowing this down a bit):

1. 9:00.6
2. 14:15.6
3. 27:35.1

We're probably getting a better speedup with the concurrent copies on the 48-way node because of higher memory bandwidth, bigger caches (and more of them), and more memory. Does this help? Should I do something else as well? I'm curious to try running, say, 16 concurrent copies on the 48-way node. -Josiah

On Wed, Dec 12, 2012 at 10:03 AM, Andy Fingerhut andy.finger...@gmail.com wrote: Lee: I believe you said that with your benchmarking code you achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall.
I confirmed that on my own machine as well. Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor. Andy

On Dec 11, 2012, at 4:37 PM, Lee Spector wrote: On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote: So I think if you replace your calls to `reverse` and any `conj` loops you have in your own code, you should see a perfectly reasonable speedup. Tantalizing, but on investigation I see that our real application actually does very little explicitly with reverse or conj, and I don't actually think that we're getting reasonable speedups (which is what led me to try that benchmark). So while I'm not sure of the source of the problem in our application I think there can be a problem even if one avoids direct calls to reverse and conj. Andy's recent tests also seem to confirm this. BTW benchmarking our real application (https://github.com/lspector/Clojush) is a bit tricky because it's riddled with random number generator calls that can have big effects, but we're going to look into working around that. Recent postings re: seedable RNGs may help, although changing all of the RNG code may be a little involved because we use thread-local RNGs (to avoid contention and get good multicore speedups... we thought!). -Lee
Re: abysmal multicore performance, especially on AMD processors
Ah. We'll look into running several clojures in one JVM too. Thanks.
Re: abysmal multicore performance, especially on AMD processors
I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM may be worth a try, to see if it increases parallelism for a single Clojure instance in a single JVM, with lots of threads. It costs $$, but I'm guessing they might have steep discounts for educational institutions. They have free trials, too. http://www.azulsystems.com/products/zing/whatisit Andy
Re: abysmal multicore performance, especially on AMD processors
Cool. I've requested a free trial.

On Thu, Dec 13, 2012 at 1:53 PM, Andy Fingerhut andy.finger...@gmail.com wrote: I'm not saying that I know this will help, but if you are open to trying a different JVM that has had a lot of work done on it to optimize it for high concurrency, Azul's Zing JVM may be worth a try, to see if it increases parallelism for a single Clojure instance in a single JVM, with lots of threads. It costs $$, but I'm guessing they might have steep discounts for educational institutions. They have free trials, too. http://www.azulsystems.com/products/zing/whatisit Andy

On Dec 13, 2012, at 10:41 AM, Wm. Josiah Erikson wrote: OK, I did something a little bit different, but I think it proves the same thing we were shooting for. On a 48-way 4 x Opteron 6168 with 32GB of RAM, this is Tom's Bowling benchmark:

1. Multithreaded. Average of 10 runs: 14:00.9
2. Singlethreaded. Average of 10 runs: 23:35.3
3. Singlethreaded, 8 simultaneous copies. Average run time of said concurrently running copies: 23:31.5

So we see a speedup of less than 2x running multithreaded in a single JVM instance. By contrast, running 8 simultaneous copies in 8 separate JVMs gives us a perfect 8x speedup over running a single instance of the same singlethreaded benchmark. This proves pretty conclusively, it seems to me, that it's not a hardware limitation, unless the problem is that it's trying to spawn 48 threads and that creates contention. I don't think so, though, because on an 8-way FX-8120 with 16GB of RAM we see a very similar lack of speedup going from singlethreaded to multithreaded (and it will only be trying to use 8 threads, right?), and then we see a much better speedup (around 4x: we're doing 8 times the work in twice the amount of time) going to 8 concurrent copies of the same thing in separate JVMs (even though I had to decrease RAM usage on the 8 concurrent copies to avoid swapping, thereby possibly slowing this down a bit):

1. 9:00.6
2. 14:15.6
3. 27:35.1

We're probably getting a better speedup with the concurrent copies on the 48-way node because of higher memory bandwidth, bigger caches (and more of them), and more memory. Does this help? Should I do something else as well? I'm curious to try running, say, 16 concurrent copies on the 48-way node. -Josiah

On Wed, Dec 12, 2012 at 10:03 AM, Andy Fingerhut andy.finger...@gmail.com wrote: Lee: I believe you said that your benchmarking code achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall. I confirmed that on my own machine as well. Have you tried running your real application in a single thread in a JVM, and then running multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor. Andy

On Dec 11, 2012, at 4:37 PM, Lee Spector wrote: On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote: So I think if you replace your calls to `reverse` and any `conj` loops you have in your own code, you should see a perfectly reasonable speedup. Tantalizing, but on investigation I see that our real application actually does very little explicitly with reverse or conj, and I don't actually think that we're getting reasonable speedups (which is what led me to try that benchmark). So while I'm not sure of the source of the problem in our application, I think there can be a problem even if one avoids direct calls to reverse and conj. Andy's recent tests also seem to confirm this. BTW, benchmarking our real application (https://github.com/lspector/Clojush) is a bit tricky because it's riddled with random number generator calls that can have big effects, but we're going to look into working around that. Recent postings re: seedable RNGs may help, although changing all of the RNG code may be a little involved because we use thread-local RNGs (to avoid contention and get good multicore speedups... we thought!). -Lee

-- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
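The single-JVM comparison Josiah describes (the same work run serially vs. in parallel threads) can be sketched with a small timing harness. This is a hedged sketch: `burn`, `time-ms`, and `speedup` are hypothetical stand-ins, not the actual Bowling benchmark.

```clojure
;; Sketch of the single-JVM speedup measurement described above.
;; `burn` is a hypothetical CPU-bound stand-in for one benchmark task.
(defn burn [n]
  (reduce + (map inc (range n))))

(defn time-ms
  "Wall-clock milliseconds to run f once."
  [f]
  (let [t0 (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) t0) 1e6)))

(defn speedup
  "Run `copies` tasks serially, then via pmap; return serial/parallel ratio."
  [task copies]
  (let [tasks    (repeat copies task)
        serial   (time-ms #(doseq [t tasks] (t)))
        parallel (time-ms #(dorun (pmap (fn [t] (t)) tasks)))]
    (/ serial parallel)))

;; e.g. (speedup #(burn 1000000) 8) ; ideally approaches the core count
```

Running this for increasing `copies` on one JVM, and comparing against separate single-threaded JVM runs, is essentially the experiment reported above.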
Re: abysmal multicore performance, especially on AMD processors
On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote: Does this help? Should I do something else as well? I'm curious to try running, say, 16 concurrent copies on the 48-way node Have you made any progress on a small deterministic benchmark that reflects your application's behaviour (i.e. the RNG seed work you were discussing)? I'm keen to help, but I don't have time to look at benchmarks that take hours to run. I've also been using Clojure for genetic programming and have been getting very good parallel speedups on a single JVM. This type of problem should see excellent performance gains: it has many medium-size CPU-intensive sub-problems that require no shared resources. I used Hadoop to scale beyond single machines; in the worst case you can use this approach and you should see speed-ups equivalent to your multiple-JVM test. I just split the population into small blocks and had each map function calculate the fitness and return a map of individual-id -> fitness. Cameron.
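Cameron's block-wise fitness scheme can be sketched in plain Clojure (without Hadoop, using `pmap` in its place). This is an illustrative sketch: `fitness` and the block size are hypothetical stand-ins.

```clojure
;; Sketch: split a GP population into blocks and evaluate fitness in
;; parallel, returning a merged {individual-id -> fitness} map.
;; `fitness` is a hypothetical placeholder for a CPU-bound evaluation.
(defn fitness [individual]
  (reduce + (map #(* % %) (range (count (str individual))))))

(defn evaluate-blocks
  "Partition `population` ([id individual] pairs) into blocks of
  `block-size`, map each block to an {id -> fitness} map in parallel,
  and merge the per-block results."
  [population block-size]
  (->> (partition-all block-size population)
       (pmap (fn [block]
               (into {} (map (fn [[id ind]] [id (fitness ind)]) block))))
       (apply merge)))

;; Usage: a population of 100 identical dummy individuals
(evaluate-blocks (map vector (range 100) (repeat "prog")) 10)
```

Each block is independent, so this mirrors the no-shared-resources structure that should parallelize well; replacing `pmap` with a Hadoop map phase generalizes the same shape across machines.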
Re: abysmal multicore performance, especially on AMD processors
On Dec 13, 2012, at 4:21 PM, cameron wrote: Have you made any progress on a small deterministic benchmark that reflects your application's behaviour (i.e. the RNG seed work you were discussing)? I'm keen to help, but I don't have time to look at benchmarks that take hours to run. I've also been using Clojure for genetic programming and have been getting very good parallel speedups on a single JVM. This type of problem should see excellent performance gains: it has many medium-size CPU-intensive sub-problems that require no shared resources. I used Hadoop to scale beyond single machines; in the worst case you can use this approach and you should see speed-ups equivalent to your multiple-JVM test. I just split the population into small blocks and had each map function calculate the fitness and return a map of individual-id -> fitness. Cameron. I don't think we've quite worked out a deterministic test, but I'll look into it some more. I'd be interested in seeing your GP system. The one we're using evolves Push programs, and I suspect that whatever's triggering this problem with multicore utilization stems from something in the inner loop of my Push interpreter (https://github.com/lspector/Clojush)... but I don't know exactly what it is. I've never used Hadoop, but I've done lots of coarse-grained parallel GP using (gasp) shell scripts and the file system to glue the independent parts together. I realize that I could go back to this sort of thing, or something more modern and reasonable like Hadoop, but of course it'd be a lot nicer if parallelization via agents (or some other mechanism central to the language) just didn't have whatever pathology we've uncovered. -Lee
Re: abysmal multicore performance, especially on AMD processors
Hi Marshall, the megamorphic call site hypothesis does sound plausible, but I'm not sure where the following test fits in. If I understand correctly, we believe that the base case (a PersistentList$EmptyList instance) and the normal case (a PersistentList instance) have different types, and when run in parallel the interleaving invocations are causing problems for the JIT, but when we use a vector it's a single type, so we see a better speedup. I was toying with the idea of replacing the EmptyList class with a PersistentList instance to mitigate the problem in at least one common case; however, it doesn't seem to help. If I replace the reverse call in burn with the following code: #(reduce conj (list nil) %) I get the same slowdown as we see with reverse (equivalent to #(reduce conj '() %)), where:

(class '())        => clojure.lang.PersistentList$EmptyList
(class (list nil)) => clojure.lang.PersistentList

I'm not convinced we're at the bottom of it yet; perhaps the earlier post on memory contention by Carlos might yield some clues. If the issue were related to memory access, it might help explain why the impact differed significantly between Intel and AMD hardware. Cameron.
Re: abysmal multicore performance, especially on AMD processors
Andy Fingerhut andy.finger...@gmail.com writes: I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use reverse or conj, and as far as I can tell doesn't use PersistentList either, only Cons. ... Can you try to reproduce to see if you get similar results? If so, do you know why we get bad parallelism in a single JVM for this code? If there are no megamorphic call sites, then it is examples like this that lead me to wonder about locking in memory allocation and/or GC. I think your benchmark is a bit different from Lee's original. The `reverse`-based versions allocate heavily as they repeatedly reverse a sequence, but each thread holds a sequence of length at most 10,000 at any given time. In your benchmark, each thread holds a sequence of at most 2,000,000 elements, for a naive 200x increase in memory pressure and a potential increase in the number of objects being promoted out of the young generation. I ran your benchmark under a version of Cameron's criterium-based speed-up measurement wrapper that I've modified to pass in the `pmap` function to use. I reduced the number of iterations in your algorithm by a factor of 5 to get it to run in a reasonable amount of time, and I ran it using default JVM GC settings on a 32-way AMD system.
I get the following numbers for 1-32 way parallelism with a 500MB heap:

andy  1 : smap-ms 7.5, pmap-ms  7.7, speedup 0.97
andy  2 : smap-ms 7.8, pmap-ms  9.8, speedup 0.80
andy  4 : smap-ms 8.5, pmap-ms 10.6, speedup 0.80
andy  8 : smap-ms 8.6, pmap-ms 11.5, speedup 0.75
andy 16 : smap-ms 8.1, pmap-ms 12.5, speedup 0.65
andy 32 : [java.lang.OutOfMemoryError: Java heap space]

And these numbers with a 4GB heap:

andy  1 : smap-ms 3.8, pmap-ms 4.0, speedup 0.95
andy  2 : smap-ms 4.2, pmap-ms 2.1, speedup 2.02
andy  4 : smap-ms 4.2, pmap-ms 1.7, speedup 2.48
andy  8 : smap-ms 4.2, pmap-ms 1.2, speedup 3.44
andy 16 : smap-ms 4.4, pmap-ms 1.0, speedup 4.52
andy 32 : smap-ms 4.0, pmap-ms 1.6, speedup 2.55

I'm running out of time for breakfast experiments, but it seems relatively likely to me that the increased at-once sequence size in your benchmark is increasing the number of objects making it out of the young generation. This in turn is increasing the number of pause-the-world GCs, which increase even further in frequency at lower heap sizes. I'll run these again later with GC logging and report if the results are unexpected. -Marshall
Re: abysmal multicore performance, especially on AMD processors
cameron cdor...@gmail.com writes: the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. ... I was toying with the idea of replacing the EmptyList class with a PersistentList instance to mitigate the problem in at least one common case, however it doesn't seem to help. If I replace the reverse call in burn with the following code: #(reduce conj (list nil) %) I get the same slowdown as we see with reverse (equivalent to #(reduce conj '() %)) Ah, but include your own copy of `conj` and try those two cases. The existing clojure.core/conj has already been used on multiple types, so you need a new IFn class with a fresh call site. Here are the numbers I get when I do that:

w/o EmptyList : smap-ms  6.1, pmap-ms  1.2, speedup 5.26
w/  EmptyList : smap-ms 10.4, pmap-ms 16.2, speedup 0.64
w/o EmptyList : smap-ms 10.5, pmap-ms 16.3, speedup 0.64

That said, I'm slightly less convinced than I was earlier. I'm having difficulty producing a minimal example demonstrating the issue, and the results wmjosiah reported for modifying their actual code are disheartening. I just don't have anything else which begins to explain the transition from speedup to inverse speedup in the above benchmarking sequence. -Marshall
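Marshall's "fresh call site" suggestion amounts to defining your own copy of `conj`: each `defn` compiles to a new IFn class, so its internal call site starts with a clean JIT type profile. A minimal sketch, with hypothetical names (`my-conj`, `build-list`):

```clojure
;; A private copy of conj: a new IFn class whose call site has a fresh
;; (initially monomorphic) type profile, unlike clojure.core/conj,
;; which has already been invoked on many collection types.
(defn my-conj [coll x]
  (clojure.lang.RT/conj coll x))

;; Use the copy on exactly one concrete type (here PersistentList,
;; seeded with (list nil) rather than '()) to keep the site monomorphic.
(defn build-list [xs]
  (reduce my-conj (list nil) xs))
```

The "w/ EmptyList" row above corresponds to seeding the reduce with `'()` instead, which mixes a second class into the same call site.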
Re: abysmal multicore performance, especially on AMD processors
Lee: I believe you said that your benchmarking code achieved good speedup when run as separate JVMs that were each running a single thread, even before making the changes to the implementation of reverse found by Marshall. I confirmed that on my own machine as well. Have you tried running your real application in a single thread in a JVM, and then running multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor. Andy On Dec 11, 2012, at 4:37 PM, Lee Spector wrote: On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote: So I think if you replace your calls to `reverse` and any `conj` loops you have in your own code, you should see a perfectly reasonable speedup. Tantalizing, but on investigation I see that our real application actually does very little explicitly with reverse or conj, and I don't actually think that we're getting reasonable speedups (which is what led me to try that benchmark). So while I'm not sure of the source of the problem in our application, I think there can be a problem even if one avoids direct calls to reverse and conj. Andy's recent tests also seem to confirm this. BTW, benchmarking our real application (https://github.com/lspector/Clojush) is a bit tricky because it's riddled with random number generator calls that can have big effects, but we're going to look into working around that. Recent postings re: seedable RNGs may help, although changing all of the RNG code may be a little involved because we use thread-local RNGs (to avoid contention and get good multicore speedups... we thought!). -Lee
Re: abysmal multicore performance, especially on AMD processors
On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote: Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor. I don't think we've tried this exact test but we will now. Because we haven't yet engineered all of the randomness out of the application we'll do a bunch of runs in each condition and average the times. Based on recent tests I think this will give reasonably stable results if we do 10 runs or so in each condition. I'll keep you posted. Thanks! -Lee
Re: abysmal multicore performance, especially on AMD processors
Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. Christophe On Wed, Dec 12, 2012 at 4:11 PM, Lee Spector lspec...@hampshire.edu wrote: On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote: Have you tried running your real application in a single thread in a JVM, and then run multiple JVMs in parallel, to see if there is any speedup? If so, that would again help determine whether it is multiple threads in a single JVM causing the slowdown, or something to do with the hardware or OS that is the limiting factor. I don't think we've tried this exact test but we will now. Because we haven't yet engineered all of the randomness out of the application we'll do a bunch of runs in each condition and average the times. Based on recent tests I think this will give reasonably stable results if we do 10 runs or so in each condition. I'll keep you posted. Thanks! -Lee -- On Clojure http://clj-me.cgrand.net/ Clojure Programming http://clojurebook.com Training, Consulting Contracting http://lambdanext.eu/
Re: abysmal multicore performance, especially on AMD processors
On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. I'm not actually sure how to do that. We're starting runs with lein run, which, if I understand correctly, launches a JVM and a Clojure instance within it. Within that Clojure instance we can do things in either a single-threaded or a multi-threaded way. How would I change our method to do what you're asking here? We're not particularly Java savvy (I come from the Lisp side of things). Thanks, -Lee
Re: abysmal multicore performance, especially on AMD processors
On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall Bockrath-Vandegrift wrote: cameron cdo...@gmail.com writes: the megamorphic call site hypothesis does sound plausible but I'm not sure where the following test fits in. ... I was toying with the idea of replacing the EmptyList class with a PersistentList instance to mitigate the problem in at least one common case, however it doesn't seem to help. If I replace the reverse call in burn with the following code: #(reduce conj (list nil) %) I get the same slowdown as we see with reverse (equivalent to #(reduce conj '() %)) Ah, but include your own copy of `conj` and try those two cases. The existing clojure.core/conj has already been used on multiple types, so you need a new IFn class with a fresh call site. Here are the numbers I get when I do that: Ah, I've also been looking at this in the morning and missed that bit. When I use a copy of conj I get similar results to yours:

w/ list-base       : map-ms: 4.0, pmap-ms  0.6, speedup 7.02
w/ empty-list-base : map-ms: 9.9, pmap-ms 20.4, speedup 0.48
w/ list-base       : map-ms: 9.4, pmap-ms 20.7, speedup 0.45
w/ vector          : map-ms: 4.5, pmap-ms  1.4, speedup 3.28

It does seem that once used on the empty-list case, conj gets tainted and all future list uses incur a performance penalty. On that basis it would seem reasonable to convert the EmptyList uses in clojure.lang to a PersistentList instance. Interestingly, using a tainted conj with other types doesn't seem to incur the penalty (the vector case in the times above). If I get time I might look at a protocol-based version of conj out of interest. Cameron.
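Cameron's protocol-based idea could look something like the sketch below (the names `PConj`/`pconj` are hypothetical). Protocol functions dispatch through their own per-call-site caches, so this is one plausible way to probe whether the penalty follows the dispatch mechanism rather than the collection types themselves:

```clojure
;; Hypothetical sketch of a protocol-based conj.
(defprotocol PConj
  (pconj [coll x] "conj implemented via protocol dispatch"))

(extend-protocol PConj
  clojure.lang.IPersistentCollection
  (pconj [coll x] (.cons coll x))
  ;; Clojure protocols can be extended to nil, covering (pconj nil x)
  nil
  (pconj [_ x] (list x)))
```

Usage: `(pconj '(1) 2)` yields `(2 1)`, `(pconj [1] 2)` yields `[1 2]`. Whether this actually avoids the contention observed above is exactly the open question in the thread.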
Re: abysmal multicore performance, especially on AMD processors
See https://github.com/flatland/classlojure for a nearly ready-made solution to running several Clojures in one JVM. On Wed, Dec 12, 2012 at 5:20 PM, Lee Spector lspec...@hampshire.edu wrote: On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote: Lee, while you are at benchmarking, would you mind running several threads in one JVM with one clojure instance per thread? Thus each thread should get JITted independently. I'm not actually sure how to do that. We're starting runs with lein run, which, if I understand correctly, launches a JVM and a Clojure instance within it. Within that Clojure instance we can do things in either a single-threaded or a multi-threaded way. How would I change our method to do what you're asking here? We're not particularly Java savvy (I come from the Lisp side of things). Thanks, -Lee -- On Clojure http://clj-me.cgrand.net/ Clojure Programming http://clojurebook.com Training, Consulting Contracting http://lambdanext.eu/
Re: abysmal multicore performance, especially on AMD processors
nicolas.o...@gmail.com nicolas.o...@gmail.com writes: What happens if you run it a third time at the end? (The question is related to the fact that there appear to be transition states between monomorphic and megamorphic call sites, which might lead to an explanation.) Same results, but your comment jogged my reading and analysis in what I believe is the right direction. This is a potentially grandiose claim, but I believe the inverse speedup is due to contention caused by JVM type profiling in megamorphic call sites. Unfortunately -XX:-UseTypeProfile doesn't appear to turn off type profiling itself, so I can't prove this claim definitively without going even further down the rabbit hole and grubbing through the JVM source code. To back up this claim, I present the following modified `conj*`:

(defn conj* [coll x]
  (let [a (long-array 32)]
    (dotimes [n 5]
      (dotimes [i 32]
        (aset a i n)))
    (clojure.lang.RT/conj coll x)))

And the resulting profiling numbers:

list-conj* : map-ms: 42.0, pmap-ms 24.8, speedup 1.69
cons-conj* : map-ms: 39.7, pmap-ms 25.1, speedup 1.58

Adding busy-work (and an extra allocation!) to the loop improved the speed-up, I believe by decreasing the relative portion of the call execution time during which the call-site type profile information is being updated. If I'm correct, the `Cons` and `PersistentList` `.cons` implementations are so tight that in their base versions the type profiling forms a significant enough portion of the total call that the updates are in frequent conflict. Adding busy-work to the `conj*` call adds some jitter which prevents them from contending quite as frequently. I'm not sure what the next steps are. Open a bug on the JVM? This is something one can attempt to circumvent on a case-by-case basis, but IMHO has significant negative implications for Clojure's concurrency story. -Marshall
Re: abysmal multicore performance, especially on AMD processors
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote: I'm not sure what the next steps are. Open a bug on the JVM? This is something one can attempt to circumvent on a case-by-case basis, but IMHO has significant negative implications for Clojure's concurrency story. I've gotten a bit lost in some of the diagnoses and experimental results and analyses and suggestions -- all of which I really appreciate! -- but I'm trying to get clear in my mind what the current state of things is from an application perspective. Is the following a fair characterization pending further developments? --- If you have a cons-intensive task then even if it can be divided into completely independent, long-running subtasks, there is currently no known way to get significant speedups by running the subtasks on multiple cores within a single Clojure process. In some cases you will be able to get significant speedups by separating the subtasks completely and running them in separate Clojure processes running on separate JVM instances. But the speedups will be lost (mostly, and you might even experience slowdowns) if you try to run them from within a single Clojure process. --- Or have I missed a currently-available work-around among the many suggestions? I realize that even if this is the current situation it might change via fixes on any one of several fronts (and I hope that it does!), but is this indeed the current situation? Thanks so much, -Lee
Re: abysmal multicore performance, especially on AMD processors
Lee, My reading of this thread is not quite as pessimistic as yours. Here is my synthesis for the practical application developer in Clojure from reading and re-reading all of the posts above. Marshall and Cameron, please feel free to correct me if I screw anything up here royally. ;-) When running code in parallel, the fastest to slowest ways to allocate memory via Clojure (of those examined above) are:

1. Java array
2. Java linked list
3. conj onto a transient vector
4. cons onto a list
5. conj onto a vector
6. DO NOT conj onto a list EVER!!!

And that's pretty much all there is to it until Marshall figures out how to hack the JVM to overcome this very subtle JIT deoptimization bug relating to polymorphic conj calls on Cons and PersistentList types. Good luck, ~Gary On Tuesday, December 11, 2012 10:37:50 AM UTC-5, Lee wrote: On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote: I'm not sure what the next steps are. Open a bug on the JVM? This is something one can attempt to circumvent on a case-by-case basis, but IMHO has significant negative implications for Clojure's concurrency story. I've gotten a bit lost in some of the diagnoses and experimental results and analyses and suggestions -- all of which I really appreciate! -- but I'm trying to get clear in my mind what the current state of things is from an application perspective. Is the following a fair characterization pending further developments? --- If you have a cons-intensive task then even if it can be divided into completely independent, long-running subtasks, there is currently no known way to get significant speedups by running the subtasks on multiple cores within a single Clojure process. In some cases you will be able to get significant speedups by separating the subtasks completely and running them in separate Clojure processes running on separate JVM instances.
But the speedups will be lost (mostly, and you might even experience slowdowns) if you try to run them from within a single Clojure process. --- Or have I missed a currently-available work-around among the many suggestions? I realize that even if this is the current situation it might change via fixes on any one of several fronts (and I hope that it does!), but is this indeed the current situation? Thanks so much, -Lee
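Gary's ranking can be exercised with a few equivalent builders. These are illustrations of the allocation strategies he lists, not benchmarks; which is actually fastest under parallel load is exactly what this thread is probing, and the function names are hypothetical.

```clojure
;; Ways to build an n-element collection, roughly in Gary's
;; fast-to-slow order for parallel workloads (illustrative only).

;; 1. Java array
(defn via-array [n]
  (let [^longs a (long-array n)]
    (dotimes [i n] (aset a i i))
    a))

;; 2. Java linked list
(defn via-linked-list [n]
  (let [ll (java.util.LinkedList.)]
    (dotimes [i n] (.add ll i))
    ll))

;; 3. conj onto a transient vector
(defn via-transient [n]
  (persistent! (reduce conj! (transient []) (range n))))

;; 4. cons onto a list (result comes out reversed)
(defn via-cons [n]
  (reduce (fn [acc x] (cons x acc)) '() (range n)))

;; 5. conj onto a vector
(defn via-vector-conj [n]
  (reduce conj [] (range n)))

;; 6. conj onto a list: same shape as via-cons but through the
;;    problematic megamorphic conj call site -- deliberately omitted.
```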
Re: abysmal multicore performance, especially on AMD processors
Lee Spector lspec...@hampshire.edu writes:

> Is the following a fair characterization pending further developments?
> If you have a cons-intensive task then even if it can be divided into
> completely independent, long-running subtasks, there is currently no
> known way to get significant speedups by running the subtasks on
> multiple cores within a single Clojure process.

Not quite. If you'd been using `cons` (in the benchmark, if `reverse` used `cons` in its implementation), then you'd be getting a perfectly reasonable speedup. The problem child in this instance is `conj`.

If my analysis is correct, then the issue is any megamorphic call site – such as `conj` – which is invoked in a tight loop by multiple threads simultaneously. Any simultaneous invocation of such call sites introduces contention and reduces speedup, but the problem only becomes pathological in very, very tight loops, such as when performing the minimal work required by the `.cons` [1] implementations of `Cons` and `PersistentList`. In these cases the portion of the call which introduces contention is a sufficient proportion of the overall call time that the speedup becomes inverse.

> In some cases you will be able to get significant speedups by
> separating the subtasks completely and running them in separate Clojure
> processes running on separate JVM instances. But the speedups will be
> lost (mostly, and you might even experience slowdowns) if you try to
> run them from within a single Clojure process.

For this particular issue, splitting each task into a separate JVM entirely negates the problem, because there is no simultaneous invocation of the same call site.

> Or have I missed a currently-available work-around among the many
> suggestions?

You can specialize your application to avoid megamorphic call sites in tight loops. If you are working with `Cons`-based sequences, just use `cons` instead of `conj`. If you are working with vectors, create your own private implementation of `conj` which you *only* call on vectors. If you are depending on operations which may/do use `conj` in tight loops, create your own private re-implementations which don't, such as any of the faster versions of `reverse` earlier in this thread. This is suboptimal, but it's entirely possible to work around the issue with a little bit of analysis and profiling.

[1] A possible point of confusion – the JVM interface method invoked by the Clojure `conj` function is named `.cons`, for what I assume are historical reasons. The Clojure `cons` function, on the other hand, just allocates a `Cons` object in an entirely monomorphic fashion.

-Marshall
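Marshall's distinction between call-site shapes can be sketched in plain Java. This is an illustrative toy, not Clojure's actual classes: the `SeqLike`, `ConsCell`, and `ListCell` names are invented. The point is that `build` contains a single `s.cons(...)` call site; feed it only one concrete type and it stays monomorphic (inlinable), feed it several and it goes polymorphic/megamorphic, dispatching through a table on every call.

```java
// Toy model of a shared "conj"-like call site (names invented for illustration).
interface SeqLike {
    SeqLike cons(Object o);
    int count();
}

final class ConsCell implements SeqLike {
    final Object first;
    final SeqLike rest;
    final int n;
    ConsCell(Object first, SeqLike rest) {
        this.first = first;
        this.rest = rest;
        this.n = (rest == null) ? 1 : rest.count() + 1;
    }
    public SeqLike cons(Object o) { return new ConsCell(o, this); }
    public int count() { return n; }
}

final class ListCell implements SeqLike {
    final Object first;
    final SeqLike rest;
    final int n;
    ListCell(Object first, SeqLike rest) {
        this.first = first;
        this.rest = rest;
        this.n = (rest == null) ? 1 : rest.count() + 1;
    }
    public SeqLike cons(Object o) { return new ListCell(o, this); }
    public int count() { return n; }
}

public class Megamorphic {
    // One static method = one call site for s.cons(...). It sees every
    // concrete type that is ever passed in as a seed.
    static SeqLike build(SeqLike seed, int n) {
        SeqLike s = seed;
        for (int i = 0; i < n; i++) s = s.cons(i);  // the shared call site
        return s;
    }

    public static void main(String[] args) {
        // Running build with two different seed types is what takes the
        // call site from monomorphic to polymorphic.
        SeqLike a = build(new ConsCell(0, null), 1000);
        SeqLike b = build(new ListCell(0, null), 1000);
        System.out.println(a.count() + " " + b.count());  // prints "1001 1001"
    }
}
```

This mirrors how Clojure's `conj` goes through one `RT.conj` call site for every collection type, while `cons` always allocates exactly one class.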
Re: abysmal multicore performance, especially on AMD processors
On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote:

>> Or have I missed a currently-available work-around among the many
>> suggestions?
>
> You can specialize your application to avoid megamorphic call sites in
> tight loops. If you are working with `Cons`-based sequences, just use
> `cons` instead of `conj`. If you are working with vectors, create your
> own private implementation of `conj` which you *only* call on vectors.
> If you are depending on operations which may/do use `conj` in tight
> loops, create your own private re-implementations which don't. This is
> suboptimal, but it's totally possible to work around the issue with a
> little bit of analysis and profiling.

Very informative. Thanks, and thanks also to Gary.

If the application does lots of list processing but does so with a mix of Clojure list and sequence manipulation functions, then one would have to write private, list/cons-only versions of all of these things? That is -- overstating it a bit, to be sure, but perhaps not entirely unfairly -- re-implement Clojure's Lisp?

 -Lee
Re: abysmal multicore performance, especially on AMD processors
Lee Spector lspec...@hampshire.edu writes:

> If the application does lots of list processing but does so with a mix
> of Clojure list and sequence manipulation functions, then one would
> have to write private, list/cons-only versions of all of these things?
> That is -- overstating it a bit, to be sure, but perhaps not entirely
> unfairly -- re-implement Clojure's Lisp?

I just did a quick look over clojure/core.clj, and `reverse` is the only function which stood out to me as hitting the most pathological case. Every other `conj` loop over a user-provided data structure is `conj`ing into an explicit non-list/`Cons` type. So I think if you replace your calls to `reverse` and any `conj` loops you have in your own code, you should see a perfectly reasonable speedup.

-Marshall
Re: abysmal multicore performance, especially on AMD processors
OK WOW. You hit the nail on the head. It's `reverse` being called in a `pmap` that does it. When I redefine my own version of reverse (I totally cheated and just stole this) like this:

    (defn reverse-recursively [coll]
      (loop [[r & more :as all] (seq coll)
             acc '()]
        (if all
          (recur more (cons r acc))
          acc)))

I speed it up from, get this, 4:32 to 0:25. Yeah.

In case anybody's curious, pmap and pmapall show identical performance in this case.

Wow.

-Josiah
Re: abysmal multicore performance, especially on AMD processors
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7 3770K, when before it was very, very much not so. So for some reason, this bug was tickled more dramatically on AMD multicore processors than on Intel ones.
Re: abysmal multicore performance, especially on AMD processors
Marshall: I'm not practiced in recognizing megamorphic call sites, so I could be missing some in the example code below, modified from Lee's original code. It doesn't use `reverse` or `conj`, and as far as I can tell doesn't use `PersistentList` either, only `Cons`.

    (defn burn-cons [size]
      (let [size (long size)]
        (loop [i (long 0)
               value nil]
          (if (= i size)
            (last value)
            (recur (inc i)
                   (clojure.lang.Cons.
                    (* (int i)
                       (+ (float i)
                          (- (int i)
                             (/ (float i)
                                (inc (int i))))))
                    value))))))

(a) invoke (burn-cons 200) sequentially 64 times in a single JVM.

(b) invoke (burn-cons 200) 64 times using a modified version of pmap that limits the number of active threads to 2 (see below), in a single JVM. I would hope that this would take about half the elapsed time of (a), but the elapsed time is longer than (a).

(c) start up two JVMs simultaneously and invoke (burn-cons 200) sequentially 32 times in each. The elapsed time here is less than (a), as I would expect.

(Clojure 1.4, Oracle/Apple JDK 1.6.0_37, Mac OS X 10.6.8, running on a machine with an Intel Core i7 with 4 physical cores, but OS X reports it as 8, I think because of 2 hyperthreads per core -- more details available on request.)

Can you try to reproduce to see if you get similar results? If so, do you know why we get bad parallelism in a single JVM for this code? If there are no megamorphic call sites, then it is examples like this that lead me to wonder about locking in memory allocation and/or GC.

With the functions below, my part (b) was measured by doing:

    (time (doall (nthreads-pmap 2 (fn [_] (burn-cons 200)) (unchunk (range 64)))))

Andy

    (defn unchunk [s]
      (when (seq s)
        (lazy-seq
          (cons (first s)
                (unchunk (next s))))))

    (defn nthreads-pmap
      "Like pmap, except can take an argument nthreads to control the
      maximum number of parallel threads used."
      ([f coll]
         (let [n (+ 2 (.. Runtime getRuntime availableProcessors))]
           (nthreads-pmap n f coll)))
      ([nthreads f coll]
         (if (= nthreads 1)
           (map f coll)
           (let [n (dec nthreads)
                 rets (map #(future (f %)) coll)
                 step (fn step [[x & xs :as vs] fs]
                        (lazy-seq
                          (if-let [s (seq fs)]
                            (cons (deref x) (step xs (rest s)))
                            (map deref vs))))]
             (step rets (drop n rets)))))
      ([nthreads f coll & colls]
         (let [step (fn step [cs]
                      (lazy-seq
                        (let [ss (map seq cs)]
                          (when (every? identity ss)
                            (cons (map first ss)
                                  (step (map rest ss)))))))]
           (nthreads-pmap nthreads #(apply f %) (step (cons coll colls))))))
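Andy's `nthreads-pmap` bounds the number of concurrently active tasks while preserving input order. The same contract can be sketched on the JVM with a fixed-size executor. This is a hypothetical Java analogue, not code from the thread; `BoundedPmap` and `pmap` are invented names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedPmap {
    // Apply f to every element of xs using at most nThreads worker threads,
    // returning the results in input order -- the contract nthreads-pmap
    // aims for, minus the laziness.
    static <A, B> List<B> pmap(int nThreads, Function<A, B> f, List<A> xs)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<B>> futures = new ArrayList<>();
            for (A x : xs) futures.add(pool.submit(() -> f.apply(x)));
            List<B> out = new ArrayList<>();
            for (Future<B> fut : futures) out.add(fut.get());  // preserves order
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints "[1, 4, 9, 16]"
        System.out.println(pmap(2, (Integer i) -> i * i, List.of(1, 2, 3, 4)));
    }
}
```

A fixed pool makes the thread bound explicit, whereas the future-based Clojure version bounds how far ahead of consumption the futures may run.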
Re: abysmal multicore performance, especially on AMD processors
...and, suddenly, the high-core-count Opterons show us what we wanted and hoped for. If I increase that range statement to 100 and run it on the 48-core node, it takes 50 seconds (before it took 50 minutes), while the FX-8350 takes 3:31.89 and the 3770K takes 3:48.95.

Thanks Marshall! I think you may have just saved us outrageous quantities of time, though I don't think Lee is convinced that we know exactly what's going on with Clojush yet. We haven't looked at it yet, but I'm sure we will soon enough!
Re: abysmal multicore performance, especially on AMD processors
Hm. Interesting. For the record, the exact code I'm running right now that I'm seeing great parallelism with is this:

    (defn reverse-recursively [coll]
      (loop [[r & more :as all] (seq coll)
             acc '()]
        (if all
          (recur more (cons r acc))
          acc)))

    (defn burn
      ([] (loop [i 0
                 value '()]
            (if (= i 10000)
              (count (last (take 1000 (iterate reverse-recursively value))))
              (recur (inc i)
                     (cons (* (int i)
                              (+ (float i)
                                 (- (int i)
                                    (/ (float i)
                                       (inc (int i))))))
                           value)))))
      ([_] (burn)))

    (defn pmapall
      "Like pmap but: 1) coll should be finite, 2) the returned sequence
      will not be lazy, 3) calls to f may occur in any order, to maximize
      multicore processor utilization, and 4) takes only one coll so far."
      [f coll]
      (let [agents (map agent coll)]
        (dorun (map #(send % f) agents))
        (apply await agents)
        (doall (map deref agents))))

    (defn -main [& args]
      (time (doall (pmapall burn (range 100))))
      (System/exit 0))
Re: abysmal multicore performance, especially on AMD processors
On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:

> So I think if you replace your calls to `reverse` and any `conj` loops
> you have in your own code, you should see a perfectly reasonable
> speedup.

Tantalizing, but on investigation I see that our real application actually does very little explicitly with reverse or conj, and I don't actually think that we're getting reasonable speedups (which is what led me to try that benchmark). So while I'm not sure of the source of the problem in our application, I think there can be a problem even if one avoids direct calls to reverse and conj. Andy's recent tests also seem to confirm this.

BTW, benchmarking our real application (https://github.com/lspector/Clojush) is a bit tricky because it's riddled with random number generator calls that can have big effects, but we're going to look into working around that. Recent postings re: seedable RNGs may help, although changing all of the RNG code may be a little involved because we use thread-local RNGs (to avoid contention and get good multicore speedups... we thought!).

 -Lee
Re: abysmal multicore performance, especially on AMD processors
Hi Marshall, I think we're definitely on the right track. If I replace the reverse call with the following function, I get a parallel speedup of ~7.3 on an 8-core machine:

    (defn copy-to-java-list [coll]
      (let [lst (java.util.LinkedList.)]
        (doseq [x coll]
          (.addFirst lst x))
        lst))

This function should do as much memory allocation as the Clojure reverse but has vastly better parallel performance. There does seem to be something unusual about conj and clojure.lang.PersistentList in this parallel test case, and I don't think it's related to the JVM's memory allocation.

Cameron.
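Cameron's `copy-to-java-list` leans entirely on `java.util.LinkedList`; the same reversal pattern in plain Java looks like this (a minimal sketch, with `ReverseCopy`/`reverseCopy` as invented names). The loop allocates one node per element, just like building a cons list, but the `addFirst` call site only ever sees one concrete class:

```java
import java.util.LinkedList;
import java.util.List;

public class ReverseCopy {
    // Reverse coll by pushing each element onto the front of a LinkedList.
    // One node allocation per element, no polymorphic call site in the loop.
    static <T> LinkedList<T> reverseCopy(Iterable<T> coll) {
        LinkedList<T> lst = new LinkedList<>();
        for (T x : coll) lst.addFirst(x);
        return lst;
    }

    public static void main(String[] args) {
        System.out.println(reverseCopy(List.of(1, 2, 3)));  // prints "[3, 2, 1]"
    }
}
```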
Re: abysmal multicore performance, especially on AMD processors
The main GC feature here is the Thread-Local Allocation Buffer (TLAB). TLABs are on by default and are automatically sized according to allocation patterns. The size can also be fine-tuned with the -XX:TLABSize=n configuration option. You may consider tweaking this setting to optimize runtime. Basically, everything that one call to your function allocates should fit into a TLAB, because it is all garbage upon exit. Allocation inside a TLAB is ultra-fast and completely concurrent.

Configure TLAB: http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html

On Sunday, December 9, 2012 7:37:09 PM UTC+1, Andy Fingerhut wrote:

> On Dec 9, 2012, at 6:25 AM, Softaddicts wrote:
>
>> If the number of object allocations mentioned earlier in this thread
>> are real, yes, VM heap management can be a bottleneck. There has to be
>> some locking done somewhere, otherwise the heap would corrupt :) The
>> other bottleneck can come from garbage collection, which has to freeze
>> object allocation completely or partially. This internal process has
>> to reclaim unreferenced objects, otherwise you may end up exhausting
>> the heap. That can even suspend your app while GC is running,
>> depending on the strategy used.
>
> Agreed that memory allocation and garbage collection will in some cases
> need to coordinate between threads to work in the general case of
> arbitrary allocations and GCs. However, one could have a central list
> of large pages of free memory (e.g. a few MBytes, or maybe even
> larger), pass these out to concurrent memory allocators in large
> chunks, and let them do small object allocations and GC within each
> thread completely concurrently. The only times locking of any kind
> might be needed with such a strategy would be when one of the parallel
> threads requests a new big page from the central free list, or returns
> a completely empty free page back to the central list that it doesn't
> need any more. All other memory allocation and GC could be completely
> concurrent. The idea of making those pages large is that such passing
> of pages around would be infrequent, and thus could be made to have no
> measurable synchronization overhead.
>
> That is pretty much what is happening when you run Lee's benchmark
> programs as 1 thread per JVM, but 4 or 8 different JVM processes
> running in parallel. In that situation the OS has the central free
> list of pages, and the JVMs manage their small object allocations and
> GCs completely concurrently without interfering with each other. If
> HotSpot's JVM could be configured to work like that, he would be
> seeing big speedups in a single JVM.
>
> Andy
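To make the TLAB point concrete: each worker below only produces short-lived garbage, so with TLABs enabled the fast-path allocations are thread-local bump-pointer writes and the threads should not contend on shared heap state. This is an illustrative sketch (`AllocScaling` and `burn` are invented names, and actual scaling depends on the collector and heap settings):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AllocScaling {
    // Allocates one short-lived object per iteration; everything is garbage
    // on return, which is exactly the "dies in the TLAB" pattern.
    static long burn(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            long[] box = new long[]{i};  // fresh allocation each iteration
            acc += box[0];
        }
        return acc;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();
        for (int t = 0; t < 4; t++)
            results.add(pool.submit(() -> burn(1_000_000)));
        long total = 0;
        for (Future<Long> f : results)
            total += f.get();
        pool.shutdown();
        // Compare wall time at 1 vs. 4 threads, and try -XX:TLABSize=...
        // or -XX:-UseTLAB to see how much the thread-local fast path buys.
        System.out.println(total);
    }
}
```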
Re: abysmal multicore performance, especially on AMD processors
cameron cdor...@gmail.com writes:

> There does seem to be something unusual about conj and
> clojure.lang.PersistentList in this parallel test case and I don't
> think it's related to the JVM's memory allocation.

I've got a few more data points, but still no handle on what exactly is going on.

My last benchmark, showing the `conj*` speedup for `Cons` objects degrading as soon as it was used on a `PersistentList`, was incomplete. In fact, the speedup degrades after `conj*` is used on objects of more than one type. The effect just appears immediately when used with `PersistentList` because '() is in fact a different type, a `PersistentList$EmptyList`. Using `conj*` first in the vector implementation then results in the same inverse speedup on `Cons`es.

Even without your near-optimal speedup using Java standard library types, I think your earlier benchmarks are enough to demonstrate that this isn't an issue with allocation alone. All of the implementations based on `reduce` with `conj` must allocate and return a new object for each iteration. If parallel allocation were the sole issue, I'd expect all of the implementations to demonstrate the same behavior.

Unfortunately I have no idea what to connect from these facts:

- Parallel allocation of `Cons` and `PersistentList` instances through a Clojure `conj` function remains fast as long as the function only ever returns objects of a single concrete type.

- Parallel allocation speed for `PersistentVector` instances is unaffected by `conj` returning multiple types, and does not demonstrate the inverse speedup seen for the previous types.

At this point I believe the symptoms point to cache contention, but I don't know where or why. Using OpenJDK 7 with -XX:+UseCondCardMark didn't appear to produce any improvement. Creating a private copy of `PersistentList` which contained additional padding fields likewise didn't appear to produce any improvement.

So, Lee Spector: I think it's possible to work around this by just not using `conj` on lists. It's suboptimal, but at least it solves the problem in your original benchmark. Further improvements are obviously possible, but that's a start.

-Marshall
Re: abysmal multicore performance, especially on AMD processors
- Parallel allocation of `Cons` and `PersistentList` instances through a Clojure `conj` function remains fast as long as the function only ever returns objects of a single concrete type A possible explanation for this could be JIT Deoptimization. Deoptimization happens when the JIT optimized a hot code path based on what has learned from the executing program, but no longer holds, then the compiler notices it has made an incorrect decision in the previous optimization, and performs the deoptimization. It might be the case that after a deoptimization, the JIT compiler reoptimizes a code path the was previously optimized. This frequently happens when dealing different implementations of an interface, and/or inheritance and type hierarchy due to the code path execution changing constantly and having to go through multiple optimizations and deoptimizations. To identify whether deoptimization is happening, you can add the following flag to jvm args -XX:+PrintCompilation If the output contains prints made not entrant then that indicates a deoptimization, and that method will be interpreted until a certain threshold of executions is surpassed, and get optimized again. Reference: Robust Java Benchmarking, http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html#do Java Performance, Charlie Hunt and Binu John On Monday, December 10, 2012 6:17:47 AM UTC-5, Marshall Bockrath-Vandegrift wrote: cameron cdo...@gmail.com javascript: writes: There does seem to be something unusual about conj and clojure.lang.PersistentList in this parallel test case and I don't think it's related to the JVMs memory allocation. I’ve got a few more data-points, but still no handle on what exactly is going on. My last benchmark showing the `conj*` speedup for `Cons` objects degrading as soon as it was used on a `PersistantList` was incomplete. In fact, the speedup degrades after it is used on objects of more than one type. 
The effect just appears immediately when used with `PersistantList` because '() is in fact a different a `PersistantList$EmptyList`. Using `conj*` first in vector implementation then results in the same inverse speedup on `Cons`s. Even without your near-optimal speedup using Java standard library types, I think your earlier benchmarks are enough to demonstrate that this isn’t an issue with allocation alone. All of the implementations based on `reduce` with `conj` must allocate and return a new object for each iteration. If parallel allocation were the sole issue, I’d expect all of the implementations to demonstrate the same behavior. Unfortunately I have no idea what to connect from these facts: - Parallel allocation of `Cons` and `PersistentList` instances through a Clojure `conj` function remains fast as long as the function only ever returns objects of a single concrete type - Parallel allocation speed for `PersistentVector` instances is unaffected by `conj` returning multiple types, and does not demonstrate the inverse speedup seen for the previous types. At this point I believe the symptoms point to cache contention, but I don’t know where or why. Using OpenJDK 7 with -XX:+UseCondMark didn’t appear to produce any improvement. Creating a private copy of `PersistentList` which contained additional padding fields likewise didn’t appear to produce any improvement. So, Lee Spector: I think it’s possible to work around this though by just not using `conj` on lists. It’s suboptimal, but at least solves the problem in your original benchmark. Further improvements are obviously possible, but that’s a start. -Marshall -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. 
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
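The -XX:+PrintCompilation workflow suggested above can be turned into a quick triage step; a sketch (the log excerpt below is illustrative, not real output — on a real run you would redirect the JVM's stdout to a file instead):

```shell
# Hypothetical excerpt standing in for real -XX:+PrintCompilation output.
cat > /tmp/compilation.log <<'EOF'
 1173  159  clojure.lang.Cons::next (10 bytes)  made not entrant
 1192  170  clojure.lang.PersistentList::first (5 bytes)
33752  164  clojure.lang.RT::conj (21 bytes)  made zombie
EOF

# Deoptimizations: compiled methods invalidated by new observations
grep -c 'made not entrant' /tmp/compilation.log

# Dead compiled versions awaiting code-cache reclamation
grep -c 'made zombie' /tmp/compilation.log
```

A handful of each early in a run is normal warm-up; what would be suspicious is the same hot method cycling between states for the whole run.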
Re: abysmal multicore performance, especially on AMD processors
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well). For instance, from a pmapall run:

33752 159 clojure.lang.Cons::next (10 bytes) made zombie
33752 164 clojure.lang.RT::conj (21 bytes) made zombie
33753 154 clojure.lang.RT::seq (32 bytes) made zombie
337535 % clojure.core$reduce1::invoke @ -2 (184 bytes) made zombie
33753 167 clojure.core$conj::invoke (13 bytes) made zombie
33753 184 clojure.core$rest::invoke (7 bytes)
33884 186 clojure.lang.Numbers$LongOps::isPos (15 bytes)
34298 187 clojure.lang.Numbers::dec (17 bytes)
34421 188 clojure.lang.Numbers$LongOps::dec (13 bytes)
34897 189 clojure.core$take::invoke (24 bytes)
34903 190 clojure.core$take$fn__4227::init (15 bytes)
35386 191 clojure.core$iterate::invoke (39 bytes)

SNIP

1168 154 clojure.lang.RT::seq (32 bytes) made not entrant
1169 168 clojure.lang.RT::seq (32 bytes)
11715 % clojure.core$reduce1::invoke @ -2 (184 bytes) made not entrant
1171 169 clojure.core$reduce1::invoke (184 bytes)
1173 159 clojure.lang.Cons::next (10 bytes) made not entrant
1173 167 clojure.core$conj::invoke (13 bytes) made not entrant
1173 164 clojure.lang.RT::conj (21 bytes) made not entrant
1192 170 clojure.lang.PersistentList::first (5 bytes)
1193 171 clojure.lang.PersistentList::next (18 bytes)
1193 172 clojure.lang.Cons::next (10 bytes)
1194 173 clojure.lang.RT::conj (21 bytes)
1197 174 clojure.core$conj::invoke (13 bytes)
12336 % clojure.core$reduce1::invoke @ 0 (184 bytes)

And then, from a map run:

1163 151 clojure.lang.RT::seq (32 bytes) made not entrant
1163 145 clojure.core$cons::invoke (10 bytes) made not entrant
1163 168 clojure.lang.RT::seq (32 bytes)
11654 % clojure.core$reduce1::invoke @ 0 (184 bytes)
1169 144 clojure.lang.RT::cons (46 bytes) made not entrant
3467 169 clojure.lang.RT::cons (46 bytes)
3470 24 clojure.lang.Util::equiv (65 bytes) made zombie
3470 23 java.lang.String::equals (88 bytes) made zombie
3470 18 clojure.lang.PersistentArrayMap::createWithCheck (80 bytes) made zombie
3470 26 clojure.lang.PersistentArrayMap::equalKey (6 bytes) made zombie
3617 170 clojure.core$cons::invoke (10 bytes)
3622 41 clojure.lang.PersistentArrayMap::indexOf (34 bytes) made zombie
3622 30 ! java.net.URL::init (543 bytes) made zombie
3622 58 clojure.lang.Symbol::equals (49 bytes) made zombie
3623 73 java.lang.Object::equals (11 bytes) made zombie
3623 65 clojure.lang.Util::hasheq (43 bytes) made zombie
4249 171 s! clojure.lang.LazySeq::sval (54 bytes)
4259 77 clojure.lang.Util::equiv (65 bytes) made zombie
4259 89 clojure.lang.RT::first (35 bytes) made zombie
4259 88 clojure.lang.PersistentHashMap$NodeSeq::create (94 bytes) made zombie
4578 172 n java.lang.Object::getClass (0 bytes)
5380 173 clojure.lang.AFunction::init (5 bytes)
5634 174 java.lang.Long::longValue (5 bytes)
5785 175 s clojure.lang.LazySeq::seq (53 bytes)
5830 176 clojure.lang.Numbers::ops (97 bytes)
6168 177 clojure.lang.LazySeq::init (10 bytes)
6169 32 java.lang.AbstractStringBuilder::append (48 bytes) made zombie
10727 178 java.lang.Long::valueOf (36 bytes)
10730 37 java.lang.StringBuilder::append (8 bytes) made zombie
10730 49 ! sun.misc.URLClassPath$JarLoader::getResource (91 bytes) made zombie
10730 44 sun.misc.URLClassPath::getResource (74 bytes) made zombie
11121 179 clojure.lang.Numbers::num (5 bytes)
11240 180 clojure.core$rest::invoke (7 bytes)
11240 181 clojure.lang.RT::more (37 bytes)
11242 144 clojure.lang.RT::cons (46 bytes) made zombie
11242 145 clojure.core$cons::invoke (10 bytes) made zombie
11242 151 clojure.lang.RT::seq (32 bytes) made zombie

On Mon, Dec 10, 2012 at 11:21 AM, meteorfox ctorresk8guitar@gmail.com wrote: > ... -XX:+PrintCompilation ...
Re: abysmal multicore performance, especially on AMD processors
I tried some more performance tuning options in Java, just for kicks, and didn't get any advantages from them: -server -XX:+TieredCompilation -XX:ReservedCodeCacheSize=256m

Also, in case it's informative:

[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt | wc -l
173
[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-17.txt | wc -l
178
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-17.txt | wc -l
163
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-1.txt | wc -l
158
[josiah@compute-1-17 benchmark]$

On Mon, Dec 10, 2012 at 12:55 PM, Wm. Josiah Erikson wmjos...@gmail.com wrote: [quoted PrintCompilation log snipped]
Re: abysmal multicore performance, especially on AMD processors
Wm. Josiah Erikson wmjos...@gmail.com writes: Aha. Not only do I get a lot of "made not entrant", I get a lot of "made zombie". However, I get this for both runs with map and with pmap (and with pmapall as well)

I’m not sure this is all that enlightening. From what I can gather, "made not entrant" just means that a JITed version proved to be invalid in light of later code, and new invocation paths won’t use the previous version. And "made zombie" just means all references to an old JITed version have been lost, making it available to be GC’ed. A copy of `conj` becomes "not entrant" after being used on both vectors and lists, but the new version gets the same speedup when used on vectors as a copy which has only been used on vectors. There’s something else going on which is specifically affecting parallel calls to the polymorphic version when applied to instances of `PersistentList` and `Cons`. -Marshall
Re: abysmal multicore performance, especially on AMD processors
Interesting. I tried the following:

:jvm-opts ["-Xmx10g" "-Xms10g" "-XX:+AggressiveOpts" "-server" "-XX:+TieredCompilation" "-XX:ReservedCodeCacheSize=256m" "-XX:TLABSize=1G" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+UseParNewGC" "-XX:+ResizeTLAB" "-XX:+UseTLAB"]

I got a slight slowdown, and the GC details are as follows:

[josiah@compute-1-1 benchmark]$ /usr/bin/time -f %E lein run
0.852: [GC 0.852: [ParNew: 2796224K->6726K(3145728K), 0.0136610 secs] 2796224K->6726K(10136256K), 0.0136930 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
0.871: [GC 0.871: [ParNew: 2802950K->8104K(3145728K), 0.0116360 secs] 2802950K->8104K(10136256K), 0.0116720 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
0.890: [GC 0.890: [ParNew: 2804328K->11538K(3145728K), 0.0112460 secs] 2804328K->11538K(10136256K), 0.0112720 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
0.904: [GC 0.904: [ParNew: 2807762K->11752K(3145728K), 0.0092300 secs] 2807762K->11752K(10136256K), 0.0092550 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
0.915: [GC 0.915: [ParNew: 2807976K->10702K(3145728K), 0.0072210 secs] 2807976K->10702K(10136256K), 0.0072480 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
0.969: [GC 0.969: [ParNew: 2806926K->12249K(3145728K), 0.0206880 secs] 2806926K->12249K(10136256K), 0.0207160 secs] [Times: user=0.13 sys=0.01, real=0.02 secs]
21.099: [GC 21.099: [ParNew: 2808473K->14256K(3145728K), 0.0174230 secs] 2808473K->14256K(10136256K), 0.0174580 secs] [Times: user=0.12 sys=0.00, real=0.02 secs]
46.533: [GC 46.533: [ParNew: 2810480K->10070K(3145728K), 0.0097840 secs] 2810480K->10070K(10136256K), 0.0098140 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
74.988: [GC 74.988: [ParNew: 2806294K->11576K(3145728K), 0.0134020 secs] 2806294K->11576K(10136256K), 0.0134330 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
105.143: [GC 105.143: [ParNew: 2807800K->12728K(3145728K), 0.0121870 secs] 2807800K->12728K(10136256K), 0.0122240 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
136.170: [GC 136.170: [ParNew: 2808952K->13336K(3145728K), 0.0144400 secs] 2808952K->13336K(10136256K), 0.0144720 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
167.703: [GC 167.703: [ParNew: 2809560K->14763K(3145728K), 0.0105520 secs] 2809560K->14763K(10136256K), 0.0105830 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
199.593: [GC 199.593: [ParNew: 2810987K->11407K(3145728K), 0.0142030 secs] 2810987K->11407K(10136256K), 0.0142350 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
231.894: [GC 231.894: [ParNew: 2807631K->15066K(3145728K), 0.0129290 secs] 2807631K->15066K(10136256K), 0.0129630 secs] [Times: user=0.10 sys=0.01, real=0.01 secs]
264.239: [GC 264.239: [ParNew: 2811290K->9632K(3145728K), 0.0119130 secs] 2811290K->9632K(10136256K), 0.0119850 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
Elapsed time: 291038.415325 msecs

Heap
 par new generation total 3145728K, used 2700935K [0x00057ae0, 0x00065035, 0x00065035)
  eden space 2796224K, 96% used [0x00057ae0, 0x00061f239bb0, 0x0006258b)
  from space 349504K, 2% used [0x00063ae0, 0x00063b768340, 0x00065035)
  to space 349504K, 0% used [0x0006258b, 0x0006258b, 0x00063ae0)
 tenured generation total 6990528K, used 0K [0x00065035, 0x0007fae0, 0x0007fae0)
   the space 6990528K, 0% used [0x00065035, 0x00065035, 0x000650350200, 0x0007fae0)
 compacting perm gen total 21248K, used 11049K [0x0007fae0, 0x0007fc2c, 0x0008)
   the space 21248K, 52% used [0x0007fae0, 0x0007fb8ca638, 0x0007fb8ca800, 0x0007fc2c)
 No shared spaces configured.
4:53.06
[josiah@compute-1-1 benchmark]$

On Mon, Dec 10, 2012 at 3:33 PM, Marshall Bockrath-Vandegrift llas...@gmail.com wrote: [quoted message snipped]
Re: abysmal multicore performance, especially on AMD processors
Hi Lee, Would it be difficult to try the following version of 'pmap'? It doesn't use futures but executors instead, so at least this could help narrow the problem down... If the problem is due to the high number of futures spawned by pmap then this should fix it...

(defn- with-thread-pool* [num-threads f]
  (let [pool (java.util.concurrent.Executors/newFixedThreadPool num-threads)]
    (try (f pool)
         (finally (when pool (.shutdown pool))))))

(defmacro with-thread-pool [[name num-threads] & body]
  `(with-thread-pool* ~num-threads (fn [~name] ~@body)))

(defn pmapp [f coll]
  (with-thread-pool [pool (.. Runtime getRuntime availableProcessors)]
    (doall
     (map #(.get %)
          (.invokeAll pool (map (partial partial f) coll))))))

And btw, reducers are great but usually you need to reformulate your problem... r/map is not parallel. It is serial but with no intermediate allocation (unlike lazy-seqs). If you want to make it parallel you need to r/fold the result of r/map (the reducer), providing reducing and combining fns. However, it still doesn't make sense for it to be slower than map... If you've followed Rich's example with the pie-maker, consider this: both map and r/map will return immediately. The difference is that map will build a recipe for the first item whereas r/map will build a recipe for the entire coll. In other words, the lazy pie-maker assistant provides 1 apple at a time (after asking for an apple) but the reducer pie-maker assistant, as soon as you ask for the 1st apple, will do the entire bag (without intermediate 'asking'). The lazy recipe is recursive whereas the reducer-based one looks like a stream... It should still be faster, albeit not terribly faster. You need r/fold to see any parallelism... Jim

On 09/12/12 05:19, Lee Spector wrote: On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote: Yep, reducers don't use lazy seqs. But they return just sth. like transformed functions, that will be applied when building the collection. So you can use them like this:

(into [] (r/map burn (doall (range 4))))

See http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more info...

Thanks Marek. This does fix the "too quick to be true" issue, but alas, on my mac laptop (a 4 core intel i7):

57522.869 msecs for (time (into [] (r/map burn (doall (range 4)))))
58263.312 msecs for (time (doall (map burn (doall (range 4)))))

So while I'm not getting a terrible slowdown from using the reducers version of map, I'm also not getting any speedup over the single-thread map. We should try this on our other architectures too, but it doesn't look promising. -Lee

-- Lee Spector, Professor of Computer Science Cognitive Science, Hampshire College 893 West Street, Amherst, MA 01002-3359 lspec...@hampshire.edu, http://hampshire.edu/lspector/ Phone: 413-559-5352, Fax: 413-559-5438
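Jim's point that a parallel fold needs both a reducing fn and a combining fn can be illustrated with Java streams standing in for Clojure's reducers; a sketch of the shape of r/fold, not its implementation:

```java
import java.util.stream.LongStream;

public class FoldShape {
    public static void main(String[] args) {
        // Serial: like (reduce + (r/map #(* 2 %) coll)) — one pass, one thread.
        long serial = LongStream.range(0, 1_000_000).map(x -> x * 2).sum();

        // Parallel: like r/fold — an identity value, a reducing step, and a
        // combining fn (here Long::sum serves as both) let the runtime split
        // the work across cores and merge the partial results.
        long parallel = LongStream.range(0, 1_000_000).parallel()
                .map(x -> x * 2)
                .reduce(0L, Long::sum);

        System.out.println(serial == parallel); // prints true
    }
}
```

The combining fn is what makes the parallel version possible at all: without a way to merge two partial reductions, the work cannot be split.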
Re: abysmal multicore performance, especially on AMD processors
cameron cdor...@gmail.com writes: Interesting problem, the slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument).

Excellent analysis, sir! I think this points things in the right direction.

fast-reverse : map-ms: 3.3, pmap-ms 0.7, speedup 4.97
list-cons : map-ms: 4.0, pmap-ms 0.7, speedup 6.13

The difference between these two I believe is just jitter. Once `cons` is called on either a list or a vector, the result is a `clojure.lang.Cons` object.

vec-conj : map-ms: 4.0, pmap-ms 1.3, speedup 3.10

For the sub-goal of optimizing `reverse`, I get better times even than for the `cons`-based implementation by using transient vectors:

list-cons : map-ms: 4.0, pmap-ms 0.6, speedup 6.42
tvec-conj : map-ms: 0.9, pmap-ms 0.2, speedup 4.10

(defn tvec-conj [coll]
  (persistent! (reduce conj! (transient []) coll)))

list-conj : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50

(this is equivalent to the original code) I add the following:

cons-conj : map-ms: 3.3, pmap-ms 16.8, speedup 0.19

(defn cons-conj [coll]
  (reduce conj (clojure.lang.Cons. (first coll) nil) (rest coll)))

I think this is the key, but I don’t understand it. The `cons` function just immediately creates a new `Cons` object. The `conj` function calls the `.cons` method of the collection, and the `.cons` implementation that `Cons` inherits from `ASeq` just creates a new `Cons` object! It’s like there’s a lock of some sort sneaking in on the `conj` path. Any thoughts on what that could be? -Marshall
Re: abysmal multicore performance, especially on AMD processors
If the numbers of object allocations mentioned earlier in this thread are real, yes, vm heap management can be a bottleneck. There has to be some locking done somewhere, otherwise the heap would get corrupted :) The other bottleneck can come from garbage collection, which has to freeze object allocation completely or partially. This internal process has to reclaim unreferenced objects, otherwise you may end up exhausting the heap. It can even suspend your app while gc is running, depending on the strategy used. In java 6, the vm is supposed to adjust the gc strategy by looking at your app's behavior. That may take some time to profile, however. If your app has this heavy memory requirement abruptly, the vm may not have enough data to adapt. Have a look here, you will find how to set the gc behavior explicitly: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html This is specific to java 6; java 5 is a different beast, as is java 7. Maybe this can help. I emphasize, *can help*, not having looked deeply into your problem. Luc P.

On Dec 8, 2012, at 10:19 PM, meteorfox wrote: Now if you run vmstat 1 while running your benchmark you'll notice that the run queue will be most of the time at 8, meaning that 8 processes are waiting for CPU, and this is due to memory accesses (in this case, since this is not true for all applications). So, I feel your benchmark may be artificial and does not truly represent your real application; make sure to profile your real application, and optimize according to the bottlenecks. There are really useful tools out there for profiling, such as VisualVM, perf.

Thanks Carlos. I don't actually think that the benchmark is particularly artificial. It's very difficult to profile my actual application because it uses random numbers all over the place and is highly and nonlinearly variable in lots of ways. But I think that the benchmark I'm running really is pretty representative. In any event, WHY would all of that waiting be happening? Logically, nothing should have to be waiting for anything else. We know from the fact that we get good speedups from multiple simultaneous JVM runs, each just running one call to my burn function, that the hardware is capable of performing well with multiple instances of this benchmark running concurrently. It MUST be able to handle all of the memory allocation and everything else. It's just when we try to launch them all in parallel from the same Clojure process, using pmap OR agents OR reducers, that we fail to get the concurrency speedups. Why should this be? And what can be done about it? I know nearly nothing about the internals of the JVM, but is there perhaps a bottleneck on memory allocation because there's a single serial allocator? Perhaps when we run multiple JVMs each has its own allocator, so we don't have the bottleneck? If all of this makes sense and is true, then perhaps (wishful thinking) there's a JVM option like "use parallel memory allocators" that will fix it?! Thanks, -Lee

-- Softaddicts lprefonta...@softaddicts.ca sent by ibisMail from my ipad!
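For reference, the allocation- and GC-related knobs in the direction Lee is asking about can be set on the command line; a sketch only (the jar and main class names are placeholders, and exact flags and defaults vary by JVM version):

```shell
# TLABs (thread-local allocation buffers) give each thread its own private
# allocation region so the hot allocation path needs no locking; they are
# on by default in modern HotSpot, so these flags mostly aid diagnostics.
java -server \
     -XX:+UseTLAB -XX:+ResizeTLAB \
     -XX:+PrintTLAB \
     -XX:+UseParallelGC -XX:ParallelGCThreads=8 \
     -verbose:gc \
     -cp myapp.jar my.app.Main
```

If TLABs were the whole story, though, the single-process slowdown discussed in this thread should not occur, which is what makes the symptoms here puzzling.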
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 9:37 PM, Lee Spector wrote: [quoted message snipped]

Lee: I don't yet know how to get good speedups with this workload in the Oracle JVM, although it might be possible with options I'm unaware of. Azul's Zing JVM has a different memory allocator and GC implementation that might be better tuned for parallel workloads. I haven't used it myself yet -- this is just from hearing about it in the past and a brief scan of their web site. They have free trials available. Maybe you could try that to see if it gives you better results out of the box, or with minor tweaking of parameters? I don't know the cost, but like many companies they might have significant discounts for educational customers. Andy
Re: abysmal multicore performance, especially on AMD processors
On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote: It’s like there’s a lock of some sort sneaking in on the `conj` path. Any thoughts on what that could be?

My current best guess is the JVM's memory allocator, not Clojure code. Andy
Re: abysmal multicore performance, especially on AMD processors
Andy Fingerhut andy.finger...@gmail.com writes: My current best guess is the JVM's memory allocator, not Clojure code.

I didn’t mean to imply the problem was in Clojure itself, but I don’t believe the issue is in the memory allocator either. I now believe the problem is in a class of JIT optimization HotSpot is performing which turns into a “pessimization” in the parallel case. I have the following code, taking the structure from Cameron’s benchmarks, but replacing the exact functions being tested: https://gist.github.com/4246320 Note that `conj*` is simply a copy-paste of the `conj` implementation from clojure/core.clj. The benchmark runs reduce with that `conj*` function once on `Cons`es, then on `PersistentList`s, then again on `Cons`es. And here are the results:

cons-conj* : map-ms: 5.6, pmap-ms 1.1, speedup 5.08
list-conj* : map-ms: 10.1, pmap-ms 15.9, speedup 0.63
cons-conj* : map-ms: 10.0, pmap-ms 15.6, speedup 0.64

The function performs fine on `Cons` objects, but once applied to `PersistentList` objects, `conj*` somehow becomes “tainted” and acquires the inverse speedup factor. When I run with -XX:+PrintCompilation I don’t see anything particularly suspicious, but I’m not well-versed in the art of interpreting HotSpot’s entrails. My idea for next steps is to create a copy of the PersistentList class and selectively modify it in an attempt to identify what aspect of it causes the JVM to produce this behavior. -Marshall
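The "tainting" described above is consistent with call-site type-profile pollution: once a call site has seen more than one receiver class, HotSpot can no longer devirtualize and inline a single implementation there. A sketch of the shape of the problem in plain Java (class and method names are illustrative, not Clojure's actual classes):

```java
// Two implementations of the same interface, mirroring Cons vs. PersistentList.
interface Seq {
    Seq cons(Object o);   // analogous role to ASeq.cons
    Object first();
}

final class ConsSeq implements Seq {
    final Object head; final Seq tail;
    ConsSeq(Object head, Seq tail) { this.head = head; this.tail = tail; }
    public Seq cons(Object o) { return new ConsSeq(o, this); }
    public Object first() { return head; }
}

final class ListSeq implements Seq {
    final Object head; final Seq tail;
    ListSeq(Object head, Seq tail) { this.head = head; this.tail = tail; }
    public Seq cons(Object o) { return new ListSeq(o, this); }
    public Object first() { return head; }
}

public class ProfilePollution {
    // A single shared call site for s.cons(): while only one receiver class
    // has ever been observed here, HotSpot can inline that implementation;
    // once a second class shows up, the site's type profile stays polluted.
    static Seq build(Seq seed, int n) {
        Seq s = seed;
        for (int i = 0; i < n; i++) s = s.cons(i);
        return s;
    }

    public static void main(String[] args) {
        build(new ConsSeq(-1, null), 100_000); // monomorphic so far: fast path
        build(new ListSeq(-1, null), 100_000); // call site is now bimorphic
        build(new ConsSeq(-1, null), 100_000); // stays on the slower path
    }
}
```

This matches the observed pattern: the first cons-conj* run is fast, and both runs after the list run are equally slow, because it is the shared `conj*` call site, not the collection classes, that carries the profile.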
Re: abysmal multicore performance, especially on AMD processors
On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: If the number of object allocation mentioned earlier in this thread are real, yes vm heap management can be a bottleneck. There has to be some locking done somewhere otherwise the heap would corrupt :) The other bottleneck can come from garbage collection which has to freeze object allocation completely or partially. This internal process has to reclaim unreferenced objects otherwise you may end up exhausting the heap. That can even susoend your app while gc is running depending on the strategy used. Agreed that memory allocation and garbage collection will in some cases need to coordinate between threads to work in the general case of arbitrary allocations and GCs. However, one could have a central list of large pages of free memory (e.g. a few MBytes, or maybe even larger), and pass these out to concurrent memory allocators in these large chunks, and let them do small object allocations and GC within each thread completely concurrently. The only times locking of any kind might be needed with such a strategy would be when one of the parallel threads requests a new big page from the central free list, or returned a completely empty free page back to the central list that it didn't need any more. All other memory allocation and GC could be completely concurrent. The idea of making those pages large is that such passing pages around would be infrequent, and thus could be made to have no measurable synchronization overhead. That is pretty much what is happening when you run Lee's benchmark programs as 1 thread per JVM, but 4 or 8 different JVMs processes running in parallel. In that situation the OS has the central free list of pages, and the JVMs manage their small object allocations and GCs completely concurrently without interfering with each other. If HotSpot's JVM could be configured to work like that, he would be seeing big speedups in a single JVM. 
Andy
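The page-based scheme Andy describes is essentially what HotSpot's thread-local allocation buffers (TLABs) already do for small objects. A toy model of the idea, with all names illustrative (real TLABs carve buffers out of eden rather than calling new):

```java
import java.util.concurrent.atomic.AtomicInteger;

class CentralHeap {
    static final int PAGE_SIZE = 1 << 20; // 1 MiB "pages"
    private final AtomicInteger pagesHanded = new AtomicInteger();

    // Handing out a fresh page is the only step that touches shared state.
    byte[] takePage() {
        pagesHanded.incrementAndGet();
        return new byte[PAGE_SIZE];
    }

    int pagesHanded() { return pagesHanded.get(); }
}

class ThreadLocalAllocator {
    private final CentralHeap heap;
    private byte[] page;
    private int cursor;

    ThreadLocalAllocator(CentralHeap heap) { this.heap = heap; }

    // Bump-pointer allocation: no locks, no atomics on the hot path.
    int allocate(int size) {
        if (page == null || cursor + size > page.length) {
            page = heap.takePage();
            cursor = 0;
        }
        int offset = cursor;
        cursor += size;
        return offset;
    }
}

public class TlabSketch {
    public static void main(String[] args) throws InterruptedException {
        CentralHeap heap = new CentralHeap();
        Runnable work = () -> {
            ThreadLocalAllocator alloc = new ThreadLocalAllocator(heap);
            for (int i = 0; i < 100_000; i++) alloc.allocate(64); // ~6.1 MiB
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // 200,000 allocations across two threads, but only 14 trips to the
        // shared heap (7 pages each): the shared-state cost nearly vanishes.
        System.out.println(heap.pagesHanded()); // prints 14
    }
}
```

Since TLABs are on by default, plain allocation contention alone would not explain the slowdowns in this thread, which is consistent with the later evidence pointing at the JIT rather than the allocator.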
Re: abysmal multicore performance, especially on AMD processors
There's no magic here, everyone tuning their app hit this wall eventually, tweaking the JVM memory options :) Luc On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: If the number of object allocation mentioned earlier in this thread are real, yes vm heap management can be a bottleneck. There has to be some locking done somewhere otherwise the heap would corrupt :) The other bottleneck can come from garbage collection which has to freeze object allocation completely or partially. This internal process has to reclaim unreferenced objects otherwise you may end up exhausting the heap. That can even susoend your app while gc is running depending on the strategy used. Agreed that memory allocation and garbage collection will in some cases need to coordinate between threads to work in the general case of arbitrary allocations and GCs. However, one could have a central list of large pages of free memory (e.g. a few MBytes, or maybe even larger), and pass these out to concurrent memory allocators in these large chunks, and let them do small object allocations and GC within each thread completely concurrently. The only times locking of any kind might be needed with such a strategy would be when one of the parallel threads requests a new big page from the central free list, or returned a completely empty free page back to the central list that it didn't need any more. All other memory allocation and GC could be completely concurrent. The idea of making those pages large is that such passing pages around would be infrequent, and thus could be made to have no measurable synchronization overhead. That is pretty much what is happening when you run Lee's benchmark programs as 1 thread per JVM, but 4 or 8 different JVMs processes running in parallel. In that situation the OS has the central free list of pages, and the JVMs manage their small object allocations and GCs completely concurrently without interfering with each other. 
If HotSpot's JVM could be configured to work like that, he would be seeing big speedups in a single JVM. Andy -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -- Softaddicts lprefonta...@softaddicts.ca sent by ibisMail from my ipad!
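For what it's worth, HotSpot does have a mechanism roughly along these lines: each thread allocates from its own thread-local allocation buffer (TLAB) carved out of the shared eden space. A hedged sketch of :jvm-opts one might experiment with; the flags below are real HotSpot options, but whether they help this particular benchmark is an assumption, not something established in this thread:

```clojure
;; Sketch only: real HotSpot flags, untested against this benchmark.
:jvm-opts ["-Xmx1g" "-Xms1g"
           "-XX:+UseTLAB"        ; per-thread allocation buffers (on by default)
           "-XX:+ResizeTLAB"     ; let the VM grow TLABs for allocation-heavy threads
           "-XX:+UseParallelGC"  ; multi-threaded young-generation collection
           "-verbose:gc"]        ; check whether GC time dominates the run
```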
Re: abysmal multicore performance, especially on AMD processors
Even though this is very surprising (and sad) to hear, I'm afraid I've got different experiences... My reducer-based parallel minimax is about 3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit faster on my girlfriend's Intel i5 (2 physical cores + 2 virtual), I suspect due to the slightly larger L3 cache... So no complaints on my part. Now, with regard to pmap: the only occasions I find pmap useful are for coarse-grain concurrency... In other words, I only use pmap where I'd use Executors. Usually, when mapping a very expensive fn (that needs no coordination) across a very large collection of roughly equally-sized elements. In these cases I'm getting proper linear speedup, and I don't see why I wouldn't. If you're familiar with text-mining (I think you are), consider annotating a collection of documents (500 - 1000) based on some dictionaries and probabilistic taggers. This could take more than 15 seconds per document, so pmap does pay off as you would expect. I've not done any experiments on any clusters with reducers, but I wouldn't expect them to perform well when distributed across several nodes. Basically you'd have horrible locality for something that requires coordination (like minimax). However, if you could run something like a genetic algorithm across several nodes and minimax on each node, then I'd expect good data locality and thus tremendous speedup. This is exactly what I'm planning to do for training my chess neural net. Using 20-30 8-core machines would be ideal for this use-case (20-30 individuals in the population - each lifetime (5 games) running on a separate machine that is entirely devoted to minimax)... Anyway, like you, I am baffled by your experiences... You say that monitoring CPUs shows that they are all busy throughout the entire task. If this is truly the case then I really don't understand!
Otherwise, as you say, pmap may leave cores idle for a bit if some tasks take a lot longer than others... this is why I said equally-sized chunks before... make sure you monitor your CPUs closely to verify that they are indeed busy. This has bitten me in the past... Of course, in your case (range 8) indeed contains 'equally-sized' elements, so I don't know what to think! In any case there might be something going on that we're not seeing... Jim

On 08/12/12 01:25, Lee Spector wrote:

I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. But a colleague recently did some tests and the results are really quite alarming. On Intel machines we're seeing speedups but much less than I expected -- about a 2x speedup going from 1 to 8 cores. But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost twice as long on 8 cores as on 1. I'm baffled, and unhappy that my runs are probably going slower on 48-core and 64-core nodes than on single-core nodes. It's possible that I'm just doing something wrong in the way that I dispatch the tasks, or that I've missed some Clojure or JVM setting... but right now I'm mystified and would really appreciate some help. I'm aware that there's overhead for multicore distribution and that one can expect slowdowns if the computations that are being distributed are fast relative to the dispatch overhead, but this should not be the case here. We're distributing computations that take seconds or minutes, and not huge numbers of them (at least in our tests while trying to figure out what's going on). I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads.
But that also shouldn't be a big issue here because for this test all of the threads are doing the exact same computation. And I also tried using an agent-based dispatch approach that shouldn't have the pmap thread allocation issue, and the results were about the same. Note also that all of the computations in this test are purely functional and independent -- there shouldn't be any resource contention issues. The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0
             value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))

Then I have a main
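Jim's earlier point about coarse-grained use of pmap over equally-sized chunks can be made concrete. This is a hedged sketch, not code from the thread: `chunked-pmap` and `slow-inc` are made-up names, and `slow-inc` merely stands in for an expensive per-element function like a document annotator.

```clojure
;; Hedged sketch: coarsen pmap's granularity by giving each parallel
;; task a chunk of inputs rather than a single element.
(defn chunked-pmap
  "Apply f to every element of coll in parallel, one task per chunk of
  size n. Returns results in order, like map."
  [n f coll]
  (->> (partition-all n coll)
       (pmap (fn [chunk] (mapv f chunk)))
       (apply concat)))

;; `slow-inc` stands in for an expensive per-element function.
(defn slow-inc [x] (reduce + x (range 100)))
```

(chunked-pmap 100 slow-inc (range 1000)) should return the same sequence as (map slow-inc (range 1000)), but spawns only one parallel task per 100 elements instead of one per element.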
Re: abysmal multicore performance, especially on AMD processors
Lee Spector lspec...@hampshire.edu writes: I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads.

Although it doesn’t impact your benchmark, `pmap` may be further adversely affecting the performance of your actual program. There’s an open bug regarding `pmap` and chunked seqs: http://dev.clojure.org/jira/browse/CLJ-862 The impact is that `pmap` with chunked-seq input will spawn futures for its function applications in flights of 32, spawning as many flights as necessary to reach or exceed #CPUS + 2. On a 48-way system, it will initially launch 64 futures, then spawn an additional 32 every time the number of active unrealized futures drops below 50, leading to significant contention for a CPU-bound application. I hope it can be made useful in a future version of Clojure, but right now `pmap` is more of an attractive nuisance than anything else. -Marshall
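A commonly used workaround for the chunking behavior Marshall describes is to strip chunking from the input seq before handing it to pmap. The `unchunk` helper below is a well-known community idiom rather than part of clojure.core, so treat this as a sketch:

```clojure
;; Rebuild a seq one element at a time so pmap sees an unchunked seq
;; and keeps only about #CPUS + 2 futures in flight, instead of
;; spawning them in flights of 32.
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

;; Usage: (pmap burn (unchunk (range 48))) instead of (pmap burn (range 48))
```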
Re: abysmal multicore performance, especially on AMD processors
On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote:

When you say we can run multiple instances of the test on the same machine, do you mean that, for example, on an 8 core machine you run 8 different JVMs in parallel, each doing a single-threaded 'map' in your Clojure code and not a 'pmap'? And what kinds of speedups does that achieve? The results could help indicate whether the problem you are seeing is due to the hardware/OS, or something about multiple threads within a single JVM.

Here are a couple more results, courtesy of my colleague Josiah Erikson (who may chime in here at some point too). On a 4 core machine, each test within a single lein run call:

2:31 for (doall (pmap burn (range 8)))
1:29 for (doall (map burn (range 8)))
1:48 for (doall (pmapall burn (range 8))) [see my reply to Jim, immediately following this one, for pmapall]

Now with multiple lein run calls launched concurrently:

29 seconds to do 4 concurrent copies of (doall (map burn (range 2)))
33 seconds to do 8 concurrent copies of (burn 2) [the argument to burn doesn't matter, but that's what Josiah ran]
38 seconds to do 8 concurrent copies of (burn (rand-int 8)) [again, the argument shouldn't matter]
48 seconds to do 2 concurrent copies of (doall (map burn (range 4)))
1:07 to do 2 concurrent copies of (doall (pmap burn (range 4)))

What do we learn from this, aside from the fact that things are way wacky? Thanks for the help!! -Lee
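The shape of the multi-process experiment, sketched in shell. This is hypothetical: the real runs invoked lein, while `work` here is just a cheap CPU-bound stand-in, but the concurrency pattern is the same.

```shell
#!/bin/sh
# Each background job is an independent OS process, so allocation and GC
# never contend across jobs -- unlike threads inside one JVM.
work() {
  i=0
  while [ "$i" -lt 100000 ]; do i=$((i+1)); done  # stand-in for one (burn 2) run
}
for n in 1 2 3 4 5 6 7 8; do
  work &        # 8 concurrent copies, one per process
done
wait            # block until all 8 finish
echo "all 8 processes finished"
```

Timing the whole script (e.g. with `time`) gives the wall-clock figure comparable to the 33-second "8 concurrent copies" result above.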
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote:

Although it doesn’t impact your benchmark, `pmap` may be further adversely affecting the performance of your actual program. There’s an open bug regarding `pmap` and chunked seqs: http://dev.clojure.org/jira/browse/CLJ-862 The impact is that `pmap` with chunked-seq input will spawn futures for its function applications in flights of 32, spawning as many flights as necessary to reach or exceed #CPUS + 2. On a 48-way system, it will initially launch 64 futures, then spawn an additional 32 every time the number of active unrealized futures drops below 50, leading to significant contention for a CPU-bound application. I hope it can be made useful in a future version of Clojure, but right now `pmap` is more of an attractive nuisance than anything else.

(I said this would be in a reply to Jim, but now I see it makes more sense in a reply to this message of Marshall's.) In order to avoid issues with pmap we've also run tests using pmapall, defined as follows:

(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
  will not be lazy, 3) calls to f may occur in any order, to maximize
  multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

The results are more or less the same (awful). -Lee
Re: abysmal multicore performance, especially on AMD processors
My experiences in the past are similar to the numbers that Jim is reporting. I have recently been centering most of my crunching code around reducers. Is it possible for you to cook up a small representative test using reducers+fork/join (and potentially primitives in the intermediate steps)? Perhaps it can shed some light on contention or chunking issues (and eliminate some of the allocation concerns). Paul
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 1:28 PM, Paul deGrandis wrote:

My experiences in the past are similar to the numbers that Jim is reporting. I have recently been centering most of my crunching code around reducers. Is it possible for you to cook up a small representative test using reducers+fork/join (and potentially primitives in the intermediate steps)? Perhaps it can shed some light on contention or chunking issues (and eliminate some of the allocation concerns).

Just tried, my first foray into reducers, but I must not be understanding something correctly:

(time (r/map burn (doall (range 4))))

returns in less than a second on my macbook pro, whereas

(time (doall (map burn (range 4))))

takes nearly a minute. This feels like unforced laziness (although it's not quite that fast), but clojure.core.reducers/map involves no laziness, right? I think my burn function (reproduced below for convenience) forces all of the laziness in there... So what am I doing wrong? BTW the runs described in this message use Clojure 1.5.0-alpha3 -- the ones in previous messages in this thread used Clojure 1.3.0. -Lee

(defn burn
  ([] (loop [i 0
             value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))
Re: abysmal multicore performance, especially on AMD processors
Andy: The short answer is yes, and we saw huge speedups. My latest post, as well as Lee's, has details.

On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote:

On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

Another strange observation is that we can run multiple instances of the test on the same machine and (up to some limit, presumably) they don't seem to slow each other down, even though just one instance of the test appears to be maxing out all of the CPU according to top. I suppose that means that top isn't telling me what I thought -- my colleague says it can mean that something is blocked in some way with a full instruction queue. But I'm not interested in running multiple instances. I have single computations that involve multiple expensive but independent subcomputations, and I want to farm those subcomputations out to multiple cores -- and get speedups as a result. My subcomputations are so completely independent that I think I should be able to get speedups approaching a factor of n for n cores, but what I see is a factor of only about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD machines.

Lee: When you say we can run multiple instances of the test on the same machine, do you mean that, for example, on an 8 core machine you run 8 different JVMs in parallel, each doing a single-threaded 'map' in your Clojure code and not a 'pmap'? And what kinds of speedups does that achieve? The results could help indicate whether the problem you are seeing is due to the hardware/OS, or something about multiple threads within a single JVM. Andy
Re: abysmal multicore performance, especially on AMD processors
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved results over pmap. It looks like this:

(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
  will not be lazy, 3) calls to f may occur in any order, to maximize
  multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

Refer to Lee's first post for the benchmarking routine we're running. I figured that, in order to figure out if it was Java's multithreading that was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) to running 8 concurrent copies of (burn (rand-int 8)), or even just (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does this make sense? I THINK it does. If it doesn't, then that's cool - just let me know why and I'll feel less crazy, because I am finding my results rather confounding. On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 1:48 to do (doall (pmapall burn (range 8))). So that's weird, because although we do see decreased slowdown from using pmapall, we still don't see a speedup compared to map. Watching processor utilization while these are going on shows that map is using one core, and both pmap and pmapall are using all four cores fully, as they should. So, maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck?
Now here's the weird part: it takes around 29 seconds to do four concurrent copies of (doall (map burn (range 2))), and around 33 seconds to run 8 copies of (burn 2). Yes. Read that again. What? Watching top while this is going on shows what you would expect to see: when I run four concurrent copies, I've got four copies of Java using 100% of a core each, and when I run eight concurrent copies, I see eight copies of Java, all using around 50% of the processor each. Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))) and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))). What is going on here? Is Java's multithreading really THAT bad? This appears to me to prove that Java, or Clojure, has something very seriously wrong with it, or has outrageous amounts of overhead when spawning a new thread. No?

All run with :jvm-opts [-Xmx1g -Xms1g -XX:+AggressiveOpts] and Clojure 1.5.0-beta1 (I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all), Java(TM) SE Runtime Environment (build 1.7.0_03-b04), Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode), on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP. Any thoughts or ideas?

There's more weirdness, too, in case anybody is interested. I'm getting results that vary strangely from other benchmarks that are available, and make no sense to me. Check this out (these are incomplete, because I decided to dig deeper with the above benchmarks, but you'll see, I think, why this is so confusing, if you know how fast these processors are supposed to be). All run with :jvm-opts [-Xmx1g -Xms1g -XX:+AggressiveOpts] and Clojure 1.5.0-beta1, Java(TM) SE Runtime Environment (build 1.7.0_03-b04), Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode).

Key: 1 = (pmap range 8), 2 = (map range 8), 3 = 8 concurrent copies of (pmap range 8), 4 = 8 concurrent copies of (map range 8), 5 = pmapall range 8

                              1         2         3         4         5
4x AMD Opteron 6168:          4:02.06   2:20.29
AMD Phenom II X4 945:         2:31.65   1:29.90   3:32.60   3:08.97   1:48.36
AMD Phenom II X6 1100T:       2:03.71   1:14.76   2:20.14   1:57.38   2:14.43
AMD FX 8120:                  4:50.06   1:25.04   5:55.84   2:46.94   4:36.61
AMD FX 8350:                  3:42.35   1:13.94   3:00.46   2:06.18   3:56.95
Intel Core i7 3770K:          0:44      1:37.18   2:29.41   2:16.05   0:44.42
2x Intel Paxville DP Xeon:    6:26.112  3:20.149  8:09.85   7:06.52   5:55.29

On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall Bockrath-Vandegrift wrote:

Lee Spector lspe...@hampshire.edu writes: I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads.
Re: abysmal multicore performance, especially on AMD processors
I haven't analyzed your results in detail, but here are some results I had on my 2GHz 4-core Intel Core i7 MacBook Pro, vintage 2011. When running multiple threads within a single JVM invocation, I never got a speedup of even 2. The highest speedup I measured was 1.82, when I ran 8 threads using -XX:+UseParallelGC. I tried with -XX:+UseParNewGC but never got a speedup over 1.45 (with 4 threads in parallel -- it was lower with 8 threads). When running multiple invocations of lein2 run in parallel as separate processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 with 4 processes, and 5.34 with 8 processes (it went over 4 I think because of 2 hyperthreads per each of the 4 cores). This is a strong indication that the issue is some kind of interference between multiple threads in the same JVM, not the hardware, at least on my hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java 1.6.0_37). My first guess would be that even with -XX:+UseParallelGC or -XX:+UseParNewGC, there is either some kind of interference with garbage collection, or perhaps there is even some kind of interference between threads when allocating memory. Should JVM memory allocations be completely parallel with no synchronization when running multiple threads, or do memory allocations sometimes lock a shared data structure? Andy

On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:

[snip]
Re: abysmal multicore performance, especially on AMD processors
On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0
             value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))

Lee: I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a linked list 10,000 times, each time allocating 10,000 cons cells (or Clojure's equivalent of a cons cell)? For a total of around 100,000,000 memory allocations of small objects per call to burn? I know that this is just a bit of code for testing purposes, and not your real application, but do you really do that quantity of memory allocation versus arithmetic in your code (7*10,000 arithmetic operations for every 10,000*10,000 memory allocations)? I still believe there is an issue here with either the memory allocation or garbage collection implementations that can hopefully be improved with some command line options to the JVM that I don't know yet, but I know that I have seen much bigger speedups in code that didn't do this much memory allocation versus non-allocating computation (e.g. the Computer Language Benchmarks Game Clojure programs). Andy
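Andy's allocation estimate, spelled out (assuming, as the quoted 100,000,000 and 10,000*10,000 figures imply, that both constants in burn are 10,000):

```clojure
;; The loop conses up a 10,000-element list, and `iterate reverse` then
;; re-allocates the whole list on each of 10,000 reversals.
(def list-length 10000)
(def reversals 10000)
(def alloc-estimate (* list-length reversals))
;; alloc-estimate => 100000000, i.e. Andy's "around 100,000,000 memory
;; allocations of small objects per call to burn"
```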
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote:

I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a linked list 10,000 times, each time allocating 10,000 cons cells (or Clojure's equivalent of a cons cell)? For a total of around 100,000,000 memory allocations of small objects per call to burn? I know that this is just a bit of code for testing purposes, and not your real application, but do you really do that quantity of memory allocation versus arithmetic in your code (7*10,000 arithmetic operations for every 10,000*10,000 memory allocations)? I still believe there is an issue here with either the memory allocation or garbage collection implementations that can hopefully be improved with some command line options to the JVM that I don't know yet, but I know that I have seen much bigger speedups in code that didn't do this much memory allocation versus non-allocating computation (e.g. the Computer Language Benchmarks Game Clojure programs).

The numbers were guesses, but my real applications do have something of this flavor. Mostly I'm doing genetic programming, which involves generating, running, mutating and recombining programs. Population sizes are often in the thousands, with program sizes up to thousands of instructions/literals. Often the programs include list manipulation instructions, so there's lots of memory allocation involved in executing them (in addition to the allocation required to generate and manipulate them). -Lee
Re: abysmal multicore performance, especially on AMD processors
I'm glad somebody else can duplicate our findings! I get results similar to this on Intel hardware. On AMD hardware, the disparity is bigger, and multiple threads of a single JVM invocation on AMD hardware consistently gives me slowdowns as compared to a single thread. Also, your results are on MacOS and my results are on linux, so it begs the question: Is this generally true of Java, or is it something about clojure? On Saturday, December 8, 2012 3:31:25 PM UTC-5, Andy Fingerhut wrote: I haven't analyzed your results in detail, but here are some results I had on my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011. When running multiple threads within a single JVM invocation, I never got a speedup of even 2. The highest speedup I measured was 1.82 speedup when I ran 8 threads using -XX:+UseParallelGC. I tried with -XX:+UseParNewGC but never got a speedup over 1.45 (with 4 threads in parallel -- it was lower with 8 threads). When running multiple invocations of lein2 run in parallel as separate processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 with 4 processes, and 5.34 with 8 processes (it went over 4 I think because of 2 hyperthreads per each of the 4 cores). This is a strong indication that the issue is some kind of interference between multiple threads in the same JVM, not the hardware, at least on my hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java 1.6.0_37). My first guess would be that even with -XX:+UseParallelGC or -XX:+UseParNewGC, there is either some kind of interference with garbage collection, or perhaps there is even some kind of interference between them when allocating memory? Should JVM memory allocations be completely parallel with no synchronization when running multiple threads, or do memory allocations sometimes lock a shared data structure? Andy On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote: Hi guys - I'm the colleague Lee speaks of. 
Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, and found some very strange results, which I shall post here, after I explain a little function that Lee wrote that is designed to get improved results over pmap. It looks like this: (defn pmapall Like pmap but: 1) coll should be finite, 2) the returned sequence will not be lazy, 3) calls to f may occur in any order, to maximize multicore processor utilization, and 4) takes only one coll so far. [f coll] (let [agents (map agent coll)] (dorun (map #(send % f) agents)) (apply await agents) (doall (map deref agents Refer to Lee's first post for the benchmarking routine we're running. I figured that, in order to figure out if it was Java's multithreading that was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare ( doall( pmapall burn (range 8))) to running 8 concurrent copies of (burn (rand-int 8) or even just (burn 2) or 4 copies of ( doall( map burn (range 2))) or whatever. Does this make sense? I THINK it does. If it doesn't, then that's cool - just let me know why and I'll feel less crazy, because I am finding my results rather confounding. On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do ( doall( pmap burn (range 8))), 1:29 to do ( doall( map burn (range 8))), and 1:48 to do ( doall( pmapall burn (range 8))). So that's weird, because although we do see decreased slowdown from using pmapall, we still don't see a speedup compared to map. Watching processor utilization while these are going on shows that map is using one core, and both pmap and pmapall are using all four cores fully, as they should. So, maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck? Now here's the weird part: it takes around 29 seconds to do four concurrent copies of ( doall( map burn (range 2))), around 33 seconds to run 8 copies of (burn 2). Yes. Read that again. What? 
Watching top while this is going on shows what you would expect to see: when I run four concurrent copies, I've got four copies of Java using 100% of a core each, and when I run eight concurrent copies, I see eight copies of Java, all using around 50% of a core each. Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))) and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))). What is going on here? Is Java's multithreading really THAT bad? This appears to me to prove that Java, or Clojure, has something very seriously wrong with it, or has outrageous amounts of overhead when spawning a new thread. No?

All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1. (I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all.) Java(TM) SE
Re: abysmal multicore performance, especially on AMD processors
Just tried my first foray into reducers, but I must not be understanding something correctly:

(time (r/map burn (doall (range 4))))

returns in less than a second on my MacBook Pro, whereas

(time (doall (map burn (range 4))))

takes nearly a minute. This feels like unforced laziness (although it's not quite that fast), but clojure.core.reducers/map involves no laziness, right?

Yep, reducers don't use lazy seqs. But they return just something like transformed functions that will be applied when building the collection. So you can use them like this:

(into [] (r/map burn (doall (range 4))))

See http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more info...

Marek.

-- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
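[Editorial sketch, not from the thread: note that (into [] (r/map f coll)) applies f eagerly but on a single thread; in the reducers library, parallelism comes from r/fold over a foldable source such as a vector.]

```clojure
(require '[clojure.core.reducers :as r])

;; r/map builds a reducible "recipe"; r/fold partitions a vector and
;; combines sub-results in parallel via fork/join. With a plain reduce
;; (which is what (into [] ...) does), the same recipe runs serially.
(r/fold + (r/map inc (vec (range 1000))))
;=> 500500
```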
Re: abysmal multicore performance, especially on AMD processors
Lee: So I ran On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. But a colleague recently did some tests and the results are really quite alarming. On intel machines we're seeing speedups but much less than I expected -- about a 2x speedup going from 1 to 8 cores. But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost twice as long on 8 cores as on 1. I'm baffled, and unhappy that my runs are probably going slower on 48-core and 64-core nodes than on single-core nodes. It's possible that I'm just doing something wrong in the way that I dispatch the tasks, or that I've missed some Clojure or JVM setting... but right now I'm mystified and would really appreciate some help. I'm aware that there's overhead for multicore distribution and that one can expect slowdowns if the computations that are being distributed are fast relative to the dispatch overhead, but this should not be the case here. We're distributing computations that take seconds or minutes, and not huge numbers of them (at least in our tests while trying to figure out what's going on). I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads. But that also shouldn't be a big issue here because for this test all of the threads are doing the exact same computation. And I also tried using an agent-based dispatch approach that shouldn't have the pmap thread allocation issue, and the results were about the same. 
Note also that all of the computations in this test are purely functional and independent -- there shouldn't be any resource contention issues. The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0 value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))

Then I have a main function like this:

(defn -main [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))

We run it with lein run (we've tried both Leiningen 1.7.1 and 2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We also tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with -Xmx and -Xms settings) and also with and without -XX:+UseParallelGC. None of this seems to change the picture substantially. The results that we get generally look like this:

- On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code above, it takes about 45 seconds (and all cores appear to be fully loaded as it does so). If we change the pmap to just plain map, so that we use only a single core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8 cores is just about 2x, even though there are 8 completely independent tasks. So that's pretty depressing.

- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running the same test (with pmap) takes about 4 minutes and 2 seconds. That's really slow! Changing the pmap to map here produces a runtime of about 2 minutes and 20 seconds. So it's quite a bit faster on one core than on 8! And all of these times are terrible compared to those on the intel.
Another strange observation is that we can run multiple instances of the test on the same machine and (up to some limit, presumably) they don't seem to slow each other down, even though just one instance of the test appears to be maxing out all of the CPU according to top. I suppose that means that top isn't telling me what I thought -- my colleague says it can mean that something is blocked in some way with a full instruction queue. But I'm not interested in running multiple instances. I have single computations that involve multiple expensive but independent subcomputations, and I want to farm those subcomputations out to multiple cores -- and get speedups as a result. My subcomputations are so completely independent that I think I should be able to get speedups approaching a factor of n for n cores, but what I see is a factor of only about 2 on intel machines, and a bizarre factor of about 1/2 on AMD machines.
Re: abysmal multicore performance, especially on AMD processors
Lee: I ran Linux perf and also watched the run queue (with vmstat), and your bottleneck is basically memory access: the CPUs are idle 80% of the time by stalled cycles. Here's what I got on my machine:

Intel Core i7, 4 cores with Hyper-Threading (8 virtual processors), 16 GiB of memory, Oracle JVM.

This is how I ran perf. (Note: after monitoring the GC, even a 1 GiB heap is a lot of memory for your benchmark; it mostly used a lot less than 256 MiB.) Also, I ran the (time (doall ...)) three times to make sure that the JVM was warmed up. :)

perf stat java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar

Elapsed time: 49194.258825 msecs
Warm-up 1 (1 1 1 1 1 1 1 1)
Elapsed time: 48200.221677 msecs
Warm-up 2 (1 1 1 1 1 1 1 1)
Elapsed time: 50050.013156 msecs
Run (1 1 1 1 1 1 1 1)

Performance counter stats for 'java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar':

      1094967.286876 task-clock                #    7.383 CPUs utilized
             170,384 context-switches          #    0.000 M/sec
               2,932 CPU-migrations            #    0.000 M/sec
              24,734 page-faults               #    0.000 M/sec
   3,004,827,628,531 cycles                    #    2.744 GHz                    [83.34%]
   2,430,611,924,472 stalled-cycles-frontend   #   80.89% frontend cycles idle   [83.34%]
   1,976,836,501,358 stalled-cycles-backend    #   65.79% backend cycles idle    [66.67%]
     650,868,504,974 instructions              #    0.22  insns per cycle
                                               #    3.73  stalled cycles per insn [83.33%]
     124,479,033,687 branches                  #  113.683 M/sec                  [83.33%]
       1,753,021,734 branch-misses             #    1.41% of all branches        [83.33%]

       148.307812402 seconds time elapsed

Now if you run vmstat 1 while running your benchmark, you'll notice that the run queue will be at 8 most of the time, meaning that 8 processes are waiting for CPU, and this is due to memory accesses (in this case; it is not true for all applications). So I feel your benchmark may be artificial and does not truly represent your real application. Make sure to profile your real application, and optimize according to the bottlenecks.
There are really useful tools out there for profiling, such as VisualVM and perf. Good luck, Carlos

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. ...
Re: abysmal multicore performance, especially on AMD processors
Correction regarding the run-queue: this is not completely correct, :S. But the stalled cycles and memory accesses still hold. Sorry for the misinformation.

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote: I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. ...
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote: Yep, reducers don't use lazy seqs. But they return just something like transformed functions that will be applied when building the collection. So you can use them like this: (into [] (r/map burn (doall (range 4)))) See http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more info...

Thanks Marek. This does fix the too-quick-to-be-true issue, but alas, on my Mac laptop (a 4-core Intel i7):

57522.869 msecs for (time (into [] (r/map burn (doall (range 4)))))
58263.312 msecs for (time (doall (map burn (doall (range 4)))))

So while I'm not getting a terrible slowdown from using the reducers version of map, I'm also not getting any speedup over the single-thread map. We should try this on our other architectures too, but it doesn't look promising.

-Lee

-- Lee Spector, Professor of Computer Science Cognitive Science, Hampshire College 893 West Street, Amherst, MA 01002-3359 lspec...@hampshire.edu, http://hampshire.edu/lspector/ Phone: 413-559-5352, Fax: 413-559-5438
Re: abysmal multicore performance, especially on AMD processors
On Dec 8, 2012, at 10:19 PM, meteorfox wrote: Now if you run vmstat 1 while running your benchmark you'll notice that the run queue will be most of the time at 8, meaning that 8 processes are waiting for CPU, and this is due to memory accesses (in this case, since this is not true for all applications). So, I feel your benchmark may be artificial and does not truly represent your real application, make sure to profile your real application, and optimize according to the bottlenecks. There are really useful tools out there for profiling, such as VisualVM, perf. Thanks Carlos. I don't actually think that the benchmark is particularly artificial. It's very difficult to profile my actual application because it uses random numbers all over the place and is highly and nonlinearly variable in lots of ways. But I think that the benchmark I'm running really is pretty representative. In any event, WHY would all of that waiting be happening? Logically, nothing should have to be waiting for anything else. We know from the fact that we get good speedups from multiple simultaneous JVM runs, each just running one call to my burn function, that the hardware is capable of performing well with multiple instances of this benchmark running concurrently. It MUST be able to handle all of the memory allocation and everything else. It's just when we try to launch them all in parallel from the same Clojure process, using pmap OR agents OR reducers, that we fail to get the concurrency speedups. Why should this be? And what can be done about it? I know nearly nothing about the internals of the JVM, but is there perhaps a bottleneck on memory allocation because there's a single serial allocator? Perhaps when we run multiple JVMs each has its own allocator so we don't have the bottleneck? If all of this makes sense and is true, then perhaps (wishful thinking) there's a JVM option like use parallel memory allocators that will fix it?! 
Thanks, -Lee
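[Editorial note on Lee's "parallel memory allocators" question, not from the thread: HotSpot already gives each thread its own thread-local allocation buffer (TLAB), so ordinary object allocation is not serialized across threads. A hypothetical project.clj fragment for experimenting with the related flags might look like this; the flag names are standard HotSpot options, but whether they help here is an open question.]

```clojure
;; Hypothetical project.clj excerpt for investigating allocation behavior.
:jvm-opts ["-Xmx1g" "-Xms1g"
           "-XX:+UseTLAB"        ; thread-local allocation buffers (on by default)
           "-XX:+UseParallelGC"  ; parallel young-generation collector
           "-XX:+PrintTLAB"]     ; log TLAB sizing/refills to look for pressure
```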
Re: abysmal multicore performance, especially on AMD processors
Interesting problem. The slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument); calling conj in a multi-threaded environment seems to have a significant performance impact when using lists. I created some alternate reverse implementations (the fastest uses a vector and cons); the gist with the test code can be found at https://gist.github.com/4243724

On a dual Xeon E5520 (8 physical cores) @ 2.27GHz with Linux running OpenJDK 1.7.0_09 I got the following results:

fast-reverse    : map-ms:  3.3, pmap-ms:  0.7, speedup 4.97
list-cons       : map-ms:  4.0, pmap-ms:  0.7, speedup 6.13
vec-conj        : map-ms:  4.0, pmap-ms:  1.3, speedup 3.10
list-conj       : map-ms: 10.8, pmap-ms: 21.2, speedup 0.51
clojure-reverse : map-ms: 13.5, pmap-ms: 26.8, speedup 0.50  (equivalent to the original code)

The following JDK command line options were used: -Xmx20G -XX:MaxPermSize=8G -XX:+UseParallelGC -XX:+UseParallelOldGC

Some other notes: I ran the sample under YourKit and garbage collection represents a small percentage of execution time in both single and multi-threaded tests; there are no blocked threads for the duration of the test and there is no unexpected monitor/lock usage.

Cheers, Cameron.
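[Editorial sketch: Cameron's actual implementations are in the linked gist; as a hypothetical illustration of the vector-based idea, a reverse that avoids conj-ing element-by-element onto a list could look like this.]

```clojure
;; Hypothetical sketch (not Cameron's code): pour the sequence into a
;; vector once, then expose the reversal with rseq, which walks the
;; vector backwards without allocating a list cell per element.
(defn fast-reverse [coll]
  (rseq (vec coll)))

(fast-reverse '(1 2 3)) ;=> (3 2 1)
```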
Re: abysmal multicore performance, especially on AMD processors
I forgot to mention, I cut the number of reverse iterations down to 1000 (not 10000) so I wouldn't have to wait too long for criterium; the speedup numbers are representative of the full test though. Cameron.

On Sunday, December 9, 2012 6:26:16 PM UTC+11, cameron wrote: Interesting problem, the slowdown seems to be caused by the reverse call (actually the calls to conj with a list argument). ...
abysmal multicore performance, especially on AMD processors
I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. But a colleague recently did some tests and the results are really quite alarming. On intel machines we're seeing speedups but much less than I expected -- about a 2x speedup going from 1 to 8 cores. But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost twice as long on 8 cores as on 1. I'm baffled, and unhappy that my runs are probably going slower on 48-core and 64-core nodes than on single-core nodes. It's possible that I'm just doing something wrong in the way that I dispatch the tasks, or that I've missed some Clojure or JVM setting... but right now I'm mystified and would really appreciate some help. I'm aware that there's overhead for multicore distribution and that one can expect slowdowns if the computations that are being distributed are fast relative to the dispatch overhead, but this should not be the case here. We're distributing computations that take seconds or minutes, and not huge numbers of them (at least in our tests while trying to figure out what's going on). I'm also aware that the test that produced the data I give below, insofar as it uses pmap to do the distribution, may leave cores idle for a bit if some tasks take a lot longer than others, because of the way that pmap allocates cores to threads. But that also shouldn't be a big issue here because for this test all of the threads are doing the exact same computation. And I also tried using an agent-based dispatch approach that shouldn't have the pmap thread allocation issue, and the results were about the same. Note also that all of the computations in this test are purely functional and independent -- there shouldn't be any resource contention issues. 
The test: I wrote a time-consuming function that just does a bunch of math and list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0 value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))

Then I have a main function like this:

(defn -main [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))

We run it with lein run (we've tried both Leiningen 1.7.1 and 2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We also tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with -Xmx and -Xms settings) and also with and without -XX:+UseParallelGC. None of this seems to change the picture substantially. The results that we get generally look like this:

- On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code above, it takes about 45 seconds (and all cores appear to be fully loaded as it does so). If we change the pmap to just plain map, so that we use only a single core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8 cores is just about 2x, even though there are 8 completely independent tasks. So that's pretty depressing.

- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running the same test (with pmap) takes about 4 minutes and 2 seconds. That's really slow! Changing the pmap to map here produces a runtime of about 2 minutes and 20 seconds. So it's quite a bit faster on one core than on 8! And all of these times are terrible compared to those on the intel.

Another strange observation is that we can run multiple instances of the test on the same machine and (up to some limit, presumably) they don't seem to slow each other down, even though just one instance of the test appears to be maxing out all of the CPU according to top.
I suppose that means that top isn't telling me what I thought -- my colleague says it can mean that something is blocked in some way with a full instruction queue. But I'm not interested in running multiple instances. I have single computations that involve multiple expensive but independent subcomputations, and I want to farm those subcomputations out to multiple cores -- and get speedups as a result. My subcomputations are so completely independent that I think I should be able to get speedups approaching a factor of n for n cores, but what I see is a factor of only about 2 on intel machines, and a bizarre factor of about 1/2 on AMD machines. Any help would be greatly appreciated! Thanks, -Lee -- Lee Spector, Professor of Computer Science Cognitive Science, Hampshire College 893 West Street, Amherst, MA 01002-3359
Re: abysmal multicore performance, especially on AMD processors
Lee: I'll just give a brief description right now, but one thing I've found in the past on a 2-core machine that was achieving much less than 2x speedup was memory bandwidth being the limiting factor. Not all Clojure code allocates memory, but a lot does. If the hardware in a system can write at rate X from a multicore processor to main memory, and a single-threaded Clojure program writes to memory at rate 0.5*X, then the most speedup you will ever get out of multicore execution of the same code on N cores will be 2x, no matter how large N is. As one way to see if this is the problem, you could try changing your burn function so that instead of doing cons to build up a list result, first allocate a Java mutable array before the loop that is as large as you need it to be at the end, and write values into that. You can convert it to some other Clojure type at the end of the loop if you prefer. I have some C benchmark programs that test memory read and write bandwidth on single and multiple cores you can run on your Intel machine to see if that might be the issue. If this is the issue, I would expect to see at least a little speedup from 1 core to multiple cores, but capped at some maximum speedup that is determined by the memory bandwidth, not the number of cores you run in parallel. I don't currently have any guess about what might be happening with the AMD multicore machine. If you are interested in wild guessing, perhaps there could be some kind of multicore cache coherency protocol that is badly configured, causing cache lines to be frequently invalidated when multiple cores are sharing memory? That would make more sense if multiple cores were reading from and writing to the same cache lines, which doesn't seem terribly likely for a typical Clojure program. Let me know if you are interested and I will find those C programs for you to try out. I got them from somewhere on the Internet and may have tweaked them a little bit. 
Andy

On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: I've been running compute intensive (multi-day), highly parallelizable Clojure processes on high-core-count machines and blithely assuming that since I saw near maximal CPU utilization in top and the like that I was probably getting good speedups. ...
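[Editorial sketch of Andy's suggestion above -- writing into a preallocated Java array instead of consing up a list -- with hypothetical names, not code from the thread.]

```clojure
;; Hypothetical rewrite of the allocation-heavy loop: fill a preallocated
;; primitive double array in place, producing no per-iteration garbage,
;; then reduce over it. Useful for testing whether memory bandwidth /
;; allocation rate is the multicore bottleneck.
(defn burn-array [n]
  (let [a (double-array n)]               ; single up-front allocation
    (dotimes [i n]
      (aset a i (/ (double i) (inc i))))  ; mutate in place, no consing
    (areduce a i acc 0.0 (+ acc (aget a i)))))

(burn-array 2) ;=> 0.5
```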