Re: How does pmap partition its work?
David Liebke gave a talk at Clojure-Conj 2010 titled "From Concurrency to Parallelism" with detailed performance comparisons of map, pmap, and Fork/Join-style iteration. Look for it on clojure.blip.tv in the near future!

-Stuart Sierra
clojure.com

--
You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: How does pmap partition its work?
On Thu, Jan 27, 2011 at 5:05 PM, Rasmus Svensson wrote:
> 2011/1/27 Ken Wesson:
>> On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner wrote:
>>> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>>>> If you simply want all tasks to be performed as quickly as possible, one alternative could be to use an ExecutorService (perhaps one created with newFixedThreadPool) with its invokeAll method. invokeAll takes a collection of callables (in Clojure terms: you can pass it a seq of zero-arg functions) and returns a collection of futures. An ExecutorService could perhaps give you fine-grained control. I recently wrote a blog post on this; you might find it interesting: http://blog.raek.se/2011/01/24/executors-in-clojure/
>>>
>>> Thanks for the tip. By coincidence, I just stumbled across ExecutorService yesterday via the example at http://clojure.org/concurrent_programming. I'm never thrilled about having to use Java APIs directly, but in this case an ExecutorService does what I want much better than pmap, and isn't too difficult to use.
>>
>> Perhaps pmap should be rewritten to use ExecutorService, if that is so.
>
> Actually, 'pmap' is built on top of 'future', which is built on top of an ExecutorService.

Then how come they have different performance characteristics?
Re: How does pmap partition its work?
2011/1/27 Ken Wesson:
> On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner wrote:
>> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>>
>>> If you simply want all tasks to be performed as quickly as possible,
>>> one alternative could be to use an ExecutorService (perhaps one
>>> created with newFixedThreadPool) with its invokeAll method. invokeAll
>>> takes a collection of callables (in clojure terms: you can pass it a
>>> seq of zero-arg functions) and returns a collection of futures. An
>>> ExecutorService could perhaps give you fine-grained control.
>>>
>>> I recently wrote a blog post on this; you might find it interesting:
>>> http://blog.raek.se/2011/01/24/executors-in-clojure/
>>
>> Thanks for the tip. By coincidence, I just stumbled across ExecutorService
>> yesterday via the example at http://clojure.org/concurrent_programming. I'm
>> never thrilled about having to use Java APIs directly, but in this case an
>> ExecutorService does what I want much better than pmap, and isn't too
>> difficult to use.
>
> Perhaps pmap should be rewritten to use ExecutorService, if that is so.

Actually, 'pmap' is built on top of 'future', which is built on top of an ExecutorService.

// raek
Re: How does pmap partition its work?
On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner wrote:
> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>
>> If you simply want all tasks to be performed as quickly as possible,
>> one alternative could be to use an ExecutorService (perhaps one
>> created with newFixedThreadPool) with its invokeAll method. invokeAll
>> takes a collection of callables (in clojure terms: you can pass it a
>> seq of zero-arg functions) and returns a collection of futures. An
>> ExecutorService could perhaps give you fine-grained control.
>>
>> I recently wrote a blog post on this; you might find it interesting:
>> http://blog.raek.se/2011/01/24/executors-in-clojure/
>
> Thanks for the tip. By coincidence, I just stumbled across ExecutorService
> yesterday via the example at http://clojure.org/concurrent_programming. I'm
> never thrilled about having to use Java APIs directly, but in this case an
> ExecutorService does what I want much better than pmap, and isn't too
> difficult to use.

Perhaps pmap should be rewritten to use ExecutorService, if that is so.
Re: How does pmap partition its work?
On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
> If you simply want all tasks to be performed as quickly as possible,
> one alternative could be to use an ExecutorService (perhaps one
> created with newFixedThreadPool) with its invokeAll method. invokeAll
> takes a collection of callables (in clojure terms: you can pass it a
> seq of zero-arg functions) and returns a collection of futures. An
> ExecutorService could perhaps give you fine-grained control.
>
> I recently wrote a blog post on this; you might find it interesting:
> http://blog.raek.se/2011/01/24/executors-in-clojure/

Thanks for the tip. By coincidence, I just stumbled across ExecutorService yesterday via the example at http://clojure.org/concurrent_programming. I'm never thrilled about having to use Java APIs directly, but in this case an ExecutorService does what I want much better than pmap, and isn't too difficult to use.
Re: How does pmap partition its work?
2011/1/24 Michael Gardner:
> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The
> amount of CPU time these tasks require varies greatly; in particular, many of
> them will require virtually no work. Can I rely on pmap to divide the work
> efficiently even if there is some pattern to the distribution of easy tasks
> (e.g. clumped at the beginning, or every nth)? Assume it's not possible to
> identify the easy tasks beforehand.

If you simply want all tasks to be performed as quickly as possible, one alternative could be to use an ExecutorService (perhaps one created with newFixedThreadPool) with its invokeAll method. invokeAll takes a collection of callables (in Clojure terms: you can pass it a seq of zero-arg functions) and returns a collection of futures. An ExecutorService could perhaps give you fine-grained control.

I recently wrote a blog post on this; you might find it interesting:
http://blog.raek.se/2011/01/24/executors-in-clojure/

// raek
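A minimal sketch of the invokeAll approach described above (the name pool-map and the fixed pool size of n threads are my own choices, not from the thread). Clojure functions already implement java.util.concurrent.Callable, and Clojure seqs implement java.util.List, so both can be handed to the Java API directly:

```clojure
(import '(java.util.concurrent Executors))

;; pool-map is a hypothetical helper: run (f x) for each x on a
;; fixed pool of n threads, returning results in input order.
(defn pool-map [n f coll]
  (let [pool  (Executors/newFixedThreadPool n)
        ;; one zero-arg Callable per input element
        tasks (map (fn [x] (fn [] (f x))) coll)]
    (try
      ;; invokeAll blocks until every task completes, and returns the
      ;; futures in task order, so the .get calls cannot block for long
      (doall (map #(.get %) (.invokeAll pool tasks)))
      (finally (.shutdown pool)))))

;; (pool-map 4 inc (range 5)) ;=> (1 2 3 4 5)
```

Because each idle worker pulls the next task as soon as it finishes, a clump of easy tasks at the front of the input cannot leave any thread starved the way a fixed up-front partitioning could.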
Re: How does pmap partition its work?
On Jan 25, 2011, at 12:13 PM, Andy Fingerhut wrote:
> In my original message describing pmap's behavior, there was a little
> "gotcha" near the end:
>
> "Note: Sometimes working at odds with pmap's "Don't work too far ahead"
> approach is if the input sequence to pmap is chunked. When a chunk is
> reached, all elements in that chunk have threads start in parallel, so the
> number of parallel threads can easily exceed (availableProcessors+2) during
> those times."
>
> (range n) produces heavily chunked sequences. Chunks are an optimization on
> time and memory when representing large sequences, but they do cause pmap as
> written today to suddenly fire off all futures in the chunk as parallel
> threads when it reaches the chunk.

Interesting! Sorry I missed this gotcha from your original email; I hadn't run across this problem at the time and thus the info didn't stick in my head. Thanks for the link to modified-pmap, too.

This seems like undesirable behavior from pmap to me (and/or unwanted chunking-induced behavior[1]). At the least I would like to see pmap adopt an optional extra argument that works like your modified-pmap's num-threads argument, and possibly also the arrival of Clojure's promised de-chunking interface.

[1] http://disclojure.org/documents/clojure-1-1-0-rc1-release-notes/ (section 2.3: "Please share any use cases where chunked processing has resulted in behavioral differences that matter to you on the Clojure google group.")
Re: How does pmap partition its work?
In my original message describing pmap's behavior, there was a little "gotcha" near the end:

"Note: Sometimes working at odds with pmap's "Don't work too far ahead" approach is if the input sequence to pmap is chunked. When a chunk is reached, all elements in that chunk have threads start in parallel, so the number of parallel threads can easily exceed (availableProcessors+2) during those times."

(range n) produces heavily chunked sequences. Chunks are an optimization on time and memory when representing large sequences, but they do cause pmap as written today to suddenly fire off all futures in the chunk as parallel threads when it reaches the chunk.

A modified version of pmap called (boringly) "modified-pmap" that does not behave this way when reaching a chunk in its input sequence is part of one of my submissions to the shootout web site, because I did not want the extra parallelism:

http://shootout.alioth.debian.org/u32/program.php?test=knucleotide&lang=clojure&id=2

It is nearly identical to the original pmap. Comparing to the source of the original is the easiest way to see the changes, if you are curious.

Andy

On Jan 25, 2011, at 6:45 AM, Michael Gardner wrote:
> I have run across something else I don't understand about pmap. Why does the following:
>
> (pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))
>
> result in all 32 "sleep" processes being run at once? I thought pmap used n+2 threads, where n is the number of processors/cores available (I have two cores).
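The chunking Andy describes is easy to observe at the REPL, and a one-element-at-a-time wrapper defeats it. The unchunk helper below is a commonly used idiom, not part of clojure.core; it rebuilds the input as an ordinary (non-chunked) lazy seq:

```clojure
;; (range n) hands consumers elements in 32-element chunks:
(chunked-seq? (seq (range 100)))        ; true

;; unchunk: rebuild a seq so it is realized one element at a time.
;; (Common idiom from the community; not in clojure.core.)
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

(chunked-seq? (unchunk (range 100)))    ; false
;; (pmap f (unchunk coll)) should then stay within pmap's
;; (availableProcessors + 2) look-ahead window.
```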
Re: How does pmap partition its work?
On Jan 25, 2011, at 9:33 AM, Ken Wesson wrote:
> Well, that's weird, because the documentation *I* read says it
> composites the arguments together into a command line and hands off to
> Runtime/exec. And the documentation of *that* says it returns a
> Process object and the process it launches runs asynchronously.

That doesn't necessarily contradict what I observed. The way I read the docs[1] is that sh does indeed use Runtime.exec(), but then waits for the sub-process to complete and returns a map of its exit status, etc. That's not explicitly stated, but it's the only explanation consistent with what I observe. For comparison, what behavior do you observe with (do (clojure.java.shell/sh "sleep" "10") nil)?

> If that were the case, though, your pmap should indeed only be running
> cores+2 instances at once. Which is not what you observed.
>
> Something's hinky here.

Indeed. According to a previous thread[2], something similar happens with Thread/sleep; my guess is that sh sleeps (or something similar) while it waits for the sub-process to finish and thus suffers from the same issue.

[1] http://clojure.github.com/clojure/clojure.java.shell-api.html#clojure.java.shell/sh
[2] http://groups.google.com/group/clojure/browse_thread/thread/9b3581d9f4b4afd6
Re: How does pmap partition its work?
On Tue, Jan 25, 2011 at 10:21 AM, Michael Gardner wrote:
> On Jan 25, 2011, at 9:06 AM, Ken Wesson wrote:
>
>> sh is asynchronous. It calls Runtime/exec, which launches the sleep as
>> a separate process and immediately returns a Process object (which
>> your pmap should be returning a seq of). It may produce n+2 Process
>> objects at a time but it produces them all before the first of the
>> asynchronous sleep processes finishes, so for a while all of the
>> latter are running at the same time.
>
> Really? The documentation for sh claims to return a map, which is what I
> observe.

Well, that's weird, because the documentation *I* read says it composites the arguments together into a command line and hands off to Runtime/exec. And the documentation of *that* says it returns a Process object and the process it launches runs asynchronously.

> And the call to sh doesn't return until the sub-process has finished

If that were the case, though, your pmap should indeed only be running cores+2 instances at once. Which is not what you observed.

Something's hinky here.
Re: How does pmap partition its work?
On Jan 25, 2011, at 9:06 AM, Ken Wesson wrote:
> sh is asynchronous. It calls Runtime/exec, which launches the sleep as
> a separate process and immediately returns a Process object (which
> your pmap should be returning a seq of). It may produce n+2 Process
> objects at a time but it produces them all before the first of the
> asynchronous sleep processes finishes, so for a while all of the
> latter are running at the same time.

Really? The documentation for sh claims to return a map, which is what I observe. And the call to sh doesn't return until the sub-process has finished, even if I discard its return value (so it can't be the case that the resulting "map" is actually some lazy thing that blocks when I try to access its contents).
Re: How does pmap partition its work?
On Tue, Jan 25, 2011 at 9:45 AM, Michael Gardner wrote:
> I have run across something else I don't understand about pmap. Why does the following:
>
> (pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))
>
> result in all 32 "sleep" processes being run at once? I thought pmap used n+2
> threads, where n is the number of processors/cores available (I have two
> cores).

sh is asynchronous. It calls Runtime/exec, which launches the sleep as a separate process and immediately returns a Process object (which your pmap should be returning a seq of). It may produce n+2 Process objects at a time but it produces them all before the first of the asynchronous sleep processes finishes, so for a while all of the latter are running at the same time.
Re: How does pmap partition its work?
I have run across something else I don't understand about pmap. Why does the following:

(pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))

result in all 32 "sleep" processes being run at once? I thought pmap used n+2 threads, where n is the number of processors/cores available (I have two cores).
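One way to see how many tasks pmap actually has in flight at once is to instrument the job with counters. This is my own sketch (the names running, peak, and job are invented for it), not code from the thread:

```clojure
;; Hypothetical instrumentation: track the high-water mark of
;; concurrently running jobs under pmap.
(def running (atom 0))
(def peak    (atom 0))

(defn job [_]
  ;; swap! returns the new value, so this records the current
  ;; concurrency level into peak if it's a new maximum
  (swap! peak max (swap! running inc))
  (Thread/sleep 100)                   ; stand-in for the slow task
  (swap! running dec))

(doall (pmap job (range 32)))
;; On a chunked input like (range 32), @peak can reach 32 rather than
;; the (availableProcessors + 2) the pmap docstring suggests.
```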
Re: How does pmap partition its work?
On Mon, Jan 24, 2011 at 11:05 AM, Armando Blancas wrote:
>> This is much faster than either of the other eager-pmaps I posted to
>> this thread, and yet it's only using 60% CPU on my dual-core box. That
>> means there's still another x1.6 or so speedup possible(!) but I'm not
>> sure how.
>
> Could -server make a difference here?

I did some more investigating of this. It seems that pmap itself only uses ~70% CPU on my machine. The same happens with eager-pmap and with the jumbled-pmap I just posted. All produce results comparably fast when given a giant math problem to chug on.

Probably it's an artifact of the test cases mainly using jobs that either are fast or use Thread/sleep to be artificially slow. I'm not sure why this would persist when the unit of parallelization is a map of the job over a sizable partition, or the job has a busy-loop in it instead of a Thread/sleep to make it heavyweight, though.
Re: How does pmap partition its work?
On Mon, Jan 24, 2011 at 10:29 AM, Michael Gardner wrote:
> Thanks for the work, Ken.

You're welcome.

> Clojure's multi-threaded performance can be mysterious indeed

I think that may be generally true of multi-processing of every kind. :)

> Anyway, for now I will probably just stick with pmap + shuffle, since it's dead-simple
> and should be enough for my purposes.

Probably a good idea, if keeping the input in order is not important. If you do need to keep the ordering, you can still do the items in a somewhat perturbed order, if you're willing to give up some laziness:

(defn permute
  "Applies a permutation to a vector. Permutation should be a vector of
  the same length with (range n) in some shuffled order. If v is shorter
  than perm, just returns v."
  [perm v]
  (if (< (count v) (count perm))
    v
    (vec (map #(nth v %) perm))))

(defn i-perm
  "Inverts a permutation."
  [perm]
  (let [r (range (count perm))]
    (persistent!
      (reduce (fn [out [pi ix]] (assoc! out pi ix))
              (transient (vec r))
              (map vector perm r)))))

(defn shuffle
  "Shuffles a vector"
  [v]
  (let [s (count v)]
    (loop [out (transient v)
           n (dec s)]
      (if (zero? n)
        (persistent! out)
        (let [i (rand-int (inc n))
              t (out n)]
          (recur (assoc! (assoc! out n (out i)) i t) (dec n)))))))

Test that shuffle works:

user=> (frequencies (map shuffle (take 500 (repeat [1 2 3]))))
{[3 1 2] 88, [2 3 1] 91, [1 2 3] 83, [2 1 3] 84, [3 2 1] 82, [1 3 2] 72}

All six possible permutations occur, at reasonably similar rates.

(defn jumbleize
  "Returns a pair of functions, one of which will jumble and one of which
  will unjumble a seq; the seq is chunked into parts of size n and those
  parts are jumbled. The functions are lazy in that only one chunk is
  realized at a time."
  [n]
  (let [rn (vec (range n))
        perms (repeatedly #(shuffle rn))
        ips (map i-perm perms)
        jumble (fn [s perm-seq]
                 (mapcat permute perm-seq (map vec (partition n n [] s))))]
    [#(jumble % perms) #(jumble % ips)]))

(defn jumbled-pmap
  "Like pmap, only the order in which it sees sequence elements is jumbled
  somewhat, which may improve efficiency if easy and difficult jobs have a
  regular arrangement in the input. Chunks the input by chunk-size and
  jumbles each chunk internally before calling pmap; puts the output back
  in order."
  [chunk-size f & colls]
  (let [[jumbler unjumbler] (jumbleize chunk-size)]
    (unjumbler (apply pmap f (map jumbler colls)))))

user=> (drop 97 (jumbled-pmap 16 #(* % %) (range 100)))
(9409 9604 9801)
user=> (take 3 (drop 50 (jumbled-pmap 16 #(* % %) (range 100))))
(2500 2601 2704)
user=> (take 3 (drop 50 (jumbled-pmap 16 #(+ (* 1000 %1) %2) (range 100) (range 100 180))))
(50150 51151 52152)

That last, BTW, shows it working properly even with multiple input colls of different lengths.

user=> (take 3 (drop 50 (jumbled-pmap 16 #(+ (* 1000 %1) %2) (iterate inc 0) (drop 100 (iterate inc 0)))))
(50150 51151 52152)

And this shows that it's still lazy enough not to blow up on infinite inputs. :) The internal sequences of random permutations and their inverses, as well as the input sequences, lose their heads:

user=> (take 3 (drop 5000 (jumbled-pmap 16 #(+ (* 1000 %1) %2) (iterate inc 0) (drop 100 (iterate inc 0)))))
(5005100 5006101 5007102)

Memory use stayed constant at a bit more than 64MB, which was probably the -Xms parameter to the JVM instance Enclojure ran its REPL in.
Re: How does pmap partition its work?
> This is much faster than either of the other eager-pmaps I posted to
> this thread, and yet it's only using 60% CPU on my dual-core box. That
> means there's still another x1.6 or so speedup possible(!) but I'm not
> sure how.

Could -server make a difference here?
Re: How does pmap partition its work?
On Jan 23, 2011, at 10:56 PM, Ken Wesson wrote:
> I've managed to make this more efficient, by making each agent send
> process a larger portion of the job than one single item:
>
> (defn eager-pmap [f & colls]
>   (let [cores (.. Runtime getRuntime availableProcessors)
>         agents (cycle (for [_ (range cores)] (agent nil)))
>         ccolls (map (partial partition 16 16 []) colls)
>         promises (apply map (fn [& _] (promise)) ccolls)]
>     (doall
>       (apply map (fn [a p & chunks]
>                    (send a
>                          (fn [_]
>                            (deliver p (apply map f chunks)))))
>              agents promises ccolls))
>     (mapcat deref promises)))
>
> This is much faster than either of the other eager-pmaps I posted to
> this thread, and yet it's only using 60% CPU on my dual-core box. That
> means there's still another x1.6 or so speedup possible(!) but I'm not
> sure how.
>
> Raising the size of the partitions from 16 all the way to 1024 makes
> no noticeable difference.

Thanks for the work, Ken. Clojure's multi-threaded performance can be mysterious indeed, which is a shame when one of the major benefits of functional code is that it's easily parallelizable. Anyway, for now I will probably just stick with pmap + shuffle, since it's dead-simple and should be enough for my purposes.
Re: How does pmap partition its work?
On Sun, Jan 23, 2011 at 11:34 PM, Ken Wesson wrote:
> Other posts to the thread indicate that longer-range patterns in the
> inputs could cause problems. If you know you'll be consuming the full
> sequence, try this:
>
> (defn eager-pmap [f & colls]
>   (map deref (doall (apply map #(future (f %)) colls))))
>
> This creates all of the futures right away (due to the doall) and
> leaves it up to the future implementation to distribute work around
> some thread pool. On my system, it creates 80 or so of them each time
> it's run on a seq of 10 things, which persist for some time
> afterward -- so, a bit of a problem there. Another one that seems like
> it ought to be more efficient:
>
> (defn eager-pmap [f & colls]
>   (let [cores (.. Runtime getRuntime availableProcessors)
>         agents (cycle (for [_ (range cores)] (agent nil)))
>         promises (apply map (fn [& _] (promise)) colls)]
>     (doall
>       (apply map (fn [a p & args]
>                    (send a
>                          (fn [_] (deliver p (apply f args)))))
>              agents promises colls))
>     (map deref promises)))
>
> This one uses the agent "send" thread pool to divide up the work.
> Oddly, it's actually 2-3x slower than the previous and only uses 75%
> CPU on a dual-core machine, though it doesn't leak threads.

I've managed to make this more efficient, by making each agent send process a larger portion of the job than one single item:

(defn eager-pmap [f & colls]
  (let [cores (.. Runtime getRuntime availableProcessors)
        agents (cycle (for [_ (range cores)] (agent nil)))
        ccolls (map (partial partition 16 16 []) colls)
        promises (apply map (fn [& _] (promise)) ccolls)]
    (doall
      (apply map (fn [a p & chunks]
                   (send a
                         (fn [_]
                           (deliver p (apply map f chunks)))))
             agents promises ccolls))
    (mapcat deref promises)))

This is much faster than either of the other eager-pmaps I posted to this thread, and yet it's only using 60% CPU on my dual-core box. That means there's still another x1.6 or so speedup possible(!) but I'm not sure how.

Raising the size of the partitions from 16 all the way to 1024 makes no noticeable difference.
Re: How does pmap partition its work?
Other posts to the thread indicate that longer-range patterns in the inputs could cause problems. If you know you'll be consuming the full sequence, try this:

(defn eager-pmap [f & colls]
  (map deref (doall (apply map #(future (f %)) colls))))

This creates all of the futures right away (due to the doall) and leaves it up to the future implementation to distribute work around some thread pool. On my system, it creates 80 or so of them each time it's run on a seq of 10 things, which persist for some time afterward -- so, a bit of a problem there. Another one that seems like it ought to be more efficient:

(defn eager-pmap [f & colls]
  (let [cores (.. Runtime getRuntime availableProcessors)
        agents (cycle (for [_ (range cores)] (agent nil)))
        promises (apply map (fn [& _] (promise)) colls)]
    (doall
      (apply map (fn [a p & args]
                   (send a
                         (fn [_] (deliver p (apply f args)))))
             agents promises colls))
    (map deref promises)))

This one uses the agent "send" thread pool to divide up the work. Oddly, it's actually 2-3x slower than the previous and only uses 75% CPU on a dual-core machine, though it doesn't leak threads. It looks like agents, or promise/deliver, or both have lots of overhead compared to futures. (Which is odd, since the obvious implementation of future is

(defmacro future [& body]
  `(let [p# (promise)]
     (.start (Thread. (fn [] (deliver p# (do ~@body)))))
     p#))

So maybe it's agents that are causing the inefficiency? On the other hand, the result of evaluating (future foo) doesn't appear to be a naked promise, though it could still be a wrapped one, and that implementation would not limit itself to 80-odd threads; maybe future instead uses the agent "send-off" pool?!)

I'd recommend just using the first version of eager-pmap in this email if you need an eager-pmap. The extra threads it creates are not too numerous (it doesn't create 100,000 threads for an input seq of 100,000 items) and they *do* seem to disappear again *eventually* (it may take several minutes; it takes much less than 1 hour; it takes more than 30 seconds; can't be more precise with the data I've gathered thus far).
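Ken's closing guess appears to be on the right track: in the Clojure of this era, future-call submits work to clojure.lang.Agent/soloExecutor, the same unbounded cached thread pool that backs send-off, which would explain both the burst of extra threads and their eventual (keep-alive-timeout) disappearance. A sketch for observing this (thread-names is my own name for the helper, not anything from the thread):

```clojure
;; Sketch: launch a burst of n futures that all sleep, and count the
;; distinct pool threads they ran on. Because every future blocks and
;; no thread frees up for reuse, an unbounded cached pool will spawn
;; roughly one thread per future -- far more than procs + 2.
(defn thread-names [n]
  (let [names (atom #{})
        fs    (doall
                (for [_ (range n)]
                  (future
                    (swap! names conj (.getName (Thread/currentThread)))
                    (Thread/sleep 100))))]
    (doseq [f fs] @f)   ; wait for them all
    (count @names)))

;; (thread-names 64) typically returns a number close to 64.
```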
Re: How does pmap partition its work?
On Sun, Jan 23, 2011 at 10:31 PM, Ken Wesson wrote:
> On Sun, Jan 23, 2011 at 8:56 PM, Michael Gardner wrote:
>> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The
>> amount of CPU time these tasks require varies greatly; in particular, many
>> of them will require virtually no work. Can I rely on pmap to divide the
>> work efficiently even if there is some pattern to the distribution of easy
>> tasks (e.g. clumped at the beginning, or every nth)? Assume it's not
>> possible to identify the easy tasks beforehand.
>
> Here's a little data:
> user=> (def printer (agent nil))
> #'user/printer
> user=> (defn xyz [n] (send printer (fn [_] println
> (Thread/currentThread))) (* n n))

Eh, that's weird. That's not the definition I used. I used

(defn xyz [n]
  (let [t (Thread/currentThread)]
    (send printer (fn [_] (println t)))
    (* n n)))

of course, so it would be the pmap thread, not the agent thread, in the printout. I had been experimenting with investigating the agent thread pools too, so the defn must have come from that part of the REPL session. This data:

> [twenty #<Thread ...> printouts]

was generated using the latter defn (the one that captures Thread/currentThread outside the send).
Re: How does pmap partition its work?
On Sun, Jan 23, 2011 at 8:56 PM, Michael Gardner wrote:
> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The
> amount of CPU time these tasks require varies greatly; in particular, many of
> them will require virtually no work. Can I rely on pmap to divide the work
> efficiently even if there is some pattern to the distribution of easy tasks
> (e.g. clumped at the beginning, or every nth)? Assume it's not possible to
> identify the easy tasks beforehand.

Here's a little data:

user=> (def printer (agent nil))
#'user/printer
user=> (defn xyz [n] (send printer (fn [_] println (Thread/currentThread))) (* n n))
#'user/xyz
user=> (xyz 3)
#
9
user=> (pmap xyz (range 20))
[twenty #<Thread ...> printouts, mangled by the archive]
(0 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361)
user=>

The fiddling with an agent is to prevent individual println outputs from getting intermingled; the lines may come out slightly out of order, but at least they are not mangled into unreadability. As you can see, at least seven threads are used in a somewhat random order -- not the simple alternating pattern of two you might naively expect on a dual-core machine. It seems to prefer thread 22 for some reason. I don't know exactly what is going on under the hood, but it looks like it either involves some degree of randomness, or it tries to load-balance (and something else was sometimes preempting threads -- I don't know what; nothing on my system was doing anything CPU-intensive at a priority equal to or higher than the REPL process, but NetBeans and the REPL were both a bit sluggish; some sort of Windows weirdness). Either way, it's likely that a mix of easy and hard jobs, even with a pattern, won't skew things too much.
That can be further tested:

user=> (defn heavy [f amt] (fn [x] (Thread/sleep amt) (f x)))
#'user/heavy
user=> (defn s [x] (* x x))
#'user/s
user=> (def l (heavy s 1000))
#'user/l
user=> (defn run [fseq n] (time (doall (pmap #(%1 %2) fseq (range n)))) nil)
#'user/run
user=> (run (cycle [s l]) 100)
"Elapsed time: 17141.64872 msecs"
nil
user=> (run (cycle [s s l l]) 100)
"Elapsed time: 17020.59136 msecs"
nil
user=> (run (cycle [s s l s l l]) 100)
"Elapsed time: 17007.0366 msecs"
nil
user=> (defn randomize [s] (let [n (count s)] (repeatedly #(nth s (rand-int n)))))
#'user/randomize
user=> (run (randomize [s l]) 100)
"Elapsed time: 18015.8562 msecs"
nil

As you can see, the timings are fairly consistent when a 50/50 mix of short (<1 msec) and long (>1000 msec) tasks is parallelized using pmap, whether their distribution follows one of several patterns or no pattern at all. This was again on a dual-core machine; it's interesting that the numbers are around 17-18 s instead of 25 s (100 jobs averaging 500 ms each, split between two cores). Presumably this is because the thread pool uses more than two threads: Thread/sleep lets another of the threads make progress. It didn't collapse down to just 1 s either, so pmap is also not spawning as many threads as would maximize throughput (fifty running the slow jobs, plus two running the fast ones on both cores while the slow ones sleep). Since the jobs would cumulatively take 50 s if run serially, the speedup factor from pmap here is just under 3.

One more data point:

user=> (run (repeat l) 50)
"Elapsed time: 10172.37124 msecs"
nil

That's fifty long jobs without any short ones. The speedup factor here is 5 rather than 3. This shows that pmap will allow five jobs, but no more, to run concurrently, and that it doesn't load-balance.
If it did load-balance, adding the short jobs wouldn't slow it down by more than a few milliseconds; in fact doing so almost doubles the time, so the jobs seem to be preallocated in some manner, with the short jobs taking up half the available threads in the pool. (This is odd given that it seemed before to be using at least seven threads, not just five.)

Of course, real jobs are likely to be CPU-intensive rather than yielding like Thread/sleep:

user=> (def foo (atom nil))
#'user/foo
user=> (defn l2 [x] (doseq [a (range 750)] (reset! foo (* a x))) (* x x))
#'user/l2

The number 750 makes l2 take about 900 ms on my machine.

user=> (run (cycle [s l2]) 100)
"Elapsed time: 37024.37508 msecs"
nil

Oddly, quite a bit less than 50 s. Perhaps reset! spends some time yielding the CPU for some reason?

user=> (run (cycle [s s l2 l2]) 100)
"Elapsed time: 46963.28936 msecs"
nil
user=> (run (cycle [s s l2 s l2 l2]) 100)
"Elapsed time: 36551.62528 msecs"
nil
user=> (run (randomize [s l2]) 100)
"Elapsed time: 51713.50044 msecs"
nil
user=> (run (repeat l2) 50)
"Elapsed time: 43197.77312 msecs"
nil

These numbers suggest that the CPU load will generally be divvied up equally among your cores, as long as the jobs don't yield the CPU (sleep, block on I/O, block on monitors). If they do yield the CPU here and there, it looks like pmap may perform slightly better, rather than slightly worse, if the jobs that do so form
Re: How does pmap partition its work?
On Jan 23, 2011, at 8:34 PM, Andy Fingerhut wrote:
> No, you cannot rely on pmap to do that.

Thanks for the detailed and informative answer. I'll probably settle for shuffling the list of inputs before passing them to pmap, and just accept that there may be some inefficiency in the distribution of work.
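A minimal sketch of that shuffling approach (the helper name `pmap-shuffled` is mine, not from the thread): shuffle the *indices* rather than the inputs, so the results can be restored to the original order afterwards.

```clojure
;; Spread easy/hard tasks out by shuffling, then re-sort the results
;; back into the original input order.
(defn pmap-shuffled [f xs]
  (let [v       (vec xs)
        idx     (shuffle (range (count v)))
        results (pmap (fn [i] [i (f (v i))]) idx)]
    (map second (sort-by first results))))

(println (pmap-shuffled #(* % %) (range 10)))
;; → (0 1 4 9 16 25 36 49 64 81)
```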
Re: How does pmap partition its work?
No, you cannot rely on pmap to do that.

pmap is lazy in the sequence it produces, so it tries not to work farther ahead of the consumer of its "output sequence" than the amount of parallelism it uses, which is the number of available processors plus 2.

Suppose you have 4 available processors, so pmap tries to keep at most 6 parallel invocations of your map function running. Suppose also, for example's sake, that the function you are using with pmap on each element is sequential, so it uses at most one processor at a time. Imagine that the numbers below are the times in seconds required to calculate your function on each element of the input sequence to pmap:

100 1 1 1 1 (15 more elements requiring 1 second each)
100 1 1 1 1 (15 more elements requiring 1 second each)

When some code tries to retrieve the first element of the lazy seq that is the output of pmap, 6 threads will start calculating. The threads calculating the function on the 2nd through 6th elements will finish soon (if scheduling is fair), but no thread will be started on the 7th element yet, because pmap is trying not to work too far ahead. Only when the function is done calculating the first element, and some consumer tries to access the 2nd element of the output sequence, will a thread start working on the 7th input element, and so on.

So if the consumer of the output sequence is trying to retrieve elements as quickly as possible, doing a negligible amount of processing on each one, the rough pattern of CPU busy time will be:

(1) about 1.5 seconds with all 4 processors busy working through the first 6 elements (1 second each);
(2) about 99 more seconds finishing the function on the first element, with only 1 processor busy at a time and the other 3 idle;
(3) about 14/4 seconds with all 4 processors busy working on the next 14 elements, each taking about 1 second of CPU time.
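That bounded look-ahead can be observed directly. A small sketch (the names are mine, not from the thread) tracks how many tasks are in flight at once; with an unchunked input, the high-water mark should stay near availableProcessors + 2:

```clojure
;; Measure pmap's peak concurrency. An (apply list ...) input avoids
;; chunking, so pmap's own look-ahead limit is what we observe.
(def in-flight (atom 0))
(def peak (atom 0))

(defn tracked [x]
  (swap! in-flight inc)
  (swap! peak max @in-flight)   ; record the high-water mark
  (Thread/sleep 20)
  (swap! in-flight dec)
  x)

(dorun (pmap tracked (apply list (range 64))))
(println "peak in-flight:" @peak
         "vs procs+2:" (+ 2 (.availableProcessors (Runtime/getRuntime))))
```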
Then this pattern repeats when the next "heavy" element, requiring 100 seconds to calculate the function, is reached.

Note: chunked input sequences can work at odds with pmap's "don't work too far ahead" approach. When a chunk is reached, threads are started in parallel for all elements in that chunk, so the number of parallel threads can easily exceed (availableProcessors + 2) during those times.

Amit Rathore's medusa is intended to let you keep processors busy, but with a different method of invoking it than pmap.

Andy

On Jan 23, 2011, at 5:56 PM, Michael Gardner wrote:

Suppose I have a sequence of tasks I'd like to parallelize using pmap. The amount of CPU time these tasks require varies greatly; in particular, many of them will require virtually no work. Can I rely on pmap to divide the work efficiently even if there is some pattern to the distribution of easy tasks (e.g. clumped at the beginning, or every nth)? Assume it's not possible to identify the easy tasks beforehand.
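The chunking note can be demonstrated with a sketch (mine, not from the thread): `range` produces a chunked seq, so realizing even one output element may kick off a whole chunk of tasks, while a plain list stays within the processors + 2 window.

```clojure
;; Count how many tasks pmap has started after we ask for only the
;; first result, for chunked vs. unchunked inputs.
(def started (atom 0))
(defn note [x] (swap! started inc) x)

(reset! started 0)
(first (pmap note (range 100)))              ; chunked seq (chunks of 32)
(println "started, chunked input:  " @started)

(reset! started 0)
(first (pmap note (apply list (range 100)))) ; unchunked list
(println "started, unchunked input:" @started)
```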