Re: How does pmap partition its work?

2011-01-28 Thread Stuart Sierra
David Liebke gave a talk at Clojure-Conj 2010 titled "From Concurrency to 
Parallelism" with detailed performance comparisons of map, pmap, and 
Fork/Join-style iteration.  Look for it on clojure.blip.tv in the near 
future!

-Stuart Sierra
clojure.com

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: How does pmap partition its work?

2011-01-27 Thread Ken Wesson
On Thu, Jan 27, 2011 at 5:05 PM, Rasmus Svensson  wrote:
> 2011/1/27 Ken Wesson :
>> On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner  wrote:
>>> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>>>
 If you simply want all tasks to be performed as quickly as possible,
 one alternative could be to use an ExecutorService (perhaps one
 created with newFixedThreadPool) with its invokeAll method. invokeAll
 takes a collection of callables (in clojure terms: you can pass it a
 seq of zero-arg functions) and returns a collection of futures. An
 ExecutorService could perhaps give you fine-grained control.

 I recently wrote a blog post on this; you might find it interesting:
 http://blog.raek.se/2011/01/24/executors-in-clojure/
>>>
>>> Thanks for the tip. By coincidence, I just stumbled across ExecutorService 
>>> yesterday via the example at http://clojure.org/concurrent_programming. I'm 
>>> never thrilled about having to use Java APIs directly, but in this case an 
>>> ExecutorService does what I want much better than pmap, and isn't too 
>>> difficult to use.
>>
>> Perhaps pmap should be rewritten to use ExecutorService, if that is so.
>
> Actually, 'pmap' is built on top of 'future' which is built on top of
> an ExecutorService.

Then how come they have different performance characteristics?



Re: How does pmap partition its work?

2011-01-27 Thread Rasmus Svensson
2011/1/27 Ken Wesson :
> On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner  wrote:
>> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>>
>>> If you simply want all tasks to be performed as quickly as possible,
>>> one alternative could be to use an ExecutorService (perhaps one
>>> created with newFixedThreadPool) with its invokeAll method. invokeAll
>>> takes a collection of callables (in clojure terms: you can pass it a
>>> seq of zero-arg functions) and returns a collection of futures. An
>>> ExecutorService could perhaps give you fine-grained control.
>>>
>>> I recently wrote a blog post on this; you might find it interesting:
>>> http://blog.raek.se/2011/01/24/executors-in-clojure/
>>
>> Thanks for the tip. By coincidence, I just stumbled across ExecutorService 
>> yesterday via the example at http://clojure.org/concurrent_programming. I'm 
>> never thrilled about having to use Java APIs directly, but in this case an 
>> ExecutorService does what I want much better than pmap, and isn't too 
>> difficult to use.
>
> Perhaps pmap should be rewritten to use ExecutorService, if that is so.

Actually, 'pmap' is built on top of 'future' which is built on top of
an ExecutorService.
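
For the curious, the single-collection core of pmap looks roughly like this (a simplified sketch of the clojure.core definition; 'future' is what ultimately submits work to an ExecutorService, and the "stay n ahead" window is processors + 2):

```clojure
(defn simple-pmap
  "Simplified single-collection pmap: creates a future per element,
   but only keeps (processors + 2) futures ahead of consumption."
  [f coll]
  (let [n (+ 2 (.availableProcessors (Runtime/getRuntime)))
        rets (map #(future (f %)) coll)
        step (fn step [[x & xs :as vs] fs]
               (lazy-seq
                 (if-let [s (seq fs)]
                   ;; still more than n unconsumed futures: deref one,
                   ;; which lazily forces one more future into existence
                   (cons (deref x) (step xs (rest s)))
                   ;; near the end: just drain the remaining futures
                   (map deref vs))))]
    (step rets (drop n rets))))
```

(simple-pmap inc (range 10)) behaves like (map inc (range 10)), just computed on future threads.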

// raek



Re: How does pmap partition its work?

2011-01-27 Thread Ken Wesson
On Thu, Jan 27, 2011 at 2:09 PM, Michael Gardner  wrote:
> On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:
>
>> If you simply want all tasks to be performed as quickly as possible,
>> one alternative could be to use an ExecutorService (perhaps one
>> created with newFixedThreadPool) with its invokeAll method. invokeAll
>> takes a collection of callables (in clojure terms: you can pass it a
>> seq of zero-arg functions) and returns a collection of futures. An
>> ExecutorService could perhaps give you fine-grained control.
>>
>> I recently wrote a blog post on this; you might find it interesting:
>> http://blog.raek.se/2011/01/24/executors-in-clojure/
>
> Thanks for the tip. By coincidence, I just stumbled across ExecutorService 
> yesterday via the example at http://clojure.org/concurrent_programming. I'm 
> never thrilled about having to use Java APIs directly, but in this case an 
> ExecutorService does what I want much better than pmap, and isn't too 
> difficult to use.

Perhaps pmap should be rewritten to use ExecutorService, if that is so.



Re: How does pmap partition its work?

2011-01-27 Thread Michael Gardner
On Jan 27, 2011, at 7:24 AM, Rasmus Svensson wrote:

> If you simply want all tasks to be performed as quickly as possible,
> one alternative could be to use an ExecutorService (perhaps one
> created with newFixedThreadPool) with its invokeAll method. invokeAll
> takes a collection of callables (in clojure terms: you can pass it a
> seq of zero-arg functions) and returns a collection of futures. An
> ExecutorService could perhaps give you fine-grained control.
> 
> I recently wrote a blog post on this; you might find it interesting:
> http://blog.raek.se/2011/01/24/executors-in-clojure/

Thanks for the tip. By coincidence, I just stumbled across ExecutorService 
yesterday via the example at http://clojure.org/concurrent_programming. I'm 
never thrilled about having to use Java APIs directly, but in this case an 
ExecutorService does what I want much better than pmap, and isn't too difficult 
to use.



Re: How does pmap partition its work?

2011-01-27 Thread Rasmus Svensson
2011/1/24 Michael Gardner :
> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The 
> amount of CPU time these tasks require varies greatly; in particular, many of 
> them will require virtually no work. Can I rely on pmap to divide the work 
> efficiently even if there is some pattern to the distribution of easy tasks 
> (e.g. clumped at the beginning, or every nth)? Assume it's not possible to 
> identify the easy tasks beforehand.

If you simply want all tasks to be performed as quickly as possible,
one alternative could be to use an ExecutorService (perhaps one
created with newFixedThreadPool) with its invokeAll method. invokeAll
takes a collection of callables (in clojure terms: you can pass it a
seq of zero-arg functions) and returns a collection of futures. An
ExecutorService could perhaps give you fine-grained control.
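
A minimal sketch of that approach (the pool size of 4 and the squaring task are arbitrary choices for illustration; a Clojure zero-arg fn is already a Callable, and a Clojure seq is already a java.util.Collection, so invokeAll accepts them directly):

```clojure
(import '(java.util.concurrent Executors))

;; invokeAll blocks until every task has completed and returns
;; the Futures in the same order as the input callables.
(def pool (Executors/newFixedThreadPool 4))

(def results
  (doall (map #(.get %)
              (.invokeAll pool (for [n (range 8)] #(* n n))))))

(.shutdown pool)
;; results => (0 1 4 9 16 25 36 49)
```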

I recently wrote a blog post on this; you might find it interesting:
http://blog.raek.se/2011/01/24/executors-in-clojure/

// raek



Re: How does pmap partition its work?

2011-01-25 Thread Michael Gardner
On Jan 25, 2011, at 12:13 PM, Andy Fingerhut wrote:

> In my original message describing pmap's behavior, there was a little 
> "gotcha" near the end:
> 
> "Note: Sometimes working at odds with pmap's "Don't work too far ahead" 
> approach is if the input sequence to pmap is chunked.  When a chunk is 
> reached, all elements in that chunk have threads start in parallel, so the 
> number of parallel threads can easily exceed (availableProcessors+2) during 
> those times."
> 
> 
> (range n) produces heavily chunked sequences.  Chunks are an optimization on 
> time and memory when representing large sequences, but they do cause pmap as 
> written today to suddenly fire off all futures in the chunk as parallel 
> threads when it reaches the chunk.

Interesting! Sorry I missed this gotcha from your original email; I hadn't run 
across this problem at the time and thus the info didn't stick in my head. 
Thanks for the link to modified-pmap, too.

This seems like undesirable behavior from pmap to me (and/or unwanted 
chunking-induced behavior[1]). At the least I would like to see pmap adopt an 
optional extra argument that works like your modified-pmap's num-threads 
argument, and possibly also the arrival of Clojure's promised de-chunking 
interface.

[1] http://disclojure.org/documents/clojure-1-1-0-rc1-release-notes/ (section 
2.3: "Please share any use cases where chunked processing has resulted in 
behavioral differences that matter to you on the Clojure google group.")



Re: How does pmap partition its work?

2011-01-25 Thread Andy Fingerhut
In my original message describing pmap's behavior, there was a little  
"gotcha" near the end:


"Note: Sometimes working at odds with pmap's "Don't work too far  
ahead" approach is if the input sequence to pmap is chunked.  When a  
chunk is reached, all elements in that chunk have threads start in  
parallel, so the number of parallel threads can easily exceed  
(availableProcessors+2) during those times."



(range n) produces heavily chunked sequences.  Chunks are an  
optimization on time and memory when representing large sequences, but  
they do cause pmap as written today to suddenly fire off all futures  
in the chunk as parallel threads when it reaches the chunk.
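
A quick REPL check makes the chunking visible (a sketch; 32 is the chunk size ranges use, and the realized atom is only there to count calls):

```clojure
;; map over a chunked seq realizes a whole chunk at a time,
;; so asking for the first element runs f 32 times, not once.
(def realized (atom 0))

(def s (map (fn [x] (swap! realized inc) x) (range 100)))

(first s)
@realized   ; => 32, not 1
```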


A modified version of pmap called (boringly) "modified-pmap" that does  
not behave this way when reaching a chunk in its input sequence is  
part of one of my submissions to the shootout web site, because I did  
not want the extra parallelism:


http://shootout.alioth.debian.org/u32/program.php?test=knucleotide&lang=clojure&id=2

It is nearly identical to the original pmap.  Comparing to the source  
of the original is the easiest way to see the changes, if you are  
curious.


Andy


On Jan 25, 2011, at 6:45 AM, Michael Gardner wrote:

I have run across something else I don't understand about pmap. Why  
does the following:


(pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))

result in all 32 "sleep" processes being run at once? I thought pmap  
used n+2 threads, where n is the number of processors/cores  
available (I have two cores).




Re: How does pmap partition its work?

2011-01-25 Thread Michael Gardner
On Jan 25, 2011, at 9:33 AM, Ken Wesson wrote:

> Well, that's weird, because the documentation *I* read says it
> composits the arguments together into a command line and hands off to
> Runtime/exec. And the documentation of *that* says it returns a
> Process object and the process it launches runs asynchronously.

That doesn't necessarily contradict what I observed. The way I read the docs[1] 
is that sh does indeed use Runtime.exec(), but then waits for the sub-process 
to complete and returns a map of its exit status, etc. That's not explicitly 
stated, but it's the only explanation consistent with what I observe. For 
comparison, what behavior do you observe with (do (clojure.java.shell/sh 
"sleep" "10") nil)?

> If that were the case, though, your pmap should indeed only be running
> cores+2 instances at once. Which is not what you observed.
> 
> Something's hinky here.

Indeed. According to a previous thread[2], something similar happens with 
Thread/sleep; my guess is that sh sleeps (or something similar) while it waits 
for the sub-process to finish and thus suffers from the same issue.
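
A quick REPL check (assuming a Unix sleep binary on the PATH) supports this reading: sh returns a plain map and does not return until the subprocess has exited.

```clojure
(require '[clojure.java.shell :as sh])

;; If sh really blocks until the subprocess exits, sleeping for one
;; second must take at least ~1000 ms of wall time:
(def start (System/nanoTime))
(def result (sh/sh "sleep" "1"))   ; returns a map like {:exit 0 :out "" :err ""}
(def elapsed-ms (/ (- (System/nanoTime) start) 1000000.0))
```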

[1] 
http://clojure.github.com/clojure/clojure.java.shell-api.html#clojure.java.shell/sh
[2] http://groups.google.com/group/clojure/browse_thread/thread/9b3581d9f4b4afd6



Re: How does pmap partition its work?

2011-01-25 Thread Ken Wesson
On Tue, Jan 25, 2011 at 10:21 AM, Michael Gardner  wrote:
> On Jan 25, 2011, at 9:06 AM, Ken Wesson wrote:
>
>> sh is asynchronous. It calls Runtime/exec, which launches the sleep as
>> a separate process and immediately returns a Process object (which
>> your pmap should be returning a seq of). It may produce n+2 Process
>> objects at a time but it produces them all before the first of the
>> asynchronous sleep processes finishes, so for a while all of the
>> latter are running at the same time.
>
> Really? The documentation for sh claims to return a map, which is what I 
> observe.

Well, that's weird, because the documentation *I* read says it
composits the arguments together into a command line and hands off to
Runtime/exec. And the documentation of *that* says it returns a
Process object and the process it launches runs asynchronously.

> And the call to sh doesn't return until the sub-process has finished

If that were the case, though, your pmap should indeed only be running
cores+2 instances at once. Which is not what you observed.

Something's hinky here.



Re: How does pmap partition its work?

2011-01-25 Thread Michael Gardner
On Jan 25, 2011, at 9:06 AM, Ken Wesson wrote:

> sh is asynchronous. It calls Runtime/exec, which launches the sleep as
> a separate process and immediately returns a Process object (which
> your pmap should be returning a seq of). It may produce n+2 Process
> objects at a time but it produces them all before the first of the
> asynchronous sleep processes finishes, so for a while all of the
> latter are running at the same time.

Really? The documentation for sh claims to return a map, which is what I 
observe. And the call to sh doesn't return until the sub-process has finished, 
even if I discard its return value (so it can't be the case that the resulting 
"map" is actually some lazy thing that blocks when I try to access its 
contents).



Re: How does pmap partition its work?

2011-01-25 Thread Ken Wesson
On Tue, Jan 25, 2011 at 9:45 AM, Michael Gardner  wrote:
> I have run across something else I don't understand about pmap. Why does the 
> following:
>
> (pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))
>
> result in all 32 "sleep" processes being run at once? I thought pmap used n+2 
> threads, where n is the number of processors/cores available (I have two 
> cores).

sh is asynchronous. It calls Runtime/exec, which launches the sleep as
a separate process and immediately returns a Process object (which
your pmap should be returning a seq of). It may produce n+2 Process
objects at a time but it produces them all before the first of the
asynchronous sleep processes finishes, so for a while all of the
latter are running at the same time.



Re: How does pmap partition its work?

2011-01-25 Thread Michael Gardner
I have run across something else I don't understand about pmap. Why does the 
following:

(pmap (fn [_] (clojure.java.shell/sh "sleep" "10")) (range 32))

result in all 32 "sleep" processes being run at once? I thought pmap used n+2 
threads, where n is the number of processors/cores available (I have two cores).



Re: How does pmap partition its work?

2011-01-24 Thread Ken Wesson
On Mon, Jan 24, 2011 at 11:05 AM, Armando Blancas
 wrote:
>> This is much faster than either of the other eager-pmaps I posted to
>> this thread, and yet it's only using 60% CPU on my dual-core box. That
>> means there's still another x1.6 or so speedup possible(!) but I'm not
>> sure how.
>>
>
> Could -server make a difference here?

I did some more investigating of this. It seems that pmap itself only
uses ~70% CPU on my machine. The same happens with eager-pmap and with
the jumbled-pmap I just posted. All produce results comparably fast
when given a giant math problem to chug on.

Probably it's an artifact of the test cases mainly using jobs that
either are fast or use Thread/sleep to be artificially slow. I'm not
sure why this would persist when the unit of parallelization is a map
of the job over a sizable partition or the job has a busy-loop in it
instead of a Thread/sleep to make it heavyweight, though.



Re: How does pmap partition its work?

2011-01-24 Thread Ken Wesson
On Mon, Jan 24, 2011 at 10:29 AM, Michael Gardner  wrote:
> Thanks for the work, Ken.

You're welcome.

> Clojure's multi-threaded performance can be mysterious indeed

I think that may be generally true of multi-processing of every kind. :)

> Anyway, for now I will probably just stick with pmap + shuffle, since it's 
> dead-simple and should be enough for my purposes.

Probably a good idea, if keeping the input in order is not important.
If you do need to keep the ordering, you can still do the items in a
somewhat perturbed order, if you're willing to give up some laziness:

(defn permute
  "Applies a permutation to a vector. Permutation should be a vector
   of same length with (range n) in some shuffled order. If v is shorter
   than perm, just returns v."
  [perm v]
  (if (< (count v) (count perm))
    v
    (vec (map #(nth v %) perm))))

(defn i-perm
  "Inverts a permutation."
  [perm]
  (let [r (range (count perm))]
    (persistent!
      (reduce
        (fn [out [pi ix]] (assoc! out pi ix))
        (transient (vec r))
        (map vector perm r)))))
(defn shuffle
  "Shuffles a vector"
  [v]
  (let [s (count v)]
    (loop [out (transient v) n (dec s)]
      (if (zero? n)
        (persistent! out)
        (let [i (rand-int (inc n))
              t (out n)]
          (recur
            (assoc!
              (assoc! out n (out i))
              i t)
            (dec n)))))))

Test that shuffle works:

user=> (frequencies (map shuffle (take 500 (repeat [1 2 3]))))
{[3 1 2] 88,
 [2 3 1] 91,
 [1 2 3] 83,
 [2 1 3] 84,
 [3 2 1] 82,
 [1 3 2] 72}

All six possible permutations occur, at reasonably similar rates.

(defn jumbleize
  "Returns a pair of functions one of which will jumble and one of
   which will unjumble a seq; the seq is chunked into parts of
   size n and those parts are jumbled. The functions are lazy in
   that only one chunk is realized at a time."
  [n]
  (let [rn (vec (range n))
        perms (repeatedly #(shuffle rn))
        ips (map i-perm perms)
        jumble (fn [s perm-seq]
                 (mapcat permute
                   perm-seq
                   (map vec (partition n n [] s))))]
    [#(jumble % perms) #(jumble % ips)]))

(defn jumbled-pmap
  "Like pmap, only the order in which it sees sequence elements is jumbled
   somewhat, which may improve efficiency if easy and difficult jobs have a
   regular arrangement in the input. Chunks the input by chunk-size and
   jumbles each chunk internally before calling pmap; puts the output back
   in order."
  [chunk-size f & colls]
  (let [[jumbler unjumbler] (jumbleize chunk-size)]
    (unjumbler (apply pmap f (map jumbler colls)))))

user=> (drop 97 (jumbled-pmap 16 #(* % %) (range 100)))
(9409 9604 9801)

user=> (take 3 (drop 50 (jumbled-pmap 16 #(* % %) (range 100))))
(2500 2601 2704)

user=> (take 3 (drop 50 (jumbled-pmap 16 #(+ (* 1000 %1) %2) (range
100) (range 100 180))))
(50150 51151 52152)

That last, BTW, shows it working properly even with multiple input
colls of different lengths.

user=> (take 3 (drop 50 (jumbled-pmap 16 #(+ (* 1000 %1) %2) (iterate
inc 0) (drop 100 (iterate inc 0)))))
(50150 51151 52152)

And this shows that it's still lazy enough not to blow up on infinite inputs. :)

The internal sequences of random permutations and their inverses, as
well as the input sequences, lose their heads:

user=> (take 3 (drop 5000 (jumbled-pmap 16 #(+ (* 1000 %1) %2)
(iterate inc 0) (drop 100 (iterate inc 0)))))
(5005100 5006101 5007102)

Memory use stayed constant at a bit more than 64MB, which was probably
the -Xms parameter to the JVM instance Enclojure ran its REPL in.



Re: How does pmap partition its work?

2011-01-24 Thread Armando Blancas
> This is much faster than either of the other eager-pmaps I posted to
> this thread, and yet it's only using 60% CPU on my dual-core box. That
> means there's still another x1.6 or so speedup possible(!) but I'm not
> sure how.
>

Could -server make a difference here?



Re: How does pmap partition its work?

2011-01-24 Thread Michael Gardner
On Jan 23, 2011, at 10:56 PM, Ken Wesson wrote:

> I've managed to make this more efficient, by making each agent send
> process a larger portion of the job than one single item:
> 
> (defn eager-pmap [f & colls]
>   (let [cores (.. Runtime getRuntime availableProcessors)
>         agents (cycle (for [_ (range cores)] (agent nil)))
>         ccolls (map (partial partition 16 16 []) colls)
>         promises (apply map (fn [& _] (promise)) ccolls)]
>     (doall
>       (apply map (fn [a p & chunks]
>                    (send a
>                      (fn [_]
>                        (deliver p
>                          (apply map f chunks)))))
>         agents promises ccolls))
>     (mapcat deref promises)))
> 
> This is much faster than either of the other eager-pmaps I posted to
> this thread, and yet it's only using 60% CPU on my dual-core box. That
> means there's still another x1.6 or so speedup possible(!) but I'm not
> sure how.
> 
> Raising the size of the partitions from 16 all the way to 1024 makes
> no noticeable difference.

Thanks for the work, Ken. Clojure's multi-threaded performance can be 
mysterious indeed, which is a shame when one of the major benefits of 
functional code is that it's easily parallelizable. Anyway, for now I will 
probably just stick with pmap + shuffle, since it's dead-simple and should be 
enough for my purposes.



Re: How does pmap partition its work?

2011-01-23 Thread Ken Wesson
On Sun, Jan 23, 2011 at 11:34 PM, Ken Wesson  wrote:
> Other posts to the thread indicate that longer-range patterns in the
> inputs could cause problems. If you know you'll be consuming the full
> sequence, try this:
>
> (defn eager-pmap [f & colls]
>   (map deref (doall (apply map #(future (f %)) colls))))
>
> This creates all of the futures right away (due to the doall) and
> leaves it up to the future implementation to distribute work around
> some thread pool. On my system, it creates 80 or so of them each time
> it's run on a seq of 10 things, which persist for some time
> afterward -- so, a bit of a problem there. Another one that seems like
> it ought to be more efficient:
>
> (defn eager-pmap [f & colls]
>   (let [cores (.. Runtime getRuntime availableProcessors)
>         agents (cycle (for [_ (range cores)] (agent nil)))
>         promises (apply map (fn [& _] (promise)) colls)]
>     (doall
>       (apply map (fn [a p & args]
>                    (send a
>                      (fn [_] (deliver p (apply f args)))))
>         agents promises colls))
>     (map deref promises)))
>
> This one uses the agent "send" thread pool to divide up the work.
> Oddly, it's actually 2-3x slower than the previous and only uses 75%
> CPU on a dual-core machine, though it doesn't leak threads.

I've managed to make this more efficient, by making each agent send
process a larger portion of the job than one single item:

(defn eager-pmap [f & colls]
  (let [cores (.. Runtime getRuntime availableProcessors)
        agents (cycle (for [_ (range cores)] (agent nil)))
        ccolls (map (partial partition 16 16 []) colls)
        promises (apply map (fn [& _] (promise)) ccolls)]
    (doall
      (apply map (fn [a p & chunks]
                   (send a
                     (fn [_]
                       (deliver p
                         (apply map f chunks)))))
        agents promises ccolls))
    (mapcat deref promises)))

This is much faster than either of the other eager-pmaps I posted to
this thread, and yet it's only using 60% CPU on my dual-core box. That
means there's still another x1.6 or so speedup possible(!) but I'm not
sure how.

Raising the size of the partitions from 16 all the way to 1024 makes
no noticeable difference.



Re: How does pmap partition its work?

2011-01-23 Thread Ken Wesson
Other posts to the thread indicate that longer-range patterns in the
inputs could cause problems. If you know you'll be consuming the full
sequence, try this:

(defn eager-pmap [f & colls]
  (map deref (doall (apply map #(future (f %)) colls))))

This creates all of the futures right away (due to the doall) and
leaves it up to the future implementation to distribute work around
some thread pool. On my system, it creates 80 or so of them each time
it's run on a seq of 10 things, which persist for some time
afterward -- so, a bit of a problem there. Another one that seems like
it ought to be more efficient:

(defn eager-pmap [f & colls]
  (let [cores (.. Runtime getRuntime availableProcessors)
        agents (cycle (for [_ (range cores)] (agent nil)))
        promises (apply map (fn [& _] (promise)) colls)]
    (doall
      (apply map (fn [a p & args]
                   (send a
                     (fn [_] (deliver p (apply f args)))))
        agents promises colls))
    (map deref promises)))

This one uses the agent "send" thread pool to divide up the work.
Oddly, it's actually 2-3x slower than the previous and only uses 75%
CPU on a dual-core machine, though it doesn't leak threads. It looks
like agents, or promise/deliver, or both have lots of overhead
compared to futures. (Which is odd since the obvious implementation of
future is (defmacro future [& body] `(let [p# (promise)] (.start
(Thread. (fn [] (deliver p# (do ~@body))))) p#)). So maybe it's agents
that are causing the inefficiency? On the other hand, the result of
evaluating (future foo) doesn't appear to be a naked promise, though
it could still be a wrapped one, and that implementation would not
limit itself to 80-odd threads; maybe future instead uses the agent
"send-off" pool?!)

I'd recommend just using the first version of eager-pmap in this email
if you need an eager pmap. The extra threads it creates are not too
numerous (it doesn't create 100,000 threads for an input seq of
100,000 items) and they *do* seem to disappear again *eventually*
(somewhere between thirty seconds and several minutes; I can't be more
precise with the data I've gathered so far).
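If you want eager evaluation with a bounded thread count, the ExecutorService approach suggested elsewhere in this thread is straightforward. A sketch (bounded-pmap is a made-up name and the pool size is an illustrative choice):

```clojure
(import '(java.util.concurrent Executors Future))

(defn bounded-pmap
  "Eagerly runs (f x) for every x in coll on a fixed-size thread pool
  and returns the results in order. invokeAll blocks until every task
  is done, so no threads outlive the call."
  [f coll]
  (let [pool (Executors/newFixedThreadPool
               (.availableProcessors (Runtime/getRuntime)))]
    (try
      ;; Clojure fns are Callables, so a vector of thunks can be passed
      ;; straight to invokeAll.
      (mapv #(.get ^Future %)
            (.invokeAll pool (mapv (fn [x] (fn [] (f x))) coll)))
      (finally
        (.shutdown pool)))))
```

Unlike the future-based version, this never runs more than one task per pool thread at a time, so it can't leak threads.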

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: How does pmap partition its work?

2011-01-23 Thread Ken Wesson
On Sun, Jan 23, 2011 at 10:31 PM, Ken Wesson  wrote:
> On Sun, Jan 23, 2011 at 8:56 PM, Michael Gardner  wrote:
>> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The 
>> amount of CPU time these tasks require varies greatly; in particular, many 
>> of them will require virtually no work. Can I rely on pmap to divide the 
>> work efficiently even if there is some pattern to the distribution of easy 
>> tasks (e.g. clumped at the beginning, or every nth)? Assume it's not 
>> possible to identify the easy tasks beforehand.
>
> Here's a little data:
> user=> (def printer (agent nil))
> #'user/printer
> user=> (defn xyz [n] (send printer (fn [_] println
> (Thread/currentThread))) (* n n))

Eh, that's weird. That's not the definition I used. I used

(defn xyz [n]
  (let [t (Thread/currentThread)]
    (send printer (fn [_] (println t)))
    (* n n)))

of course so it would be the pmap thread, not the agent thread, in the
printout. I had been experimenting with investigating the agent thread
pools too, so the defn must have come from that part of the REPL
session.

This data:

> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #
> #

was generated using the latter defn (the one that captures
Thread/currentThread outside the send).



Re: How does pmap partition its work?

2011-01-23 Thread Ken Wesson
On Sun, Jan 23, 2011 at 8:56 PM, Michael Gardner  wrote:
> Suppose I have a sequence of tasks I'd like to parallelize using pmap. The 
> amount of CPU time these tasks require varies greatly; in particular, many of 
> them will require virtually no work. Can I rely on pmap to divide the work 
> efficiently even if there is some pattern to the distribution of easy tasks 
> (e.g. clumped at the beginning, or every nth)? Assume it's not possible to 
> identify the easy tasks beforehand.

Here's a little data:
user=> (def printer (agent nil))
#'user/printer
user=> (defn xyz [n] (send printer (fn [_] println
(Thread/currentThread))) (* n n))
#'user/xyz
user=> (xyz 3)
#
9
user=> (pmap xyz (range 20))
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
(0 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361)
user=>

The fiddling with an agent is to prevent individual println outputs
from getting intermingled; instead the lines are perhaps slightly out
of order but at least are not mangled into unreadability.
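That serializing trick can be packaged as a tiny helper (a sketch; safe-println is a made-up name):

```clojure
;; Funnel all output through one agent: whole lines from different
;; threads may still arrive out of order, but the characters within a
;; line are never interleaved, because the agent runs its actions
;; one at a time.
(def printer (agent nil))

(defn safe-println [& args]
  (send printer (fn [_] (apply println args))))
```

Call (await printer) if you need to make sure everything has been flushed before, say, exiting.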

As you can see, at least seven different threads are used, in a
somewhat random order; not the simple alternating pattern of two you
might naively expect on a dual-core machine. It seems to prefer thread
22 for some reason.

I don't know exactly what is going on under the hood, but it looks
like it either involves some degree of randomness or it tries to
load-balance (and something else was sometimes preempting threads -- I
don't know what; when I ran this, nothing on my system was doing
anything CPU-intensive at a priority equal to or higher than the REPL
process, but NetBeans and the REPL were both a bit sluggish; some sort
of Windows weirdness). Either way, it's likely that a mix of easy and
hard jobs, even with a pattern, won't screw things up too much.

That can be further tested:

user=> (defn heavy [f amt] (fn [x] (Thread/sleep amt) (f x)))
#'user/heavy
user=> (defn s [x] (* x x))
#'user/s
user=> (def l (heavy s 1000))
#'user/l
user=> (defn run [fseq n] (time (doall (pmap #(%1 %2) fseq (range n)))) nil)
#'user/run
user=> (run (cycle [s l]) 100)
"Elapsed time: 17141.64872 msecs"
nil
user=> (run (cycle [s s l l]) 100)
"Elapsed time: 17020.59136 msecs"
nil
user=> (run (cycle [s s l s l l]) 100)
"Elapsed time: 17007.0366 msecs"
nil
user=> (defn randomize [s] (let [n (count s)] (repeatedly #(nth s (rand-int n)))))
#'user/randomize
user=> (run (randomize [s l]) 100)
"Elapsed time: 18015.8562 msecs"
nil

As you can see, the timings are fairly consistent when a 50/50 mix of
short (<1msec) and long (>1000msec) tasks get parallelized using pmap,
with several possible patterns and also with no pattern to their
distribution.

This was again on a dual-core machine; it's interesting that the
numbers are around 17-18s instead of 25s (100 jobs, averaging 500ms
each, split between two cores). The use of more than two threads in
the thread pool is presumably the cause of this; Thread/sleep lets
another of the threads make progress. It didn't collapse down to just
1s so it's also not the case that pmap is spawning as many threads as
will maximize throughput (fifty running the slow jobs plus two running
the fast ones on both cores while the slow ones sleep). Since the jobs
would cumulatively take 50s if run serially, the speedup factor from
pmap here is just under 3.
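The speedup figure works out like this (using the observed time from the first run above):

```clojure
;; 50 short jobs (~1 ms each) plus 50 long jobs (~1000 ms each) would
;; take about 50 s run serially; pmap finished them in ~17.1 s.
(let [serial-ms   (+ (* 50 1) (* 50 1000))  ; => 50050
      observed-ms 17141.6]
  (/ serial-ms observed-ms))
;; => just under 3
```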

One more data point:

user=> (run (repeat l) 50)
"Elapsed time: 10172.37124 msecs"
nil

That's fifty long jobs without any short ones. The speedup factor here
is 5 rather than 3. This shows that it will allow five, but no more,
jobs to be running concurrently, and that it doesn't load balance. If
it did, adding the short jobs wouldn't slow it down more than a few
ms; in fact doing so almost doubles the time, so it seems the jobs are
preallocated in some manner and the short jobs take up half the
available threads in the pool. (This is odd given that it seemed
before to be using at least seven threads, not just five.)

Of course, real jobs are likely to be CPU-intensive instead of
yielding like Thread/sleep.

user=> (def foo (atom nil))
#'user/foo
user=> (defn l2 [x] (doseq [a (range 750)] (reset! foo (* a x))) (* x x))
#'user/l2

The number 750 causes l2 to take about 900ms on my machine.

user=> (run (cycle [s l2]) 100)
"Elapsed time: 37024.37508 msecs"
nil

Oddly, quite a bit less than 50s. Perhaps reset! spends some time
yielding the CPU for some reason?

user=> (run (cycle [s s l2 l2]) 100)
"Elapsed time: 46963.28936 msecs"
nil
user=> (run (cycle [s s l2 s l2 l2]) 100)
"Elapsed time: 36551.62528 msecs"
nil
user=> (run (randomize [s l2]) 100)
"Elapsed time: 51713.50044 msecs"
nil
user=> (run (repeat l2) 50)
"Elapsed time: 43197.77312 msecs"
nil

These numbers suggest that the CPU load will generally be divvied up
equally among your cores, if the jobs don't yield the CPU (sleep,
block on I/O, block on monitors). If they do yield the CPU here and
there, it looks like pmap may perform slightly better, rather than
slightly worse, if the jobs that do so form part of the mix.

Re: How does pmap partition its work?

2011-01-23 Thread Michael Gardner
On Jan 23, 2011, at 8:34 PM, Andy Fingerhut wrote:

> No, you cannot rely on pmap to do that.


Thanks for the detailed and informative answer. I'll probably settle for 
shuffling the list of inputs before passing them to pmap, and just accept that 
there may be some inefficiency in the distribution of work.



Re: How does pmap partition its work?

2011-01-23 Thread Andy Fingerhut

No, you cannot rely on pmap to do that.

pmap is lazy in the sequence it produces, so it tries not to work  
farther ahead of the consumer of its "output sequence" than the amount  
of parallelism it uses, which is the number of available processors  
plus 2.
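That window comes straight from pmap's definition in clojure.core:

```clojure
;; The work-ahead window pmap uses, as computed in the let binding at
;; the top of clojure.core/pmap: available processors plus two.
(defn pmap-window []
  (+ 2 (.availableProcessors (Runtime/getRuntime))))
```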


Suppose you have 4 available processors, so pmap tries to keep at most  
6 parallel invocations of your map function running.  Suppose also for  
example's sake that the function you are using with pmap on each  
element is sequential, so it uses at most one processor at a time.


Imagine that the numbers below are the times in seconds required to  
calculate your function on each element of the input sequence to pmap:


100 1 1 1 1 (15 more elements requiring 1 second each) 100 1 1 1 1 (15  
more elements requiring 1 second each)


When some code tries to retrieve the first element of the lazy seq  
that is the output of pmap, 6 threads will start calculating.  The  
threads for calculating the function on the 2nd through 6th elements  
will finish soon (if scheduling is fair), but no threads will be  
invoked to calculate the function on the 7th element yet, because pmap  
is trying not to work too far ahead.  When the function is done  
calculating the first element, and some consumer tries to access the  
2nd element of the output sequence, then a thread will be started  
working on the 7th input element, and so on.


So if the consumer of the output sequence is trying to retrieve  
elements as quickly as possible and doing some negligible amount of  
processing on each one, here will be the rough pattern of CPU busy time:


(1) 1.5 seconds of 4 processors working for 1 second each on the first  
6 elements.


(2) about 99 seconds more time finishing the function on the first  
element, but only 1 processor busy at a time, the other 3 idle


(3) about 14/4 seconds of all 4 processors busy working on the next 14  
elements, each taking about 1 sec of CPU time.


Then this pattern repeats when the next "heavy element" requiring 100  
seconds to calculate the function is reached.
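This scenario can be simulated at a smaller scale with Thread/sleep standing in for work (times scaled down so it runs in well under a second; exact numbers will vary with core count):

```clojure
;; One "heavy" 200 ms task followed by 20 "light" 5 ms tasks. Because
;; pmap works only (cores + 2) elements ahead of its consumer, and the
;; consumer is stuck waiting on the heavy head element, elapsed time
;; stays close to the heavy task's 200 ms rather than the 300 ms
;; serial total.
(defn simulate [durations-ms]
  (let [t0 (System/nanoTime)]
    (doall (pmap (fn [ms] (Thread/sleep ms) ms) durations-ms))
    (/ (- (System/nanoTime) t0) 1e6)))

;; (simulate (cons 200 (repeat 20 5))) ; typically ~200-250 ms
```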


Note: sometimes working at odds with pmap's "don't work too far ahead"  
approach is chunking. If the input sequence to pmap is chunked, then  
when a chunk is reached, threads are started for all elements of that  
chunk in parallel, so the number of parallel threads can easily exceed  
(availableProcessors + 2) during those times.
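The chunking effect is easy to probe. A sketch (unchunk is a common idiom for rebuilding a seq without chunks; the thread counts are approximate and vary by Clojure version and core count):

```clojure
;; Seq'ing a vector yields chunks of 32, so pmap fires a whole chunk of
;; futures at once; unchunk rebuilds a seq one element at a time for
;; comparison.
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

;; Count how many distinct pool threads a pmap run touches.
(defn threads-used [coll]
  (count (distinct (pmap (fn [_]
                           (Thread/sleep 25)
                           (.getName (Thread/currentThread)))
                         coll))))

;; (threads-used (vec (range 64)))      ; typically ~32: a whole chunk at once
;; (threads-used (unchunk (range 64)))  ; typically close to cores + 2
```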


Amit Rathore's medusa is intended to let you keep processors busy, but  
with a different method of invoking it than pmap.


Andy


On Jan 23, 2011, at 5:56 PM, Michael Gardner wrote:

Suppose I have a sequence of tasks I'd like to parallelize using  
pmap. The amount of CPU time these tasks require varies greatly; in  
particular, many of them will require virtually no work. Can I rely  
on pmap to divide the work efficiently even if there is some pattern  
to the distribution of easy tasks (e.g. clumped at the beginning, or  
every nth)? Assume it's not possible to identify the easy tasks  
beforehand.



