Reducers (and fork/join in general) are best suited for fine-grained 
computational parallelism on in-memory data. The problem in question 
involves processing more data than will fit in memory.

So the question, then, is what the best way is to parallelize computation 
over the stream. There are many ways to do so - pmap is one easy mechanism 
for getting parallelism over a lazy sequence, but its chunking behavior 
can easily be suboptimal for particular use cases. Stu's other solution 
(prepare-with-partition-then-fold) reads chunks of data into memory, then 
uses fork/join over the top of those chunks. 

Yet another possible solution is to use a pool of compute threads: read 
from the stream, hand compute tasks to the pool to execute on background 
threads, then reassemble the results at the end (or in a separate thread). 
The balancing act in this approach is getting the tasks to the right 
"chunkiness". Using a bounded, buffered queue is one technique to avoid 
having your input reader overrun the capacity of the processing system.
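A minimal sketch of this approach, with hypothetical names throughout 
(`process-stream`, `process-fn`, and `combine-fn` are illustrative, not 
from the thread):

```clojure
;; Sketch only: a fixed pool of workers consuming chunks from a
;; bounded queue. The queue capacity provides backpressure - the
;; reader blocks when the workers fall behind.
(import '[java.util.concurrent Executors LinkedBlockingQueue])

(defn process-stream
  "Feeds each chunk in `chunks` to a worker thread running `process-fn`,
  then combines the per-chunk results with `combine-fn`."
  [chunks worker-count process-fn combine-fn]
  (let [queue    (LinkedBlockingQueue. 16)  ;; bounded buffer
        pool     (Executors/newFixedThreadPool worker-count)
        results  (atom [])
        sentinel ::done
        worker   (fn []
                   (loop []
                     (let [chunk (.take queue)]
                       (when-not (= chunk sentinel)
                         (swap! results conj (process-fn chunk))
                         (recur)))))
        tasks    (doall (for [_ (range worker-count)]
                          (.submit pool ^Callable worker)))]
    (doseq [c chunks] (.put queue c))                 ;; producer
    (dotimes [_ worker-count] (.put queue sentinel))  ;; one poison pill per worker
    (doseq [t tasks] (.get t))                        ;; wait for all workers
    (.shutdown pool)
    (reduce combine-fn @results)))
```

Something like (process-stream (partition-all 10000 pages) 4 frequencies 
(partial merge-with +)) would then mirror the word-count example, with the 
chunk size controlling task granularity.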

I would also mention that using transients as you build input collections 
will be a big win. 
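For instance, a chunk reader that accumulates into a transient vector 
(a sketch - `read-chunk` is a hypothetical name for illustration):

```clojure
;; Sketch: build each input chunk with a transient vector rather than
;; repeatedly conj-ing onto a persistent one, avoiding per-element
;; persistent-structure allocation.
(defn read-chunk [lines]
  (persistent!
   (reduce (fn [acc line] (conj! acc line))
           (transient [])
           lines)))
```

The result of persistent! is an ordinary persistent vector that can be 
handed off safely to a compute task.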

Alex

On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:
>
> I would go a bit further and suggest that you do not use sequences at 
> all and work only with reducible/foldable collections. Make an input reader 
> which returns a foldable collection and you will have the most performant 
> solution. The issue of holding onto the head is being worked on right 
> now, see http://dev.clojure.org/jira/browse/CLJ-1250
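A sketch of what such an input reader might look like (hypothetical code; 
this implements only CollReduce, so the result is reducible but not 
parallel-foldable - full foldability would also require CollFold):

```clojure
;; Sketch: a reducible "collection" that reads lines straight from a
;; file during the reduce, so no sequence head is ever retained.
;; `reducible-lines` is a hypothetical name.
(require '[clojure.java.io :as io]
         '[clojure.core.protocols :as p])

(defn reducible-lines [path]
  (reify p/CollReduce
    (coll-reduce [_ f init]
      (with-open [rdr (io/reader path)]
        (loop [acc init]
          (if (reduced? acc)
            @acc  ;; honor early termination
            (if-let [line (.readLine ^java.io.BufferedReader rdr)]
              (recur (f acc line))
              acc)))))))
```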
>
> JW
>
> On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
>>
>> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
>>
>> I have posted an example that shows partition-then-fold at 
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
>> .
>>
>> I would be curious to know how this approach performs with your data. 
>>  With the generated data I used, the partition+fold and partition+pmap 
>> approaches both used most of my cores and had similar perf.
>>
>>
>> Hey Stuart, 
>>
>> Thanks for getting back to me.
>>
>> I've updated the code for my word count example on GitHub with (I 
>> believe) something that works along the lines you suggest:
>>
>> https://github.com/paulbutcher/parallel-word-count
>>
>> Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
>> Each of these runs counts the words in the first 10000 pages of Wikipedia:
>>
>> $ lein run 10000 ~/enwiki.xml sequential
>> "Elapsed time: 23630.443 msecs"
>> $ lein run 10000 ~/enwiki.xml fold
>> "Elapsed time: 8697.79 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 10000
>> "Elapsed time: 27393.703 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 10000
>> "Elapsed time: 37263.193 msecs"
>>
>>
>> As you can see, the foldable-seq version gives an almost 3x speedup 
>> relative to the sequential version, and both the partition-then-pmap and 
>> partition-then-fold versions are significantly slower.
>>
>> The last argument for the partition-then-pmap and partition-then-fold 
>> versions is a partition size. I've tried various different sizes with no 
>> material effect:
>>
>> $ lein run 10000 ~/enwiki.xml pthenf 1000
>> "Elapsed time: 43187.385 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 100000
>> "Elapsed time: 35702.651 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 1000
>> "Elapsed time: 34087.314 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 100000
>> "Elapsed time: 47340.969 msecs"
>>
>>
>> The performance of the partition-then-pmap version is actually much worse 
>> than the numbers above suggest. There's something very weird going on with 
>> (I guess) garbage collection - it takes a *long* time to finish after 
>> printing the elapsed time and the performance is completely pathological 
>> with larger page counts.
>>
>> Bottom line: the only way I've been able to obtain any kind of speedup 
>> remains foldable-seq.
>>
>> I'd be very grateful indeed if you could take a look at how I've 
>> implemented partition-then-fold to make sure that I've correctly captured 
>> your intent. Or if you have any suggestions for anything else that might 
>> work, or to explain the poor performance of partition-then-pmap and 
>> partition-then-fold.
>>
>> My guess is that the problem with partition-then-fold is the copying 
>> that's going on during the (into [] %). I can see that it is operating in 
>> parallel because the number of cores in use goes up, but the net effect is 
>> an overall slowdown rather than a speedup.
>>
>> That it performs worse than foldable-seq isn't surprising to me, given 
>> that it introduces an unnecessary copy.
>>
>> I still think that it's a crying shame to disallow folding over sequences 
>> - as the above shows, the gains both in performance and programming ease 
>> are significant, and it would only take a small modification to the 
>> reducers API to fix the holding-onto-head problem. What would be the 
>> downside of making this modification and allowing foldable sequences?
>>
>> --
>> paul.butcher->msgCount++
>>
>> Snetterton, Castle Combe, Cadwell Park...
>> Who says I have a one track mind?
>>
>> http://www.paulbutcher.com/
>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>> MSN: pa...@paulbutcher.com
>> AIM: paulrabutcher
>> Skype: paulrabutcher
>>  
>> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
>>
>> Hi Paul,
>>
>> Enjoying your book!
>>
>> Stu
>>
>>
>> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <pa...@paulbutcher.com>wrote:
>>
>>> I'm currently working on a book on concurrent/parallel development for 
>>> The Pragmatic Programmers. One of the subjects I'm covering is parallel 
>>> programming in Clojure, but I've hit a roadblock with one of the examples. 
>>> I'm hoping that I can get some help to work through it here.
>>>
>>> The example counts the words contained within a Wikipedia dump. It 
>>> should respond well to parallelisation (I have Java and Scala versions that 
>>> perform excellently) but I've been incapable of coming up with a nice 
>>> solution in Clojure.
>>>
>>> The code I'm working with is checked into 
>>> GitHub<https://github.com/paulbutcher/parallel-word-count>:
>>>
>>> The basic sequential algorithm is:
>>>
>>> (frequencies (mapcat get-words pages))
>>>
>>>
>>> If I run that on the first 10k pages of a Wikipedia dump, it takes ~21s 
>>> on my MacBook Pro.
>>>
>>> One way to parallelise it is to create a parallel version of frequencies 
>>> that uses reducers:
>>>
>>> (defn frequencies-fold [words]
>>>   (r/fold (partial merge-with +)
>>>           (fn [counts word]
>>>             (assoc counts word (inc (get counts word 0))))
>>>           words))
>>>
>>>
>>> And sure enough, if I use that along with the 
>>> foldable-seq<https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj> 
>>> utility I posted about 
>>> here<https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J> a 
>>> while ago, it runs in ~8s - almost a 3x speedup, which is not bad given 
>>> that the parallel version is unable to use transients.
>>>
>>> Unfortunately, as they currently stand, reducers have a fatal flaw that 
>>> means that, even with foldable-seq, they're basically useless with lazy 
>>> sequences. Reducers always hold onto the 
>>> head<https://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ> 
>>> of 
>>> the sequence they're given, so there's no way to use this approach for a 
>>> complete Wikipedia dump (which runs to around 40GiB).
>>>
>>> So the other approach I've tried is to use pmap:
>>>
>>> (defn frequencies-pmap [words]
>>>   (reduce (partial merge-with +)
>>>           (pmap frequencies
>>>                 (partition-all 10000 words))))
>>>
>>>
>>> But, for reasons I don't understand, this performs dreadfully - taking 
>>> ~26s, i.e. significantly slower than the sequential version.
>>>
>>> I've tried playing with different partition sizes without materially 
>>> affecting the result.
>>>
>>> So, what I'm looking for is either:
>>>
>>> a) An explanation for why the pmap-based solution performs so badly
>>>
>>> b) A way to fix the "holding onto head" problem that's inherent within 
>>> reducers.
>>>
>>> With the last of these in mind, it strikes me that the problem 
>>> fundamentally arises from the desire for reducers to follow the same basic 
>>> API as "normal" code. So:
>>>
>>> (reduce (filter ... (map ... coll)))
>>>
>>> becomes:
>>>
>>> (r/fold (r/filter ... (r/map ... coll)))
>>>
>>> A very small change to the reducers API - passing the collection to the 
>>> reduce and/or fold - would avoid the problem:
>>>
>>> (r/fold (r/filter ... (r/map ...)) coll)
>>>
>>> Anyway - I'd be very grateful for any insight into either of the above 
>>> questions. Or for suggestions for an alternative approach that might be 
>>> more fruitful.
>>>
>>> Many thanks in advance,
>>>
>>> --
>>> paul.butcher->msgCount++
>>>
>>> Snetterton, Castle Combe, Cadwell Park...
>>> Who says I have a one track mind?
>>>
>>> http://www.paulbutcher.com/
>>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>>> MSN: pa...@paulbutcher.com
>>> AIM: paulrabutcher
>>> Skype: paulrabutcher
>>>  
>>>
>>
>>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
