Re: Parallelising over a lazy sequence - request for help

2013-10-02 Thread Paul Butcher
Alan,

Apologies for the delayed reply - I remember Iota well (there was some 
cross-fertilisation between it and foldable-seq a few months back IIRC :-)

Having said that, I don't think that Iota will help in my particular situation 
(although I'd be delighted to be proven wrong). Given that the file I'm 
processing is an XML file, and will therefore have to pass through an XML 
parser, I'm stuck with dealing with sequences at some point along the way, 
unless I write an XML parser on top of the reducers framework.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 30 Sep 2013, at 11:15, Alan Busby  wrote:

> Sorry to jump in, but I thought it worthwhile to add a couple of points 
> (apologies for being brief):
> 
> 1. Reducers work fine with data much larger than memory, you just need to 
> mmap() the data you're working with so Clojure thinks everything is in memory 
> when it isn't. Reducer access is fairly sequential, not random, so spinning 
> disks work great here. 
> 
> 2. A 40GB XML file is very often many many smaller XML documents aggregated 
> together. It's often faster to separate each document into its own line (via 
> various UNIX tools) and parse each line separately. I typically do something 
> like $ zcat bigxml.gz | tr '\n' ' ' | sed 's//\n/' | grep '^' 
> > records.xml . 
> 
> 3. Check out the Iota library, https://github.com/thebusby/iota/ . I often 
> use it for reducing over 100s of GBs of text data. It does what Jozef 
> suggests, and makes a text file a foldable collection.
> 
> 4. While pmap is great for advertising the power of Clojure, it's likely safe 
> to say that it should be ignored if you're actually looking for performance. 
> 
> 
> Hope this helps,
> Alan Busby
> 
> 
> 
> 
> -- 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.



Re: Parallelising over a lazy sequence - request for help

2013-09-30 Thread Alan Busby
Sorry to jump in, but I thought it worthwhile to add a couple of points
(apologies for being brief):

1. Reducers work fine with data much larger than memory, you just need to
mmap() the data you're working with so Clojure thinks everything is in
memory when it isn't. Reducer access is fairly sequential, not random, so
spinning disks work great here.

2. A 40GB XML file is very often many many smaller XML documents aggregated
together. It's often faster to separate each document into its own line
(via various UNIX tools) and parse each line separately. I typically do
something like $ zcat bigxml.gz | tr '\n' ' ' | sed 's//\n/' |
grep '^' > records.xml .
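Spelled out with a hypothetical <record> element name (the real tag depends on the document, and GNU sed is assumed for the \n in the replacement), the pipeline has this shape:

```shell
# Hypothetical <record> tag: split the concatenated records so each one
# starts on its own line, then keep only the record lines.
# GNU sed is assumed (it interprets \n in the replacement as a newline).
zcat bigxml.gz \
  | tr '\n' ' ' \
  | sed 's|<record|\n<record|g' \
  | grep '^<record' \
  > records.xml
```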

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often
use it for reducing over 100s of GBs of text data. It does what Jozef
suggests, and makes a text file a foldable collection.
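The basic shape is something like the following (a sketch from memory of Iota's documented API - iota/vec memory-maps the file and returns a foldable vector of lines, with nil for blank lines - so verify against the README before relying on it):

```clojure
;; Sketch assuming Iota's API as I recall it; the word-splitting and
;; merge logic are illustrative.
(require 'iota
         '[clojure.core.reducers :as r]
         '[clojure.string :as str])

(defn word-count [path]
  (->> (iota/vec path)                   ; memory-mapped, foldable
       (r/filter identity)               ; drop the nils for blank lines
       (r/mapcat #(str/split % #"\s+"))
       (r/fold (r/monoid (partial merge-with +) hash-map)
               (fn [m w] (assoc m w (inc (get m w 0)))))))
```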

4. While pmap is great for advertising the power of Clojure, it's likely
safe to say that it should be ignored if you're actually looking for
performance.


Hope this helps,
Alan Busby



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Brian Craft
On the other hand, it is 2013, not 2003. 40GB is small in terms of modern 
hardware. Terabyte-RAM servers have been available for a while, at prices 
within the reach of many projects. "Large" data in this decade is measured 
in petabytes, at least.


On Sunday, September 29, 2013 5:13:14 PM UTC-7, Paul Mooser wrote:
>
> Thanks - when I said "small", I was referring to the fact that your tests 
> were using the first 1 pages, as opposed to the entire data dump. Sorry 
> if I was unclear or misunderstood. 
>
> On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote:
>>
>> The dataset I'm using is a Wikipedia dump, which hardly counts as "small" 
>> :-)
>>
>



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Thanks - when I said "small", I was referring to the fact that your tests 
were using the first 1 pages, as opposed to the entire data dump. Sorry 
if I was unclear or misunderstood. 

On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote:
>
> The dataset I'm using is a Wikipedia dump, which hardly counts as "small" 
> :-)
>



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Butcher
On 29 Sep 2013, at 22:58, Paul Mooser  wrote:

> Paul, is there any easy way to get the (small) dataset you're working with, 
> so we can run your actual code against the same data?

The dataset I'm using is a Wikipedia dump, which hardly counts as "small" :-)

Having said that, the first couple of million lines is all you need to 
reproduce the results I'm getting, which you can download with:

curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 \
  | bunzip2 | head -n 200 > enwiki-short.xml

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Paul, is there any easy way to get the (small) dataset you're working with, 
so we can run your actual code against the same data?

On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote:
>
>
> The example counts the words contained within a Wikipedia dump. It should 
> respond well to parallelisation (I have Java and Scala versions that 
> perform excellently) but I've been incapable of coming up with a nice 
> solution in Clojure.
>
>



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Stuart Halloway
To be clear, I don't object to the approach, only to naming it "fold"
and/or tying it to interfaces related to folding.

Stu


On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher  wrote:

> On 28 Sep 2013, at 22:00, Alex Miller  wrote:
>
> Reducers (and fork/join in general) are best suited for fine-grained
> computational parallelism on in-memory data. The problem in question
> involves processing more data than will fit in memory.
>
> So the question is then what is the best way to parallelize computation
> over the stream. There are many ways to do so - pmap is one easy mechanism
> to get parallelism over a lazy sequence but the chunking behavior of pmap
> can easily be non-optimal for particular use cases. Stu's other solution
> (prepare-with-partition-then-fold) reads chunks of data into memory, then
> uses f/j over the top of those chunks.
>
>
> Yes - I'm aware that that's how Stu's solution works. However when I use
> that in my example, the performance sucks. As I said in my earlier message,
> I believe that the reason for this is the additional copying that it
> involves.
>
> To date, I have exactly one implementation that has good parallel
> performance - the one that uses foldable-seq (which is 3x faster than the
> sequential version). Everything else I've tried performs worse than the
> sequential version.
>
> Yet another possible solution is to use a pool of compute threads and to
> read from the stream, give those compute tasks to the pool to execute in a
> background thread, then reassemble the results at the end (or in a separate
> thread). The balance in this approach is to make the tasks the right
> "chunkiness". Using a buffered queue is one technique to avoid having your
> input reader overload the capacity of the processing system.
>
>
> That's basically *exactly* what the foldable-seq implementation does.
>
> I would also mention that using transients as you build input collections
> will be a big win.
>
>
> I'm sure that that's true - but the foldable-seq implementation is the one
> that would gain most from that, and that's the one that already runs 3x
> faster than the sequential implementation. But it's also the one that Stu
> objects to. What I'm trying to find is an implementation that Stu doesn't
> object to, but is faster than the sequential version of the algorithm.
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 22:00, Alex Miller  wrote:

> Reducers (and fork/join in general) are best suited for fine-grained 
> computational parallelism on in-memory data. The problem in question involves 
> processing more data than will fit in memory.
> 
> So the question is then what is the best way to parallelize computation over 
> the stream. There are many ways to do so - pmap is one easy mechanism to get 
> parallelism over a lazy sequence but the chunking behavior of pmap can easily 
> be non-optimal for particular use cases. Stu's other solution 
> (prepare-with-partition-then-fold) reads chunks of data into memory, then 
> uses f/j over the top of those chunks. 

Yes - I'm aware that that's how Stu's solution works. However, when I use that 
in my example, the performance sucks. As I said in my earlier message, I 
believe that the reason for this is the additional copying that it involves.

To date, I have exactly one implementation that has good parallel performance - 
the one that uses foldable-seq (which is 3x faster than the sequential 
version). Everything else I've tried performs worse than the sequential version.

> Yet another possible solution is to use a pool of compute threads and to read 
> from the stream, give those compute tasks to the pool to execute in a 
> background thread, then reassemble the results at the end (or in a separate 
> thread). The balance in this approach is to make the tasks the right 
> "chunkiness". Using a buffered queue is one technique to avoid having your 
> input reader overload the capacity of the processing system.

That's basically *exactly* what the foldable-seq implementation does.

> I would also mention that using transients as you build input collections 
> will be a big win. 

I'm sure that that's true - but the foldable-seq implementation is the one that 
would gain most from that, and that's the one that already runs 3x faster than 
the sequential implementation. But it's also the one that Stu objects to. What 
I'm trying to find is an implementation that Stu doesn't object to, but is 
faster than the sequential version of the algorithm.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid 
of the pause at the end of the pmap solution, and the -server argument made a 
very slight across-the-board performance improvement. But neither of them 
fundamentally change the basic result (that the implementation that uses 
foldable-seq gives a 3x speedup and the other implementations are slower than 
the sequential version).

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 21:47, Alex Miller  wrote:

> For your timings, I would also strongly recommend altering your project.clj 
> to force the -server hotspot:
> 
>   :jvm-opts ^:replace ["-Xmx1g" "-server"  ... and whatever else you want 
> here ... ]
> 
> By default lein will use tiered compilation to optimize repl startup, which 
> is not what you want for timing.
> 
> And I second Andy's (shutdown-agents) advice. 



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Couldn't your last possible solution be implemented on top of the f/j pool
instead? Is it possible to beat f/j pool performance with an ad-hoc thread
pool in situations where there are thousands of tasks?

JW


On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller  wrote:

> Reducers (and fork/join in general) are best suited for fine-grained
> computational parallelism on in-memory data. The problem in question
> involves processing more data than will fit in memory.
>
> So the question is then what is the best way to parallelize computation
> over the stream. There are many ways to do so - pmap is one easy mechanism
> to get parallelism over a lazy sequence but the chunking behavior of pmap
> can easily be non-optimal for particular use cases. Stu's other solution
> (prepare-with-partition-then-fold) reads chunks of data into memory, then
> uses f/j over the top of those chunks.
>
> Yet another possible solution is to use a pool of compute threads and to
> read from the stream, give those compute tasks to the pool to execute in a
> background thread, then reassemble the results at the end (or in a separate
> thread). The balance in this approach is to make the tasks the right
> "chunkiness". Using a buffered queue is one technique to avoid having your
> input reader overload the capacity of the processing system.
>
> I would also mention that using transients as you build input collections
> will be a big win.
>
> Alex
>
> On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:
>>
>> I would go a bit further and suggest that you do not use sequences
>> at all and work only with reducible/foldable collections. Make an input
>> reader which returns a foldable collection and you will have the most
>> performant solution. The issue with holding onto the head is being worked
>> on right now, see
>> http://dev.clojure.org/jira/browse/CLJ-1250
>>
>> JW
>>
>>
>> On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
>>>
>>> On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:
>>>
>>> I have posted an example that shows partition-then-fold at
>>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
>>> .
>>>
>>> I would be curious to know how this approach performs with your data.
>>>  With the generated data I used, the partition+fold and partition+pmap
>>> approaches both used most of my cores and had similar perf.
>>>
>>>
>>> Hey Stuart,
>>>
>>> Thanks for getting back to me.
>>>
>>> I've updated the code for my word count example on GitHub with (I
>>> believe) something that works along the lines you suggest:
>>>
>>> https://github.com/paulbutcher/parallel-word-count
>>>
>>> Here are some sample runs on my machine (a 4-core retina MacBook Pro).
>>> Each of these runs counts the words in the first 1 pages of Wikipedia:
>>>
>>> $ lein run 1 ~/enwiki.xml sequential
>>> "Elapsed time: 23630.443 msecs"
>>> $ lein run 1 ~/enwiki.xml fold
>>> "Elapsed time: 8697.79 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 1
>>> "Elapsed time: 27393.703 msecs"
>>> $ lein run 1 ~/enwiki.xml pthenf 1
>>> "Elapsed time: 37263.193 msecs"
>>>
>>>
>>> As you can see, the foldable-seq version gives an almost 3x speedup
>>> relative to the sequential version, and both the partition-then-pmap and
>>> partition-then-fold versions are significantly slower.
>>>
>>> The last argument for the partition-then-pmap and partition-then-fold
>>> versions is a partition size. I've tried various different sizes with no
>>> material effect:
>>>
>>> $ lein run 1 ~/enwiki.xml pthenf 1000
>>> "Elapsed time: 43187.385 msecs"
>>> $ lein run 1 ~/enwiki.xml pthenf 10
>>> "Elapsed time: 35702.651 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 1000
>>> "Elapsed time: 34087.314 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 10
>>> "Elapsed time: 47340.969 msecs"
>>>
>>>
>>> The performance of the partition-then-pmap version is actually much
>>> worse than the numbers above suggest. There's something very weird going on
>>> with (I guess) garbage collection - it takes a *long* time to finish after
>>> printing the elapsed time and the performance is completely pathological
>>> with larger page counts.
>>>
>>> Bottom line: the only way I've been able to obtain any kind of speedup
>>> remains foldable-seq.
>>>
>>> I'd be very grateful indeed if you could take a look at how I've
>>> implemented partition-then-fold to make sure that I've correctly captured
>>> your intent. Or if you have any suggestions for anything else that might
>>> work, or to explain the poor performance of partition-then-pmap and
>>> partition-then-fold.
>>>
>>> My guess is that the problem with partition-then-fold is the copying
>>> that's going on during the (into

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
I am hoping that this will be fixed for 1.6 but no one is actually 
"working" on it afaik. If someone wants to take it on, I would GREATLY 
appreciate a patch on this ticket (must be a contributor of course).

On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote:
>
> On 28 Sep 2013, at 17:14, Jozef Wagner wrote:
>
> I would go a bit further and suggest that you do not use sequences at 
> all and work only with reducible/foldable collections. Make an input reader 
> which returns a foldable collection and you will have the most performant 
> solution. The issue with holding onto the head is being worked on right 
> now, see http://dev.clojure.org/jira/browse/CLJ-1250
>
>
> That's fantastic news Jozef - is there any idea when that might be fixed? 
> I can see that it's been triaged, but I'm not sure what exactly that means 
> when it comes to the Clojure dev process?
>
> Could you expand on exactly what you mean when you say "an input reader 
> which returns a foldable collection"? Bear in mind that the problem I'm 
> trying to solve involves a 40GB Wikipedia dump, which clearly can't all be 
> in RAM at one time.
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: pa...@paulbutcher.com 
> AIM: paulrabutcher
> Skype: paulrabutcher
>  
>



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
Reducers (and fork/join in general) are best suited for fine-grained 
computational parallelism on in-memory data. The problem in question 
involves processing more data than will fit in memory.

So the question is then what is the best way to parallelize computation 
over the stream. There are many ways to do so - pmap is one easy mechanism 
to get parallelism over a lazy sequence but the chunking behavior of pmap 
can easily be non-optimal for particular use cases. Stu's other solution 
(prepare-with-partition-then-fold) reads chunks of data into memory, then 
uses f/j over the top of those chunks. 
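In outline, that approach looks something like this (my sketch here, not Stu's actual code; the reducing/combining functions are illustrative):

```clojure
;; Rough shape of partition-then-fold: realize each chunk of the lazy seq
;; into a vector, fold it with fork/join, then merge the per-chunk results.
(require '[clojure.core.reducers :as r])

(defn partition-then-fold
  [chunk-size reducef combinef coll]
  (->> (partition-all chunk-size coll)
       (map #(r/fold combinef reducef (vec %)))  ; (vec %) copies the chunk
       (reduce combinef)))

;; e.g. word counts over a seq of words:
;; (partition-then-fold 4096
;;   (fn [m w] (assoc m w (inc (get m w 0))))
;;   (r/monoid (partial merge-with +) hash-map)
;;   words)
```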

Yet another possible solution is to use a pool of compute threads and to 
read from the stream, give those compute tasks to the pool to execute in a 
background thread, then reassemble the results at the end (or in a separate 
thread). The balance in this approach is to make the tasks the right 
"chunkiness". Using a buffered queue is one technique to avoid having your 
input reader overload the capacity of the processing system.
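A stripped-down sketch of that idea (illustrative names; a production version would feed the pool from a bounded queue to get the backpressure described above, which this elides by submitting all chunks up front):

```clojure
;; Pool-of-compute-threads sketch: run f over fixed-size chunks of coll
;; on n threads, then merge the per-chunk results with merge-fn.
(import '[java.util.concurrent Executors Callable])

(defn pooled-process
  [n chunk-size f merge-fn coll]
  (let [pool    (Executors/newFixedThreadPool n)
        futures (doall (for [chunk (partition-all chunk-size coll)]
                         (.submit pool ^Callable (fn [] (f chunk)))))]
    (try
      (reduce merge-fn (map deref futures))   ; deref works on j.u.c.Future
      (finally (.shutdown pool)))))

;; e.g. (pooled-process 4 1024 frequencies (partial merge-with +) words)
```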

I would also mention that using transients as you build input collections 
will be a big win. 
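For example, building the per-chunk word-count map with a transient (sketch):

```clojure
;; Transient-based frequencies: mutate a transient map while building,
;; then freeze it with persistent! before returning.
(defn frequencies!
  [words]
  (persistent!
    (reduce (fn [m w] (assoc! m w (inc (get m w 0))))
            (transient {})
            words)))
```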

Alex

On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:
>
> I would go a bit further and suggest that you do not use sequences at 
> all and work only with reducible/foldable collections. Make an input reader 
> which returns a foldable collection and you will have the most performant 
> solution. The issue with holding onto the head is being worked on right 
> now, see http://dev.clojure.org/jira/browse/CLJ-1250
>
> JW
>
> On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
>>
>> On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:
>>
>> I have posted an example that shows partition-then-fold at 
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
>> .
>>
>> I would be curious to know how this approach performs with your data. 
>>  With the generated data I used, the partition+fold and partition+pmap 
>> approaches both used most of my cores and had similar perf.
>>
>>
>> Hey Stuart, 
>>
>> Thanks for getting back to me.
>>
>> I've updated the code for my word count example on GitHub with (I 
>> believe) something that works along the lines you suggest:
>>
>> https://github.com/paulbutcher/parallel-word-count
>>
>> Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
>> Each of these runs counts the words in the first 1 pages of Wikipedia:
>>
>> $ lein run 1 ~/enwiki.xml sequential
>> "Elapsed time: 23630.443 msecs"
>> $ lein run 1 ~/enwiki.xml fold
>> "Elapsed time: 8697.79 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 1
>> "Elapsed time: 27393.703 msecs"
>> $ lein run 1 ~/enwiki.xml pthenf 1
>> "Elapsed time: 37263.193 msecs"
>>
>>
>> As you can see, the foldable-seq version gives an almost 3x speedup 
>> relative to the sequential version, and both the partition-then-pmap and 
>> partition-then-fold versions are significantly slower.
>>
>> The last argument for the partition-then-pmap and partition-then-fold 
>> versions is a partition size. I've tried various different sizes with no 
>> material effect:
>>
>> $ lein run 1 ~/enwiki.xml pthenf 1000
>> "Elapsed time: 43187.385 msecs"
>> $ lein run 1 ~/enwiki.xml pthenf 10
>> "Elapsed time: 35702.651 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 1000
>> "Elapsed time: 34087.314 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 10
>> "Elapsed time: 47340.969 msecs"
>>
>>
>> The performance of the partition-then-pmap version is actually much worse 
>> than the numbers above suggest. There's something very weird going on with 
>> (I guess) garbage collection - it takes a *long* time to finish after 
>> printing the elapsed time and the performance is completely pathological 
>> with larger page counts.
>>
>> Bottom line: the only way I've been able to obtain any kind of speedup 
>> remains foldable-seq.
>>
>> I'd be very grateful indeed if you could take a look at how I've 
>> implemented partition-then-fold to make sure that I've correctly captured 
>> your intent. Or if you have any suggestions for anything else that might 
>> work, or to explain the poor performance of partition-then-pmap and 
>> partition-then-fold.
>>
>> My guess is that the problem with partition-then-fold is the copying 
>> that's going on during the (into [] %). I can see that it is operating in 
>> parallel because the number of cores in use goes up, but the net effect is 
>> an overall slowdown rather than a speedup.
>>
>> That it performs worse than foldable-seq isn't surprising to me, given 
>> that it introduces an unnecessary copy.
>>
>> I still think that it's a crying shame to disallow folding over sequences 
>> - as the above shows, the gains both in performance and programming ease 
>> are significant, and it would only take a small modification to the 
>> reducers API to fix the holding-onto-head problem. What would be the 
>> downside of making this modification and allowing foldable sequences?

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
For your timings, I would also strongly recommend altering your project.clj 
to force the -server hotspot:

  :jvm-opts ^:replace ["-Xmx1g" "-server"  ... and whatever else you want 
here ... ]

By default lein will use tiered compilation to optimize repl startup, which 
is not what you want for timing.

And I second Andy's (shutdown-agents) advice. 


On Saturday, September 28, 2013 3:41:20 AM UTC-5, Paul Butcher wrote:
>
> On 28 Sep 2013, at 00:27, Stuart Halloway > 
> wrote:
>
> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
> .
>
> I would be curious to know how this approach performs with your data. 
>  With the generated data I used, the partition+fold and partition+pmap 
> approaches both used most of my cores and had similar perf.
>
>
> Hey Stuart, 
>
> Thanks for getting back to me.
>
> I've updated the code for my word count example on GitHub with (I believe) 
> something that works along the lines you suggest:
>
> https://github.com/paulbutcher/parallel-word-count
>
> Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
> Each of these runs counts the words in the first 1 pages of Wikipedia:
>
> $ lein run 1 ~/enwiki.xml sequential
> "Elapsed time: 23630.443 msecs"
> $ lein run 1 ~/enwiki.xml fold
> "Elapsed time: 8697.79 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1
> "Elapsed time: 27393.703 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 1
> "Elapsed time: 37263.193 msecs"
>
>
> As you can see, the foldable-seq version gives an almost 3x speedup 
> relative to the sequential version, and both the partition-then-pmap and 
> partition-then-fold versions are significantly slower.
>
> The last argument for the partition-then-pmap and partition-then-fold 
> versions is a partition size. I've tried various different sizes with no 
> material effect:
>
> $ lein run 1 ~/enwiki.xml pthenf 1000
> "Elapsed time: 43187.385 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 10
> "Elapsed time: 35702.651 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1000
> "Elapsed time: 34087.314 msecs"
> $ lein run 1 ~/enwiki.xml pmap 10
> "Elapsed time: 47340.969 msecs"
>
>
> The performance of the partition-then-pmap version is actually much worse 
> than the numbers above suggest. There's something very weird going on with 
> (I guess) garbage collection - it takes a *long* time to finish after 
> printing the elapsed time and the performance is completely pathological 
> with larger page counts.
>
> Bottom line: the only way I've been able to obtain any kind of speedup 
> remains foldable-seq.
>
> I'd be very grateful indeed if you could take a look at how I've 
> implemented partition-then-fold to make sure that I've correctly captured 
> your intent. Or if you have any suggestions for anything else that might 
> work, or to explain the poor performance of partition-then-pmap and 
> partition-then-fold.
>
> My guess is that the problem with partition-then-fold is the copying 
> that's going on during the (into [] %). I can see that it is operating in 
> parallel because the number of cores in use goes up, but the net effect is 
> an overall slowdown rather than a speedup.
>
> That it performs worse than foldable-seq isn't surprising to me, given 
> that it introduces an unnecessary copy.
>
> I still think that it's a crying shame to disallow folding over sequences 
> - as the above shows, the gains both in performance and programming ease 
> are significant, and it would only take a small modification to the 
> reducers API to fix the holding-onto-head problem. What would be the 
> downside of making this modification and allowing foldable sequences?
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: pa...@paulbutcher.com 
> AIM: paulrabutcher
> Skype: paulrabutcher
>  
> On 28 Sep 2013, at 00:27, Stuart Halloway > 
> wrote:
>
> Hi Paul,
>
> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
> .
>
> I would be curious to know how this approach performs with your data. 
>  With the generated data I used, the partition+fold and partition+pmap 
> approaches both used most of my cores and had similar perf.
>
> Enjoying your book!
>
> Stu
>
>
> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher 
> 
> > wrote:
>
>> I'm currently working on a book on concurrent/parallel development for 
>> The Pragmatic Programmers. One of the subjects I'm covering is parallel 
>> programming in Clojure, but I've hit a roadblock with one of the examples. 
>> I'm hoping that I can get some help to work through it here.
>>
>> The example counts the words contained within a Wikipedia dump. It should 
>> respond 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 19:51, Jozef Wagner  wrote:

> Anyway, I think the bottleneck in your code is at 
> https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9
>  Instead of creating new persistent map for each word, you should use a 
> transient here.

I would love it if that were true. However the code you reference is from the 
algorithm that uses foldable-seq and runs quickly (i.e. not the one that has 
the performance problem).

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
If a Clojure ticket is triaged, it means that one of the Clojure screeners
believes the ticket's description describes a real issue with Clojure that
ought to be changed in some way, and would like Rich Hickey to look at it
and see whether he agrees.  If he does, it becomes vetted.  A diagram of
the process that Clojure tickets go through is shown at the link below (you
will have to scroll down a bit to see it):

http://dev.clojure.org/display/community/JIRA+workflow

If you are curious to know about changes in any particular ticket's state
as they occur, create an account on JIRA [1] and click the "Watch" link
when viewing that ticket.  You will be emailed about changes in its state.

The soonest that any ticket might become part of a modified version of
Clojure released as a JAR file is when Clojure 1.6.0 is released.  If
forced to guess, I would put a small bet on that happening in the first half
of 2014 some time, but I have no actual knowledge from those deciding when
that will happen.  In the past it seems to have been based more upon when
certain features were added than on a particular amount of time passing.
Until the release, there is no promise that any particular ticket will or
will not become part of the release.

[1] http://dev.clojure.org/jira/secure/Signup!default.jspa

Andy



On Sat, Sep 28, 2013 at 9:24 AM, Paul Butcher  wrote:

> On 28 Sep 2013, at 17:14, Jozef Wagner  wrote:
>
> I would go a bit further and suggest that you do not use sequences at
> all and work only with reducible/foldable collections. Make an input reader
> which returns a foldable collection and you will have the most performant
> solution. The issue with holding onto the head is being worked on right
> now, see http://dev.clojure.org/jira/browse/CLJ-1250
>
>
> That's fantastic news Jozef - is there any idea when that might be fixed?
> I can see that it's been triaged, but I'm not sure what exactly that means
> when it comes to the Clojure dev process?
>
> Could you expand on exactly what you mean when you say "an input reader
> which returns a foldable collection"? Bear in mind that the problem I'm
> trying to solve involves a 40GB Wikipedia dump, which clearly can't all be
> in RAM at one time.
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
>



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Or even better, use Guava's Multiset there...
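For reference, a hedged sketch of what that might look like from Clojure
(assumes Guava is on the classpath, e.g. a com.google.guava/guava dependency
in project.clj):

```clojure
(import '[com.google.common.collect HashMultiset])

;; A Multiset is a mutable bag: .add increments a count in place, avoiding
;; a new persistent map per word. Being mutable, it must stay confined to
;; one thread (or be built per-partition and merged afterwards).
(defn count-words-multiset [words]
  (let [^HashMultiset ms (HashMultiset/create)]
    (doseq [w words]
      (.add ms w))
    ms))

(.count (count-words-multiset ["a" "b" "a"]) "a")
;; => 2
```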

On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote:
>
> Well it should be possible to implement a foldseq variant which takes a 
> reducible collection as an input. This would speed things up, as you don't 
> create so much garbage with reducers. An XML parser which produces a 
> reducible collection will be a bit harder :). 
>
> Anyway, I think the bottleneck in your code is at 
> https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 
> Instead of creating a new persistent map for each word, you should use a 
> transient here.
>
> JW
>
> On Saturday, September 28, 2013 7:12:08 PM UTC+2, Paul Butcher wrote:
>>
>> On 28 Sep 2013, at 17:42, Jozef Wagner  wrote:
>>
>> I mean that you should forget about lazy sequences and sequences in 
>> general, if you want cutting-edge performance with reducers. 
>> Example of reducible slurp, https://gist.github.com/wagjo/6743885 , does 
>> not hold onto the head.
>>
>>
>> OK - I buy that logic. But I'm unsure whether it can be applied to the 
>> case that I'm interested in.
>>
>> Firstly, I'm dealing with an XML file, so I need to parse it. Right now 
>> I'm using data.xml/parse to do so, which returns a lazy sequence. So, short 
>> of writing my own XML parser, I have to go via a lazy sequence at some 
>> point along the way?
>>
>> Secondly, the primary point of this is to exploit parallelism, not 
>> reducers per se. So supporting CollReduce isn't enough, I also need to 
>> support CollFold. Which is exactly what foldable-seq does - I'm not sure 
>> how that differs from what you're proposing?
>>
>> Am I missing something?
>>
>> --
>> paul.butcher->msgCount++
>>
>> Snetterton, Castle Combe, Cadwell Park...
>> Who says I have a one track mind?
>>
>> http://www.paulbutcher.com/
>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>> MSN: pa...@paulbutcher.com
>> AIM: paulrabutcher
>> Skype: paulrabutcher
>>  
>>



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Well it should be possible to implement a foldseq variant which takes a 
reducible collection as an input. This would speed things up, as you don't 
create so much garbage with reducers. An XML parser which produces a reducible 
collection will be a bit harder :). 

Anyway, I think the bottleneck in your code is at 
https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9
 
Instead of creating a new persistent map for each word, you should use a 
transient here.
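A sketch of the suggested change, assuming the repo's counting fn is roughly
a reduce that assoc's each word into a persistent map:

```clojure
;; Word counting with a transient map: assoc! mutates the (thread-confined)
;; transient in place, and persistent! freezes it once at the end, instead
;; of allocating a new persistent map on every word.
(defn count-words-transient [words]
  (persistent!
    (reduce (fn [counts word]
              (assoc! counts word (inc (get counts word 0))))
            (transient {})
            words)))

(count-words-transient ["a" "b" "a"])
;; => {"a" 2, "b" 1}
```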

JW

On Saturday, September 28, 2013 7:12:08 PM UTC+2, Paul Butcher wrote:
>
> On 28 Sep 2013, at 17:42, Jozef Wagner > 
> wrote:
>
> I mean that you should forget about lazy sequences and sequences in 
> general, if you want cutting-edge performance with reducers. 
> Example of reducible slurp, https://gist.github.com/wagjo/6743885 , does 
> not hold onto the head.
>
>
> OK - I buy that logic. But I'm unsure whether it can be applied to the 
> case that I'm interested in.
>
> Firstly, I'm dealing with an XML file, so I need to parse it. Right now 
> I'm using data.xml/parse to do so, which returns a lazy sequence. So, short 
> of writing my own XML parser, I have to go via a lazy sequence at some 
> point along the way?
>
> Secondly, the primary point of this is to exploit parallelism, not 
> reducers per se. So supporting CollReduce isn't enough, I also need to 
> support CollFold. Which is exactly what foldable-seq does - I'm not sure 
> how that differs from what you're proposing?
>
> Am I missing something?
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: pa...@paulbutcher.com 
> AIM: paulrabutcher
> Skype: paulrabutcher
>  
>



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:42, Jozef Wagner  wrote:

> I mean that you should forget about lazy sequences and sequences in general, 
> if you want cutting-edge performance with reducers. Example of 
> reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto 
> the head.

OK - I buy that logic. But I'm unsure whether it can be applied to the case 
that I'm interested in.

Firstly, I'm dealing with an XML file, so I need to parse it. Right now I'm 
using data.xml/parse to do so, which returns a lazy sequence. So, short of 
writing my own XML parser, I have to go via a lazy sequence at some point along 
the way?

Secondly, the primary point of this is to exploit parallelism, not reducers 
per se. So supporting CollReduce isn't enough, I also need to support CollFold. 
Which is exactly what foldable-seq does - I'm not sure how that differs from 
what you're proposing?

Am I missing something?
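For context, the CollReduce/CollFold distinction shows up directly in r/fold:
fold only parallelizes over collections that implement CollFold (vectors and
hash maps do), and silently degrades to an ordinary sequential reduce for
anything that only supports CollReduce. A small sketch with an explicit
combine/reduce pair:

```clojure
(require '[clojure.core.reducers :as r])

;; fold takes a combine fn (to merge the results of parallel sub-reductions)
;; and a reduce fn. r/monoid packages the combine fn with its identity value.
;; Over a vector this runs on fork/join; over a lazy seq it falls back to a
;; plain serial reduce.
(defn word-freq [words]
  (r/fold (r/monoid (partial merge-with +) (constantly {}))
          (fn [counts word]
            (assoc counts word (inc (get counts word 0))))
          words))

(word-freq (vec (concat (repeat 3 "a") (repeat 2 "b"))))
;; => {"a" 3, "b" 2}
```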

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I mean that you should forget about lazy sequences and sequences in
general, if you want cutting-edge performance with reducers.
Example of reducible slurp, https://gist.github.com/wagjo/6743885 , does
not hold onto the head.
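In the same spirit as that gist, a hedged sketch of a reducible line source:
each reduce opens the resource afresh and no seq is ever materialized, so
there is no head to retain (open-reader is a hypothetical zero-arg fn
returning a fresh reader):

```clojure
(require '[clojure.core.protocols :as p])
(import '[java.io BufferedReader StringReader])

;; Reify CollReduce directly: lines are fed to the reducing fn one at a
;; time and never cached, so nothing holds onto previously-read input.
(defn reducible-lines [open-reader]
  (reify p/CollReduce
    (coll-reduce [this f]
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [^BufferedReader r (open-reader)]
        (loop [acc init]
          (if-let [line (.readLine r)]
            (let [acc' (f acc line)]
              (if (reduced? acc') @acc' (recur acc')))
            acc))))))

;; Usage with an in-memory source:
(reduce conj []
        (reducible-lines #(BufferedReader. (StringReader. "a\nb"))))
;; => ["a" "b"]
```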

JW



On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher  wrote:

> On 28 Sep 2013, at 17:14, Jozef Wagner  wrote:
>
> I would go a bit further and suggest that you do not use sequences at
> all and work only with reducible/foldable collections. Make an input reader
> which returns a foldable collection and you will have the most performant
> solution. The issue with holding onto the head is being worked on right
> now, see http://dev.clojure.org/jira/browse/CLJ-1250
>
>
> That's fantastic news Jozef - is there any idea when that might be fixed?
> I can see that it's been triaged, but I'm not sure what exactly that means
> when it comes to the Clojure dev process?
>
> Could you expand on exactly what you mean when you say "an input reader
> which returns a foldable collection"? Bear in mind that the problem I'm
> trying to solve involves a 40GB Wikipedia dump, which clearly can't all be
> in RAM at one time.
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
>



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:14, Jozef Wagner  wrote:

> I would go a bit further and suggest that you do not use sequences at 
> all and work only with reducible/foldable collections. Make an input reader 
> which returns a foldable collection and you will have the most performant 
> solution. The issue with holding onto the head is being worked on right now, 
> see http://dev.clojure.org/jira/browse/CLJ-1250


That's fantastic news Jozef - is there any idea when that might be fixed? I can 
see that it's been triaged, but I'm not sure what exactly that means when it 
comes to the Clojure dev process?

Could you expand on exactly what you mean when you say "an input reader which 
returns a foldable collection"? Bear in mind that the problem I'm trying to 
solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one 
time.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Ah - one mystery down. Thanks Andy!

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 16:26, Andy Fingerhut  wrote:

> I do not know about the most important parts of your performance 
> difficulties, but on a more trivial point I might be able to shed some light.
> 
> See the ClojureDocs page for pmap, which refers to the page for future, 
> linked below.  If you call (shutdown-agents) the 60-second wait to exit 
> should go away.
> 
> http://clojuredocs.org/clojure_core/clojure.core/future
> 
> Andy
> 
> Sent from my iPhone
> 
> On Sep 28, 2013, at 1:41 AM, Paul Butcher  wrote:
> 
>> On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:
>> 
>>> I have posted an example that shows partition-then-fold at 
>>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>>> 
>>> I would be curious to know how this approach performs with your data.  With 
>>> the generated data I used, the partition+fold and partition+pmap approaches 
>>> both used most of my cores and had similar perf.
>> 
>> Hey Stuart, 
>> 
>> Thanks for getting back to me.
>> 
>> I've updated the code for my word count example on GitHub with (I believe) 
>> something that works along the lines you suggest:
>> 
>> https://github.com/paulbutcher/parallel-word-count
>> 
>> Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each 
>> of these runs counts the words in the first 1 pages of Wikipedia:
>> 
>>> $ lein run 1 ~/enwiki.xml sequential
>>> "Elapsed time: 23630.443 msecs"
>>> $ lein run 1 ~/enwiki.xml fold
>>> "Elapsed time: 8697.79 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 1
>>> "Elapsed time: 27393.703 msecs"
>>> $ lein run 1 ~/enwiki.xml pthenf 1
>>> "Elapsed time: 37263.193 msecs"
>> 
>> As you can see, the foldable-seq version gives an almost 3x speedup 
>> relative to the sequential version, and both the partition-then-pmap and 
>> partition-then-fold versions are significantly slower.
>> 
>> The last argument for the partition-then-pmap and partition-then-fold 
>> versions is a partition size. I've tried various different sizes with no 
>> material effect:
>> 
>>> $ lein run 1 ~/enwiki.xml pthenf 1000
>>> "Elapsed time: 43187.385 msecs"
>>> $ lein run 1 ~/enwiki.xml pthenf 10
>>> "Elapsed time: 35702.651 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 1000
>>> "Elapsed time: 34087.314 msecs"
>>> $ lein run 1 ~/enwiki.xml pmap 10
>>> "Elapsed time: 47340.969 msecs"
>> 
>> The performance of the partition-then-pmap version is actually much worse 
>> than the numbers above suggest. There's something very weird going on with 
>> (I guess) garbage collection - it takes a *long* time to finish after 
>> printing the elapsed time and the performance is completely pathological 
>> with larger page counts.
>> 
>> Bottom line: the only way I've been able to obtain any kind of speedup 
>> remains foldable-seq.
>> 
>> I'd be very grateful indeed if you could take a look at how I've implemented 
>> partition-then-fold to make sure that I've correctly captured your intent. 
>> Or if you have any suggestions for anything else that might work, or to 
>> explain the poor performance of partition-then-pmap and partition-then-fold.
>> 
>> My guess is that the problem with partition-then-fold is the copying that's 
>> going on during the (into [] %). I can see that it is operating in parallel 
>> because the number of cores in use goes up, but the net effect is an overall 
>> slowdown rather than a speedup.
>> 
>> That it performs worse than foldable-seq isn't surprising to me, given that 
>> it introduces an unnecessary copy.
>> 
>> I still think that it's a crying shame to disallow folding over sequences - 
>> as the above shows, the gains both in performance and programming ease are 
>> significant, and it would only take a small modification to the reducers API 
>> to fix the holding-onto-head problem. What would be the downside of making 
>> this modification and allowing foldable sequences?
>> 
>> --
>> paul.butcher->msgCount++
>> 
>> Snetterton, Castle Combe, Cadwell Park...
>> Who says I have a one track mind?
>> 
>> http://www.paulbutcher.com/
>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>> MSN: p...@paulbutcher.com
>> AIM: paulrabutcher
>> Skype: paulrabutcher
>> 
>> On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:
>> 
>>> Hi Paul,
>>> 
>>> I have posted an example that shows partition-then-fold at 
>>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>>> 
>>> I would be curious to know how this approach performs with your data.  With 
>>> the generated data I used, the partition+fold and partition+pmap approaches 
>>> both used most of my cores

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I would go a bit further and suggest that you do not use sequences at 
all and work only with reducible/foldable collections. Make an input reader 
which returns a foldable collection and you will have the most performant 
solution. The issue with holding onto the head is being worked on right 
now, see http://dev.clojure.org/jira/browse/CLJ-1250

JW

On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
>
> On 28 Sep 2013, at 00:27, Stuart Halloway > 
> wrote:
>
> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
> .
>
> I would be curious to know how this approach performs with your data. 
>  With the generated data I used, the partition+fold and partition+pmap 
> approaches both used most of my cores and had similar perf.
>
>
> Hey Stuart, 
>
> Thanks for getting back to me.
>
> I've updated the code for my word count example on GitHub with (I believe) 
> something that works along the lines you suggest:
>
> https://github.com/paulbutcher/parallel-word-count
>
> Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
> Each of these runs counts the words in the first 1 pages of Wikipedia:
>
> $ lein run 1 ~/enwiki.xml sequential
> "Elapsed time: 23630.443 msecs"
> $ lein run 1 ~/enwiki.xml fold
> "Elapsed time: 8697.79 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1
> "Elapsed time: 27393.703 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 1
> "Elapsed time: 37263.193 msecs"
>
>
> As you can see, the foldable-seq version gives an almost 3x speedup 
> relative to the sequential version, and both the partition-then-pmap and 
> partition-then-fold versions are significantly slower.
>
> The last argument for the partition-then-pmap and partition-then-fold 
> versions is a partition size. I've tried various different sizes with no 
> material effect:
>
> $ lein run 1 ~/enwiki.xml pthenf 1000
> "Elapsed time: 43187.385 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 10
> "Elapsed time: 35702.651 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1000
> "Elapsed time: 34087.314 msecs"
> $ lein run 1 ~/enwiki.xml pmap 10
> "Elapsed time: 47340.969 msecs"
>
>
> The performance of the partition-then-pmap version is actually much worse 
> than the numbers above suggest. There's something very weird going on with 
> (I guess) garbage collection - it takes a *long* time to finish after 
> printing the elapsed time and the performance is completely pathological 
> with larger page counts.
>
> Bottom line: the only way I've been able to obtain any kind of speedup 
> remains foldable-seq.
>
> I'd be very grateful indeed if you could take a look at how I've 
> implemented partition-then-fold to make sure that I've correctly captured 
> your intent. Or if you have any suggestions for anything else that might 
> work, or to explain the poor performance of partition-then-pmap and 
> partition-then-fold.
>
> My guess is that the problem with partition-then-fold is the copying 
> that's going on during the (into [] %). I can see that it is operating in 
> parallel because the number of cores in use goes up, but the net effect is 
> an overall slowdown rather than a speedup.
>
> That it performs worse than foldable-seq isn't surprising to me, given 
> that it introduces an unnecessary copy.
>
> I still think that it's a crying shame to disallow folding over sequences 
> - as the above shows, the gains both in performance and programming ease 
> are significant, and it would only take a small modification to the 
> reducers API to fix the holding-onto-head problem. What would be the 
> downside of making this modification and allowing foldable sequences?
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: pa...@paulbutcher.com 
> AIM: paulrabutcher
> Skype: paulrabutcher
>  

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
I do not know about the most important parts of your performance difficulties, 
but on a more trivial point I might be able to shed some light.

See the ClojureDocs page for pmap, which refers to the page for future, linked 
below.  If you call (shutdown-agents) the 60-second wait to exit should go away.

http://clojuredocs.org/clojure_core/clojure.core/future
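
Concretely: pmap and future run work on thread pools backed by non-daemon threads, which linger for about a minute after the last task before the JVM can exit. A sketch of where the call goes (`count-words` is a hypothetical entry point, not code from Paul's repo):

```clojure
(defn -main [& args]
  (try
    ;; hypothetical parallel work that uses pmap/future under the hood
    (time (println (count-words args)))
    (finally
      ;; release the agent/future thread pools so the process exits promptly
      (shutdown-agents))))
```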

Andy

Sent from my iPhone

On Sep 28, 2013, at 1:41 AM, Paul Butcher  wrote:

> On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:
> 
>> I have posted an example that shows partition-then-fold at 
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>> 
>> I would be curious to know how this approach performs with your data.  With 
>> the generated data I used, the partition+fold and partition+pmap approaches 
>> both used most of my cores and had similar perf.
> 
> Hey Stuart, 
> 
> Thanks for getting back to me.
> 
> I've updated the code for my word count example on GitHub with (I believe) 
> something that works along the lines you suggest:
> 
> https://github.com/paulbutcher/parallel-word-count
> 
> Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each 
> of these runs counts the words in the first 1 pages of Wikipedia:
> 
>> $ lein run 1 ~/enwiki.xml sequential
>> "Elapsed time: 23630.443 msecs"
>> $ lein run 1 ~/enwiki.xml fold
>> "Elapsed time: 8697.79 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 1
>> "Elapsed time: 27393.703 msecs"
>> $ lein run 1 ~/enwiki.xml pthenf 1
>> "Elapsed time: 37263.193 msecs"
> 
> As you can see, the foldable-seq version gives an almost 3x speedup 
> relative to the sequential version, and both the partition-then-pmap and 
> partition-then-fold versions are significantly slower.
> 
> The last argument for the partition-then-pmap and partition-then-fold 
> versions is a partition size. I've tried various different sizes with no 
> material effect:
> 
>> $ lein run 1 ~/enwiki.xml pthenf 1000
>> "Elapsed time: 43187.385 msecs"
>> $ lein run 1 ~/enwiki.xml pthenf 10
>> "Elapsed time: 35702.651 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 1000
>> "Elapsed time: 34087.314 msecs"
>> $ lein run 1 ~/enwiki.xml pmap 10
>> "Elapsed time: 47340.969 msecs"
> 
> The performance of the partition-then-pmap version is actually much worse 
> than the numbers above suggest. There's something very weird going on with (I 
> guess) garbage collection - it takes a *long* time to finish after printing 
> the elapsed time and the performance is completely pathological with larger 
> page counts.
> 
> Bottom line: the only way I've been able to obtain any kind of speedup 
> remains foldable-seq.
> 
> I'd be very grateful indeed if you could take a look at how I've implemented 
> partition-then-fold to make sure that I've correctly captured your intent. Or 
> if you have any suggestions for anything else that might work, or to explain 
> the poor performance of partition-then-pmap and partition-then-fold.
> 
> My guess is that the problem with partition-then-fold is the copying that's 
> going on during the (into [] %). I can see that it is operating in parallel 
> because the number of cores in use goes up, but the net effect is an overall 
> slowdown rather than a speedup.
> 
> That it performs worse than foldable-seq isn't surprising to me, given that 
> it introduces an unnecessary copy.
> 
> I still think that it's a crying shame to disallow folding over sequences - 
> as the above shows, the gains both in performance and programming ease are 
> significant, and it would only take a small modification to the reducers API 
> to fix the holding-onto-head problem. What would be the downside of making 
> this modification and allowing foldable sequences?
> 
> --
> paul.butcher->msgCount++
> 
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
> 
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
> 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 01:22, Rich Morin  wrote:

>> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher  wrote:
>> I'm currently working on a book on concurrent/parallel development for The 
>> Pragmatic Programmers. ...
> 
> Ordered; PDF just arrived (:-).

Cool - very interested to hear your feedback once you've had a chance to read 
it.

> I don't know yet whether the book has anything like this, but I'd
> like to see a table that shows which concurrency and parallelism
> approaches are supported (and to what extent) by various languages.
> 
> So, actors, agents, queues, reducers (etc) would be on one axis;
> Clojure, Erlang, Go, and Ruby (etc) would be on the other.  I'd
> expect Clojure to have pretty complete coverage; Ruby, not so much.

That might make a good appendix or section in the (as yet unwritten) wrap-up 
chapter - thanks for the suggestion.

And yes, I suspect that you're right that Clojure will fare well in the 
comparison :-)

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher


-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 00:27, Stuart Halloway  wrote:

> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
> 
> I would be curious to know how this approach performs with your data.  With 
> the generated data I used, the partition+fold and partition+pmap approaches 
> both used most of my cores and had similar perf.

Hey Stuart, 

Thanks for getting back to me.

I've updated the code for my word count example on GitHub with (I believe) 
something that works along the lines you suggest:

https://github.com/paulbutcher/parallel-word-count

Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each of 
these runs counts the words in the first 1 pages of Wikipedia:

> $ lein run 1 ~/enwiki.xml sequential
> "Elapsed time: 23630.443 msecs"
> $ lein run 1 ~/enwiki.xml fold
> "Elapsed time: 8697.79 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1
> "Elapsed time: 27393.703 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 1
> "Elapsed time: 37263.193 msecs"

As you can see, the foldable-seq version gives an almost 3x speedup 
relative to the sequential version, and both the partition-then-pmap and 
partition-then-fold versions are significantly slower.

The last argument for the partition-then-pmap and partition-then-fold versions 
is a partition size. I've tried various different sizes with no material effect:

> $ lein run 1 ~/enwiki.xml pthenf 1000
> "Elapsed time: 43187.385 msecs"
> $ lein run 1 ~/enwiki.xml pthenf 10
> "Elapsed time: 35702.651 msecs"
> $ lein run 1 ~/enwiki.xml pmap 1000
> "Elapsed time: 34087.314 msecs"
> $ lein run 1 ~/enwiki.xml pmap 10
> "Elapsed time: 47340.969 msecs"

The performance of the partition-then-pmap version is actually much worse than 
the numbers above suggest. There's something very weird going on with (I guess) 
garbage collection - it takes a *long* time to finish after printing the 
elapsed time and the performance is completely pathological with larger page 
counts.

Bottom line: the only way I've been able to obtain any kind of speedup remains 
foldable-seq.

I'd be very grateful indeed if you could take a look at how I've implemented 
partition-then-fold to make sure that I've correctly captured your intent. Or 
if you have any suggestions for anything else that might work, or to explain 
the poor performance of partition-then-pmap and partition-then-fold.

My guess is that the problem with partition-then-fold is the copying that's 
going on during the (into [] %). I can see that it is operating in parallel 
because the number of cores in use goes up, but the net effect is an overall 
slowdown rather than a speedup.

That it performs worse than foldable-seq isn't surprising to me, given that it 
introduces an unnecessary copy.
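
A sketch of the partition-then-fold shape under discussion (not necessarily the repo's exact code): each lazy partition is poured into a vector with `(into [] ...)` solely so r/fold can split it in parallel, and that per-partition copy is the overhead suspected here.

```clojure
(require '[clojure.core.reducers :as r])

(defn frequencies-pthenf [words partition-size]
  (reduce (partial merge-with +)
          (map (fn [part]
                 (r/fold (partial merge-with +)
                         (fn [counts word]
                           (assoc counts word (inc (get counts word 0))))
                         (into [] part)))   ; copy: lazy seq -> foldable vector
               (partition-all partition-size words))))
```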

I still think that it's a crying shame to disallow folding over sequences - as 
the above shows, the gains both in performance and programming ease are 
significant, and it would only take a small modification to the reducers API to 
fix the holding-onto-head problem. What would be the downside of making this 
modification and allowing foldable sequences?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher


Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Rich Morin
> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher  wrote:
> I'm currently working on a book on concurrent/parallel development for The 
> Pragmatic Programmers. ...

Ordered; PDF just arrived (:-).


I don't know yet whether the book has anything like this, but I'd
like to see a table that shows which concurrency and parallelism
approaches are supported (and to what extent) by various languages.

So, actors, agents, queues, reducers (etc) would be on one axis;
Clojure, Erlang, Go, and Ruby (etc) would be on the other.  I'd
expect Clojure to have pretty complete coverage; Ruby, not so much.

-r

 -- 
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume r...@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841

Software system design, development, and documentation




Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Stuart Halloway
Hi Paul,

I have posted an example that shows partition-then-fold at
https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
.

I would be curious to know how this approach performs with your data.  With
the generated data I used, the partition+fold and partition+pmap approaches
both used most of my cores and had similar perf.

Enjoying your book!

Stu


On Sat, May 25, 2013 at 12:34 PM, Paul Butcher  wrote:

> I'm currently working on a book on concurrent/parallel development for The
> Pragmatic Programmers. One of the subjects I'm covering is parallel
> programming in Clojure, but I've hit a roadblock with one of the examples.
> I'm hoping that I can get some help to work through it here.
>
> The example counts the words contained within a Wikipedia dump. It should
> respond well to parallelisation (I have Java and Scala versions that
> perform excellently) but I've been incapable of coming up with a nice
> solution in Clojure.
>
> The code I'm working with is checked into GitHub:
>
> The basic sequential algorithm is:
>
> (frequencies (mapcat get-words pages))
>
>
> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on
> my MacBook Pro.
>
> One way to parallelise it is to create a parallel version of frequencies
> that uses reducers:
>
> (defn frequencies-fold [words]
>   (r/fold (partial merge-with +)
>   (fn [counts word] (assoc counts word (inc (get counts word 0))))
>   words))
>
>
> And sure enough, if I use that, along with the foldable-seq utility
> I posted about here a while ago, it runs in ~8s, almost a 3x speedup,
> not bad given that the parallel version is unable to use transients.
>
> Unfortunately, as they currently stand, reducers have a fatal flaw that
> means that, even with foldable-seq, they're basically useless with lazy
> sequences. Reducers always hold onto the head of the sequence they're
> given, so there's no way to use this approach for a
> complete Wikipedia dump (which runs to around 40GiB).
>
> So the other approach I've tried is to use pmap:
>
> (defn frequencies-pmap [words]
>   (reduce (partial merge-with +)
>     (pmap frequencies
>       (partition-all 1 words))))
>
>
> But, for reasons I don't understand, this performs dreadfully - taking
> ~26s, i.e. significantly slower than the sequential version.
>
> I've tried playing with different partition sizes without materially
> affecting the result.
>
> So, what I'm looking for is either:
>
> a) An explanation for why the pmap-based solution performs so badly
>
> b) A way to fix the "holding onto head" problem that's inherent within
> reducers.
>
> With the last of these in mind, it strikes me that the problem
> fundamentally arises from the desire for reducers to follow the same basic
> API as "normal" code. So:
>
> (reduce (filter ... (map ... coll)))
>
> becomes:
>
> (r/fold (r/filter ... (r/map ... coll)))
>
> A very small change to the reducers API - passing the collection to the
> reduce and/or fold - would avoid the problem:
>
> (r/fold (r/filter ... (r/map ...)) coll)
>
> Anyway - I'd be very grateful for any insight into either of the above
> questions. Or for suggestions for an alternative approach that might be
> more fruitful.
>
> Many thanks in advance,
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
>


Parallelising over a lazy sequence - request for help

2013-05-25 Thread Paul Butcher
I'm currently working on a book on concurrent/parallel development for The 
Pragmatic Programmers. One of the subjects I'm covering is parallel programming 
in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that 
I can get some help to work through it here.

The example counts the words contained within a Wikipedia dump. It should 
respond well to parallelisation (I have Java and Scala versions that perform 
excellently) but I've been incapable of coming up with a nice solution in 
Clojure.

The code I'm working with is checked into GitHub: 

The basic sequential algorithm is:

> (frequencies (mapcat get-words pages))


If I run that on the first 10k pages of the Wikipedia dump, it takes ~21s on my 
MacBook Pro.

One way to parallelise it is to create a parallel version of frequencies that 
uses reducers:

> (defn frequencies-fold [words]
>   (r/fold (partial merge-with +)
>   (fn [counts word] (assoc counts word (inc (get counts word 0))))
>   words))

And sure enough, if I use that, along with the foldable-seq utility I 
posted about here a while ago, it runs in ~8s, almost a 3x speedup, not bad 
given that the parallel version is unable to use transients.

Unfortunately, as they currently stand, reducers have a fatal flaw that means 
that, even with foldable-seq, they're basically useless with lazy sequences. 
Reducers always hold onto the head of the sequence they're given, so there's no 
way to use this approach for a complete Wikipedia dump (which runs to around 
40GiB).

So the other approach I've tried is to use pmap:

> (defn frequencies-pmap [words]
>   (reduce (partial merge-with +) 
>     (pmap frequencies
>       (partition-all 1 words))))

But, for reasons I don't understand, this performs dreadfully - taking ~26s, 
i.e. significantly slower than the sequential version.

I've tried playing with different partition sizes without materially affecting 
the result.

So, what I'm looking for is either:

a) An explanation for why the pmap-based solution performs so badly

b) A way to fix the "holding onto head" problem that's inherent within reducers.

With the last of these in mind, it strikes me that the problem fundamentally 
arises from the desire for reducers to follow the same basic API as "normal" 
code. So:

(reduce (filter ... (map ... coll)))

becomes:

(r/fold (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce 
and/or fold - would avoid the problem:

(r/fold (r/filter ... (r/map ...)) coll)
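
The reducers API already hints at this shape: r/map, r/filter, etc. are curried, so called without a collection they return transform functions that can be composed and applied to the source only at the last moment, with no reference to the sequence captured in the pipeline itself. A sketch:

```clojure
(require '[clojure.core.reducers :as r])

;; a reusable pipeline: no collection is captured here
(def xform (comp (r/filter odd?) (r/map inc)))

;; the (possibly one-shot) source is supplied only at the final call
(reduce + 0 (xform (range 10)))  ; inc each of 0..9, keep the odd ones, sum
```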

Anyway - I'd be very grateful for any insight into either of the above 
questions. Or for suggestions for an alternative approach that might be more 
fruitful.

Many thanks in advance,

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
