Re: Parallelising over a lazy sequence - request for help

2013-10-02 Thread Paul Butcher
Alan, Apologies for the delayed reply - I remember Iota well (there was some cross-fertilisation between it and foldable-seq a few months back IIRC :-) Having said that, I don't think that Iota will help in my particular situation (although I'd be delighted to be proven wrong)? Given that the

Re: Parallelising over a lazy sequence - request for help

2013-09-30 Thread Alan Busby
Sorry to jump in, but I thought it worthwhile to add a couple points; (sorry for being brief) 1. Reducers work fine with data much larger than memory, you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential,

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Stuart Halloway
To be clear, I don't object to the approach, only to naming it fold and/or tying it to interfaces related to folding. Stu On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher p...@paulbutcher.com wrote: On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote: Reducers (and fork/join in

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data? On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote: The example counts the words contained within a Wikipedia dump. It should respond well to

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Butcher
On 29 Sep 2013, at 22:58, Paul Mooser taron...@gmail.com wrote: Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data? The dataset I'm using is a Wikipedia dump, which hardly counts as small :-) Having said that, the

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Thanks - when I said small, I was referring to the fact that your tests were using the first 1 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: The dataset I'm using is a Wikipedia

Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Brian Craft
On the other hand it is 2013, not 2003. 40G is small in terms of modern hardware. Terabyte RAM servers have been available for a while, at prices within the reach of many projects. Large data in this decade is measured in petabytes, at least. On Sunday, September 29, 2013 5:13:14 PM UTC-7,

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote: I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. I would be curious to know how this approach

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 01:22, Rich Morin r...@cfcl.com wrote: On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ... Ordered; PDF just arrived (:-). Cool - very interested

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should
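
A minimal sketch of the shutdown-agents point, assuming a hypothetical `total-length` function and `-main` entry point (neither is from Paul's code). `pmap` is built on `future`, whose thread pool keeps idle threads alive for about a minute, so the JVM lingers unless the pool is shut down explicitly:

```clojure
(defn total-length
  "Hypothetical example: sums the lengths of the given strings,
  computing each length in parallel with pmap."
  [strings]
  (reduce + (pmap count strings)))

(defn -main [& args]
  (println (total-length args))
  ;; pmap uses the same thread pool as future and send-off. Its idle,
  ;; non-daemon threads would otherwise delay JVM exit, so shut the
  ;; pool down once all parallel work is finished.
  (shutdown-agents))
```

Call `(shutdown-agents)` only once at the very end of the program; futures and agents cannot be used after it.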

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding onto the head is being worked on

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Ah - one mystery down. Thanks Andy! -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher On 28 Sep

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote: I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I mean that you should forget about lazy sequences and sequences in general, if you want to have cutting-edge performance with reducers. Example of reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto the head. JW On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher
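
To illustrate what a reducible input source can look like, here is a minimal sketch. It is not the code from the gist linked above; the name `reducible-lines` is hypothetical, and the sketch is only reducible (sequential), not foldable. It implements `clojure.core.protocols/CollReduce` directly, so `reduce` pulls lines straight from the reader and no lazy sequence exists whose head could be retained:

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

(defn reducible-lines
  "Hypothetical helper: returns an object that reduce can consume,
  yielding the lines of source (anything io/reader accepts).
  The reader is opened when the reduction starts and closed when it ends."
  [source]
  (reify p/CollReduce
    (coll-reduce [this f]
      ;; No initial value supplied: ask f for its identity value.
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader source)]
        (loop [acc init]
          (if-let [line (.readLine rdr)]
            (let [next-acc (f acc line)]
              ;; Honour early termination via (reduced x).
              (if (reduced? next-acc)
                @next-acc
                (recur next-acc)))
            acc))))))
```

Usage would look like `(reduce (fn [n _] (inc n)) 0 (reducible-lines "dump.txt"))`, where the file name is a placeholder.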

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:42, Jozef Wagner jozef.wag...@gmail.com wrote: I mean that you should forget about lazy sequences and sequences in general, if you want to have cutting-edge performance with reducers. Example of reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Well, it should be possible to implement a foldseq variant which takes a reducible collection as input. This would speed things up, as you don't create so much garbage with reducers. An XML parser which produces a reducible collection will be a bit harder :). Anyway, I think the bottleneck in your

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Or even better, use Guava's Multiset there... On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote: Well, it should be possible to implement a foldseq variant which takes a reducible collection as input. This would speed things up, as you don't create so much garbage with

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
If a Clojure ticket is triaged, it means that one of the Clojure screeners believes the ticket's description describes a real issue with Clojure that ought to be changed in some way, and would like Rich Hickey to look at it and see whether he agrees. If he does, it becomes vetted. A diagram of

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 19:51, Jozef Wagner jozef.wag...@gmail.com wrote: Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating new persistent map for each word, you should use a transient
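
A minimal sketch of the transient-map suggestion, assuming a hypothetical `count-words` function over a collection of word strings (the actual code at the linked line is not reproduced here). The map is mutated in place during the reduction and frozen once at the end, which avoids allocating a new persistent map per word:

```clojure
(defn count-words
  "Hypothetical example: returns a map from word to occurrence count.
  Accumulates into a transient map, then makes it persistent once."
  [words]
  (persistent!
   (reduce (fn [counts word]
             ;; assoc! may return a new transient, so always use its
             ;; return value rather than relying on in-place mutation.
             (assoc! counts word (inc (get counts word 0))))
           (transient {})
           words)))
```

This is the same technique that `clojure.core/frequencies` uses internally, so calling `frequencies` directly is a reasonable alternative.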

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
For your timings, I would also strongly recommend altering your project.clj to force the -server HotSpot compiler: :jvm-opts ^:replace ["-Xmx1g" "-server" ... and whatever else you want here ... ] By default lein will use tiered compilation to optimize REPL startup, which is not what you want for
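
For context, a sketch of where that option sits in a project.clj. The project name, version, Clojure version and `:main` namespace below are assumptions for illustration; only the `:jvm-opts` line comes from the message above:

```clojure
;; project.clj (hypothetical project coordinates, for illustration only)
(defproject wordcount "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  :main wordcount.core
  ;; ^:replace discards Leiningen's default JVM options instead of
  ;; appending to them, so the tiered-compilation defaults are dropped.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```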

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
I am hoping that this will be fixed for 1.6 but no one is actually working on it afaik. If someone wants to take it on, I would GREATLY appreciate a patch on this ticket (must be a contributor of course). On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote: On 28 Sep 2013,

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Can't your last possible solution rather be implemented on top of f/j pool? Is it possible to beat f/j pool performance with ad-hoc thread-pool in situations where there are thousands of tasks? JW On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller a...@puredanger.com wrote: Reducers (and fork/join

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid of the pause at the end of the pmap solution, and the -server argument made a very slight across-the-board performance improvement. But neither of them fundamentally changes the basic result (that the implementation

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote: Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is

Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Stuart Halloway
Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the
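
For readers without the linked file to hand, here is a rough sketch of the general partition-then-fold idea applied to word counting. It is not the code from the linked example; the names `merge-counts`, `add-word` and `partition-then-fold` are hypothetical. The lazy sequence is cut into chunks, each chunk is poured into a vector (which is foldable) and folded, and the per-chunk results are merged sequentially:

```clojure
(require '[clojure.core.reducers :as r])

(def merge-counts
  "Combining function for r/fold: merges two count maps by adding
  counts, and returns an empty map when called with no arguments."
  (r/monoid (partial merge-with +) hash-map))

(defn add-word
  "Reducing function: increments the count for word in counts."
  [counts word]
  (assoc counts word (inc (get counts word 0))))

(defn partition-then-fold
  "Counts words in a (possibly lazy) sequence, realising only
  chunk-size words at a time and folding each chunk as a vector."
  [chunk-size words]
  (->> (partition-all chunk-size words)
       (map (fn [chunk] (r/fold merge-counts add-word (vec chunk))))
       (reduce merge-counts {})))
```

The chunk size trades memory for parallelism: each chunk must fit in memory, and chunks at or below the default fold partition size of 512 elements are reduced sequentially.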

Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Rich Morin
On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ... Ordered; PDF just arrived (:-). I don't know yet whether the book has anything like this, but I'd like to see a

Parallelising over a lazy sequence - request for help

2013-05-25 Thread Paul Butcher
I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example