You should try transients if you're looking to fill collections quickly -
you might not even need to split up the work this way.
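For example, a minimal sketch of building a set through a transient (the helper name is mine, not from the thread):

```clojure
;; Build a set via a transient: conj! updates a mutable, single-owner
;; structure, and persistent! freezes it back into an ordinary set.
(defn into-set [coll]
  (persistent!
    (reduce conj! (transient #{}) coll)))
```

Note that (into #{} coll) already uses a transient under the hood, so this mainly matters when you are driving the reduce yourself.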
On Saturday, February 15, 2014 5:06:24 PM UTC-6, Jules wrote:
>
> Guys,
>
> I've been playing with reducers on and off for a while but have been
> frustrated because they don't seem to fit a particular usecase that I have
> in mind... specifically: getting as many associations into a hash-map as
> I can in as short a time as possible.
>
> My understanding of the reason for this is that reducers practice a divide
> and conquer strategy. The incoming sequence is divided up. Each
> sub-sequence is reduced into a sub-result (possibly in parallel) and then
> the sub-results are combined into the final outgoing result.
>
> Unfortunately, there does not seem to be any way of combining two
> hash-maps other than to read each entry from one and create a new
> corresponding association in the other. This means that each recombination
> in the above process essentially repeats most of the work already performed
> in the previous reduction stage.
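> For illustration, the reduce/combine shape with clojure.core.reducers
> looks roughly like this (the helper name is made up):

```clojure
(require '[clojure.core.reducers :as r])

;; fold reduces chunks of the vector in parallel with conj, then merges
;; the partial sets with the combine function -- here (r/monoid into
;; hash-set), i.e. `into`, which re-adds every entry of one partial
;; result into the other.
(defn parallel-set [v]
  (r/fold (r/monoid into hash-set) conj v))
```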
>
> Hash-sets are implemented via hash-maps, and are simpler to use to
> demonstrate this problem:
>
> user=> (def a (doall (range 10000000)))
> #'user/a
> user=> (def b (doall (range 5000000 15000000)))
> #'user/b
> user=> (time (def c (into #{} a)))
> "Elapsed time: 6319.392669 msecs"
> #'user/c
> user=> (time (def d (into #{} b)))
> "Elapsed time: 5389.805233 msecs"
> #'user/d
> user=> (time (def e (into c d)))
> "Elapsed time: 8486.032191 msecs"
> #'user/e
>
>
> In the example above, you can see that the reduction into hash-sets of two
> overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds. This
> stage can be carried out in parallel, i.e. the elapsed time for this stage
> would be 6.3 seconds - but we now have two hash-sets and we want one, so we
> have to combine them.
>
>
> user=> (time (def e (into c d)))
> "Elapsed time: 8486.032191 msecs"
> #'user/e
>
> As you can see, all the advantages of splitting the original sequence into
> 2 and processing the two halves in parallel are lost, since the
> recombination of their results takes 8.5 seconds - more than we saved by
> doing the reduction in parallel.
>
> So, what can we do about it ?
>
> I had a look at the code for PersistentHashMap (PersistentHashSet uses
> PersistentHashMap internally). I realised that it was possible to "splice"
> together the internal structure of two hash maps into a single one without
> repeating most of the work required to build one from scratch. So, I had a
> go at implementing it:
>
>
> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
> "Elapsed time: 3052.690911 msecs"
> #'user/f
>
> and:
>
> user=> (= e f)
> true
>
> Whilst this still adds 3 seconds to our time, that is roughly half of the
> 5.4 seconds we would have added had we executed the second reduction in
> serial rather than in parallel.
>
> This means that we can now reduce large datasets into sets/maps more
> quickly in parallel than we can in serial :-) As an added benefit, because
> splice reuses as much of the internal structure of both inputs as possible,
> its impact in terms of heap consumption and churn is smaller - although I
> think that a full implementation might add some Java-side code complexity.
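> Putting the two halves together, a hypothetical end-to-end sketch would
> look like this ('splice' only exists in my fork, so this assumes that
> build is on the classpath):

```clojure
;; Reduce the two halves in parallel on separate threads, then merge
;; the partial sets with splice instead of `into`.
;; NB: PersistentHashSet/splice does not exist in stock Clojure.
(defn parallel-into-set [xs ys]
  (let [fx (future (into #{} xs))
        fy (future (into #{} ys))]
    (clojure.lang.PersistentHashSet/splice @fx @fy)))
```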
>
> If you would like to give 'splice' a try, you will need to clone my
> fork of Clojure on GitHub:
>
> https://github.com/JulesGosnell/clojure
>
> Please let me know if you try out the code. I would be interested to hear
> if people think it is worth pursuing.
>
> I was also thinking that it should be possible to use a similar trick to
> quickly and cheaply split a map/set into [roughly] equal-sized pieces
> (assuming a good hash distribution). This would enable the use of a
> map/set as an input sequence for the parallel reduction process outlined
> above. Currently, I believe that only a vector can be used in this way. It
> would be harder to arrange for 'count' to be implemented efficiently on
> these sub-maps/sets, but this is not important for the reduction process.
>
> BTW - benchmarks were run on a 3.2 GHz Phenom II / clojure/master /
> openjdk-1.7.0_51 / Fedora 20 with min and max heap of 4 GB.
>
> regards,
>
>
>
> Jules
>
>
>
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en