Re: fast parallel reduction into hash-set/map

Jules Mon, 17 Feb 2014 12:10:21 -0800

Glen,

I did start the implementation in Clojure, but had to move it under the 
skin of PersistentHashMap to achieve what I needed, so it is now written in 
Java and is part of PersistentHashMap... I don't think it would be 
practical to make it an add-on - but it would be nice :-). I'll keep it in 
mind.


Jules



On Monday, 17 February 2014 18:42:28 UTC, Glen Mailer wrote:
>
> Is there a specific part of this implementation which means it needs to 
> live in core?
>
> It would be cool to have this as a library that could be used with 
> existing versions of clojure (I have no idea if enough of the internals are 
> exposed to make this viable)
>
> Glen
>
> On Saturday, 15 February 2014 23:06:24 UTC, Jules wrote:
>>
>> Guys,
>>
>> I've been playing with reducers on and off for a while but have been 
>> frustrated because they don't seem to fit a particular usecase that I have 
>> in mind... specifically: getting as many associations into a hash-map as as 
>> I can in as short a time as possible.
>>
>> My understanding of the reason for this is that reducers practice a 
>> divide and conquer strategy. The incoming sequence is divided up. Each 
>> sub-sequence is reduced into a sub-result (possibly in parallel) and then 
>> the sub-results are combined into the the final outgoing result.
>>
>> Unfortunately, there does not seem to be a better way of combining two 
>> hash-maps other than to read each entry from one and create a new 
>> corresponding association in the other. This means that each recombination 
>> in the above process essentially repeats most of the work already performed 
>> in the previous reduction stage.
>>
>> Hash-sets are implemented via hash-maps, and simpler with which to 
>> demonstrate this problem:
>>
>> user=> (def a (doall (range 10000000)))
>> #'user/a
>> user=> (def b (doall (range 5000000 15000000)))
>> #'user/b
>> user=> (time (def c (into #{} a)))
>> "Elapsed time: 6319.392669 msecs"
>> #'user/c
>> user=> (time (def d (into #{} b)))
>> "Elapsed time: 5389.805233 msecs"
>> #'user/d
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>>
>> In the example above, you can see that the reduction into hash-sets of 
>> two overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds. 
>> This stage can be carried out in parallel i.e. time elapsed for this stage 
>> would be 6.3 seconds - but we now have two hash-sets and we want one, so we 
>> have to combine them.
>>
>>
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>> As you can see, all the advantages of splitting the original sequence 
>> into 2 and processing the two halves in parallel are lost since the 
>> recombination or their results takes 8.5 seconds - more than we saved by 
>> doing the reduction in parallel.
>>
>> So, what can we do about it ?
>>
>> I had a look at the code for PersistentHashMap (PersistentHashSet uses 
>> PersistantHashMap internally). I realised that it was possible to "splice" 
>> together the internal structure of two hash maps into a single one without 
>> repeating most of the work required to build one from scratch. So, I had a 
>> go at implementing it:
>>
>>
>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
>> "Elapsed time: 3052.690911 msecs"
>> #'user/f
>>
>> and:
>>
>> user=> (= e f)
>> true
>>
>> Whilst this is still adding 3 seconds to our time, that 3 seconds is half 
>> the time that we would have added had we executed the second reduction in 
>> serial, rather than in parallel.
>>
>> This means that we can now reduce large datasets into sets/maps more 
>> quickly in parallel than we can in serial :-) As an added benefit, because 
>> splice reuses as much of the internal structure of both inputs as possible, 
>> it's impact in terms of heap consumption and churn is less - although I 
>> think that a full implementation might add some Java-side code complexity.
>>
>> If you would like to give 'splice' a try out, you will need to clone my 
>> fork of clojure at github 
>>
>> https://github.com/JulesGosnell/clojure
>>
>> Please let me know if you try out the code. I would be interested to hear 
>> if people think it is worth pursuing.
>>
>> I was also thinking that it should be possible to use a similar trick to 
>> quickly and cheaply split a map/set into [roughly] equal sized pieces 
>> (assuming an good hash distribution). This would enable the use of a 
>> map/set as an input sequence into the parallel reduction process outlined 
>> above. Currently, I believe that only a vector can be used in this way. It 
>> would be harder to arrange that 'count' could be implemented efficiently on 
>> these sub-maps/sets, but this is not important for the reduction process.
>>
>> BTW - benchmarks were run on a 3.2ghz Phenom II / clojure/master / 
>> openjdk-1.7.0_51 / Fedora 20 with min and max 4gb ram.
>>
>> regards,
>>
>>
>>
>> Jules
>>
>>
>>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: fast parallel reduction into hash-set/map

Reply via email to