Have you looked at core.rrb-vector, which implements Relaxed Radix Balanced trees for vectors, supporting subvec and concatenation of two arbitrary vectors in O(log n) time?
https://github.com/clojure/core.rrb-vector

Unless I am missing something, the fast concatenation is what you are trying to achieve?

Andy

On Thu, Feb 20, 2014 at 2:59 PM, Jules <jules.gosn...@gmail.com> wrote:
>
> So, having broken the back of fast re-combination of hash sets and maps, I
> wanted to take a look at doing a similar sort of thing for vectors -
> another type of seq that I use very heavily in this sort of situation.
>
> Let's see the cake first:
>
> seqspert.vector=> (def a (into [] (range 10000000)))
> #'seqspert.vector/a
> seqspert.vector=> (time (def b (into a a)))
> "Elapsed time: 3695.682654 msecs"
> #'seqspert.vector/b
> seqspert.vector=> (time (def c (supervec a a)))
> "Elapsed time: 0.253922 msecs"
> #'seqspert.vector/c
> seqspert.vector=> (= b c)
> true
>
> As you can see, the usual method of combining two vectors into a third
> takes 3-4 seconds to combine two 10M-entry vectors. Using 'supervec', it
> takes MUCH less time...
>
> How?
>
> I looked at vector's impl and quickly realised that I would not be able to
> use the same trick that I had used for hash set/map. Then I looked at
> subvec and realised that I could do a very similar trick to create
> supervec.
>
> Subvec provides a view onto a subsequence of a vector that behaves like a
> full vector. Supervec provides a similar view that makes two vectors
> behave like a single one:
>
> seqspert.vector=> (supervec [1 2 3] [4 5 6])
> [1 2 3 4 5 6]
> seqspert.vector=> (type (supervec [1 2 3] [4 5 6]))
> clojure.lang.APersistentVector$SuperVector
>
> We can use seqspert (see my other posting this evening) to look inside a
> SuperVector:
>
> seqspert.vector=> (decloak (supervec [1 2 3] [4 5 6]))
> #seqspert.vector.SuperVector{:left #seqspert.vector.Vector{:cnt 3, :shift
> 5, :root #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil
> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> nil nil nil nil nil nil]}, :tail [1 2 3]}, :right
> #seqspert.vector.Vector{:cnt 3, :shift 5, :root
> #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil nil nil nil
> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> nil nil nil]}, :tail [4 5 6]}, :middle 3, :count 6}
> seqspert.vector=> (p/pprint (decloak (supervec [1 2 3] [4 5 6])))
> {:left
>  {:cnt 3,
>   :shift 5,
>   :root
>   {:array
>    [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>     nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil]},
>   :tail [1 2 3]},
>  :right
>  {:cnt 3,
>   :shift 5,
>   :root
>   {:array
>    [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>     nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil]},
>   :tail [4 5 6]},
>  :middle 3,
>  :count 6}
> nil
> seqspert.vector=>
>
> It looks very complicated - but that is just the two Clojure vectors held
> inside! A SuperVector has fields left and right (the two vectors that we
> are combining), middle (the length of the left vector), and count (the sum
> of the counts of left and right).
>
> With just this info and very little code, I found myself with a working
> impl (except 'pop' - but who 'pop's a vector?).
>
> OK - there is a tiny extra runtime overhead associated with using a
> supervec - but, when you look at what is involved in indexing into a
> vector, this quickly appears insignificant. It's about the same as using a
> subvec.
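(Aside for anyone reading along: the "two vectors behind one view" idea can be sketched in a few lines of user-level Clojure. This is only an illustration - Jules' actual SuperVector is Java code inside APersistentVector and implements the full vector interface, while this toy ConcatView only covers count and nth.)

(deftype ConcatView [left right ^long middle ^long cnt]
  clojure.lang.Counted
  (count [_] cnt)
  clojure.lang.Indexed
  ;; indices below 'middle' read from the left vector, the rest from the right
  (nth [_ i]
    (if (< i middle) (nth left i) (nth right (- i middle))))
  (nth [_ i not-found]
    (if (and (<= 0 i) (< i cnt))
      (if (< i middle) (nth left i) (nth right (- i middle)))
      not-found)))

(defn concat-view [l r]
  (ConcatView. l r (count l) (+ (count l) (count r))))

(count (concat-view [1 2 3] [4 5 6]))   ;=> 6
(nth (concat-view [1 2 3] [4 5 6]) 4)   ;=> 5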
>
> The SuperVector code is checked into my clojure fork on github - see the
> beginning of this thread. Seqspert is also on github - see the announcement
> I posted this evening.
>
> Finally, 'supervec' is currently defined as:
>
> (defn supervec [l r] (APersistentVector$SuperVector. {} l r))
>
> I am thinking of renaming 'splice' from my earlier posting to 'combine'
> and creating a Combinable interface, to be implemented by any seq type that
> is able to do a fast re-combination with another seq of the same type. This
> could be leveraged by e.g. 'into' to deliver significant performance
> benefits and footprint reduction.
>
> more as and when,
>
> Jules
>
>
> On Saturday, 15 February 2014 23:06:24 UTC, Jules wrote:
>
>> Guys,
>>
>> I've been playing with reducers on and off for a while but have been
>> frustrated because they don't seem to fit a particular use case that I
>> have in mind... specifically: getting as many associations into a hash-map
>> as I can in as short a time as possible.
>>
>> My understanding of the reason for this is that reducers practice a
>> divide-and-conquer strategy. The incoming sequence is divided up. Each
>> sub-sequence is reduced into a sub-result (possibly in parallel) and then
>> the sub-results are combined into the final outgoing result.
>>
>> Unfortunately, there does not seem to be a better way of combining two
>> hash-maps than to read each entry from one and create a new corresponding
>> association in the other. This means that each recombination in the above
>> process essentially repeats most of the work already performed in the
>> previous reduction stage.
>>
>> Hash-sets are implemented via hash-maps, and are simpler to demonstrate
>> the problem with:
>>
>> user=> (def a (doall (range 10000000)))
>> #'user/a
>> user=> (def b (doall (range 5000000 15000000)))
>> #'user/b
>> user=> (time (def c (into #{} a)))
>> "Elapsed time: 6319.392669 msecs"
>> #'user/c
>> user=> (time (def d (into #{} b)))
>> "Elapsed time: 5389.805233 msecs"
>> #'user/d
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>> In the example above, you can see that the reduction into hash-sets of
>> two overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds.
>> This stage can be carried out in parallel - i.e. the time elapsed for this
>> stage would be 6.3 seconds - but we now have two hash-sets and we want
>> one, so we have to combine them:
>>
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>> As you can see, all the advantage of splitting the original sequence
>> into two and processing the two halves in parallel is lost, since the
>> recombination of their results takes 8.5 seconds - more than we saved by
>> doing the reduction in parallel.
>>
>> So, what can we do about it?
>>
>> I had a look at the code for PersistentHashMap (PersistentHashSet uses
>> PersistentHashMap internally). I realised that it was possible to "splice"
>> together the internal structure of two hash maps into a single one without
>> repeating most of the work required to build one from scratch. So, I had a
>> go at implementing it:
>>
>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
>> "Elapsed time: 3052.690911 msecs"
>> #'user/f
>>
>> and:
>>
>> user=> (= e f)
>> true
>>
>> Whilst this still adds 3 seconds to our time, that 3 seconds is half the
>> time that we would have added had we executed the second reduction in
>> serial rather than in parallel.
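(Aside: to see where such a cheap combine step would slot in, here is a rough sketch using reducers' fold, which already separates the per-chunk reduce from the final combine. PersistentHashSet/splice exists only in Jules' fork, so this will not compile against stock Clojure; with stock Clojure the merge arity would have to fall back to (into s1 s2), which is exactly the slow recombination under discussion.)

(require '[clojure.core.reducers :as r])

;; combinef's zero-arg arity supplies the identity value for each chunk;
;; its two-arg arity merges two partial results. reducef (conj) builds
;; each partial result. splice is from Jules' fork, not stock Clojure.
(defn parallel-set [coll]
  (r/fold
   (fn ([] #{})
       ([s1 s2] (clojure.lang.PersistentHashSet/splice s1 s2)))
   conj
   (vec coll)))

(parallel-set (range 10000000))   ; chunks reduced in parallel, spliced at the end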
>>
>> This means that we can now reduce large datasets into sets/maps more
>> quickly in parallel than we can in serial :-) As an added benefit, because
>> splice reuses as much of the internal structure of both inputs as possible,
>> its impact in terms of heap consumption and churn is smaller - although I
>> think that a full implementation might add some Java-side code complexity.
>>
>> If you would like to give 'splice' a try, you will need to clone my
>> fork of clojure on github:
>>
>> https://github.com/JulesGosnell/clojure
>>
>> Please let me know if you try out the code. I would be interested to hear
>> whether people think it is worth pursuing.
>>
>> I was also thinking that it should be possible to use a similar trick to
>> quickly and cheaply split a map/set into [roughly] equal-sized pieces
>> (assuming a good hash distribution). This would enable the use of a
>> map/set as an input sequence to the parallel reduction process outlined
>> above. Currently, I believe that only a vector can be used in this way. It
>> would be harder to arrange for 'count' to be implemented efficiently on
>> these sub-maps/sets, but this is not important for the reduction process.
>>
>> BTW - benchmarks were run on a 3.2GHz Phenom II / clojure/master /
>> openjdk-1.7.0_51 / Fedora 20, with min and max 4GB RAM.
>>
>> regards,
>>
>> Jules
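(Aside, tying back to the core.rrb-vector suggestion at the top of the thread: that library already provides log-time concatenation for RRB-tree-backed vectors via catvec, so a rough, untimed equivalent of the supervec demo - assuming the library is on the classpath - looks like this.)

(require '[clojure.core.rrb-vector :as fv])

(def a (fv/vec (range 10000000)))   ; RRB-tree-backed vector
(def b (fv/catvec a a))             ; concatenation without copying 10M elements
(count b)                           ;=> 20000000
(nth b 10000000)                    ;=> 0 (first element of the second copy)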