Have you looked at core.rrb-vector, which implements Relaxed Radix Balanced trees for vectors, supporting subvec and concatenation of two arbitrary vectors in O(log n) time?
https://github.com/clojure/core.rrb-vector

Unless I am missing something, the fast concatenation is what you are trying to achieve?

Andy

On Thu, Feb 20, 2014 at 2:59 PM, Jules <jules.gosn...@gmail.com> wrote:
>
> So, having broken the back of fast re-combination of hash sets and maps, I
> wanted to take a look at doing a similar sort of thing for vectors -
> another type of seq that I use very heavily in this sort of situation.
>
> Let's see the cake first:
>
> seqspert.vector=> (def a (into [] (range 10000000)))
> #'seqspert.vector/a
> seqspert.vector=> (time (def b (into a a)))
> "Elapsed time: 3695.682654 msecs"
> #'seqspert.vector/b
> seqspert.vector=> (time (def c (supervec a a)))
> "Elapsed time: 0.253922 msecs"
> #'seqspert.vector/c
> seqspert.vector=> (= b c)
> true
>
> As you can see, the usual method of combining two vectors into a third
> takes 3-4 seconds to combine two 10M-entry vectors. Using 'supervec', it
> takes MUCH less time...
>
> How?
>
> I looked at vector's impl and quickly realised that I would not be able to
> use the same trick that I had used for hash set/map. Then I looked at
> subvec and realised that I could do a very similar trick to create
> supervec.
>
> Subvec provides a view onto a subsequence of a vector that behaves like a
> full vector. Supervec provides a similar view that makes two vectors
> behave like a single one:
>
> seqspert.vector=> (supervec [1 2 3] [4 5 6])
> [1 2 3 4 5 6]
> seqspert.vector=> (type (supervec [1 2 3] [4 5 6]))
> clojure.lang.APersistentVector$SuperVector
>
> We can use seqspert (see my other posting this evening) to look inside a
> SuperVector:
>
> seqspert.vector=> (decloak (supervec [1 2 3] [4 5 6]))
> #seqspert.vector.SuperVector{:left #seqspert.vector.Vector{:cnt 3, :shift
> 5, :root #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil
> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> nil nil nil nil nil nil]}, :tail [1 2 3]}, :right
> #seqspert.vector.Vector{:cnt 3, :shift 5, :root
> #seqspert.vector.VectorNode{:array [nil nil nil nil nil nil nil nil nil nil
> nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
> nil nil nil]}, :tail [4 5 6]}, :middle 3, :count 6}
> seqspert.vector=> (p/pprint (decloak (supervec [1 2 3] [4 5 6])))
> {:left
>  {:cnt 3,
>   :shift 5,
>   :root
>   {:array
>    [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>     nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil]},
>   :tail [1 2 3]},
>  :right
>  {:cnt 3,
>   :shift 5,
>   :root
>   {:array
>    [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil
>     nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil]},
>   :tail [4 5 6]},
>  :middle 3,
>  :count 6}
> nil
> seqspert.vector=>
>
> It looks very complicated - but that is just the two Clojure vectors held
> inside! A SuperVector has fields left and right (the two vectors that we
> are combining), middle (the length of the left vector), and count (the sum
> of the counts of left and right).
>
> With just this info and very little code, I found myself with a working
> impl (except 'pop' - but who 'pop's a vector?).
>
> OK - there is a tiny extra runtime overhead associated with using a
> supervec - but, when you look at what is involved in indexing into a
> vector, this quickly appears insignificant. It's about the same as using a
> subvec.
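(Aside for anyone reading along: the "two vectors behind one view" idea can be sketched in a few lines of user-level Clojure. This is only an illustration - Jules' actual SuperVector is Java code inside APersistentVector and implements the full vector interface, while this toy ConcatView only covers count and nth.)

(deftype ConcatView [left right ^long middle ^long cnt]
  clojure.lang.Counted
  (count [_] cnt)
  clojure.lang.Indexed
  ;; indices below 'middle' read from the left vector, the rest from the right
  (nth [_ i]
    (if (< i middle) (nth left i) (nth right (- i middle))))
  (nth [_ i not-found]
    (if (and (<= 0 i) (< i cnt))
      (if (< i middle) (nth left i) (nth right (- i middle)))
      not-found)))

(defn concat-view [l r]
  (ConcatView. l r (count l) (+ (count l) (count r))))

(count (concat-view [1 2 3] [4 5 6]))   ;=> 6
(nth (concat-view [1 2 3] [4 5 6]) 4)   ;=> 5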
>
> The SuperVector code is checked into my clojure fork on github - see the
> beginning of this thread. Seqspert is also on github - see the announcement
> I posted this evening.
>
> Finally, 'supervec' is currently defined as:
>
> (defn supervec [l r] (APersistentVector$SuperVector. {} l r))
>
> I am thinking of renaming 'splice' from my earlier posting to 'combine'
> and creating a Combinable interface, to be implemented by any seq type that
> is able to do a fast re-combination with another seq of the same type. This
> could be leveraged by e.g. 'into' to deliver significant performance
> benefits and footprint reduction.
>
> more as and when,
>
> Jules
>
>
> On Saturday, 15 February 2014 23:06:24 UTC, Jules wrote:
>
>> Guys,
>>
>> I've been playing with reducers on and off for a while but have been
>> frustrated because they don't seem to fit a particular use case that I
>> have in mind... specifically: getting as many associations into a hash-map
>> as I can in as short a time as possible.
>>
>> My understanding of the reason for this is that reducers practice a
>> divide-and-conquer strategy. The incoming sequence is divided up. Each
>> sub-sequence is reduced into a sub-result (possibly in parallel) and then
>> the sub-results are combined into the final outgoing result.
>>
>> Unfortunately, there does not seem to be a better way of combining two
>> hash-maps than to read each entry from one and create a new corresponding
>> association in the other. This means that each recombination in the above
>> process essentially repeats most of the work already performed in the
>> previous reduction stage.
>>
>> Hash-sets are implemented via hash-maps, and are simpler to demonstrate
>> the problem with:
>>
>> user=> (def a (doall (range 10000000)))
>> #'user/a
>> user=> (def b (doall (range 5000000 15000000)))
>> #'user/b
>> user=> (time (def c (into #{} a)))
>> "Elapsed time: 6319.392669 msecs"
>> #'user/c
>> user=> (time (def d (into #{} b)))
>> "Elapsed time: 5389.805233 msecs"
>> #'user/d
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>> In the example above, you can see that the reduction into hash-sets of
>> two overlapping lists of 10,000,000 elements takes 6.3 and 5.4 seconds.
>> This stage can be carried out in parallel - i.e. the time elapsed for this
>> stage would be 6.3 seconds - but we now have two hash-sets and we want
>> one, so we have to combine them:
>>
>> user=> (time (def e (into c d)))
>> "Elapsed time: 8486.032191 msecs"
>> #'user/e
>>
>> As you can see, all the advantage of splitting the original sequence
>> into two and processing the two halves in parallel is lost, since the
>> recombination of their results takes 8.5 seconds - more than we saved by
>> doing the reduction in parallel.
>>
>> So, what can we do about it?
>>
>> I had a look at the code for PersistentHashMap (PersistentHashSet uses
>> PersistentHashMap internally). I realised that it was possible to "splice"
>> together the internal structure of two hash maps into a single one without
>> repeating most of the work required to build one from scratch. So, I had a
>> go at implementing it:
>>
>> user=> (time (def f (clojure.lang.PersistentHashSet/splice c d)))
>> "Elapsed time: 3052.690911 msecs"
>> #'user/f
>>
>> and:
>>
>> user=> (= e f)
>> true
>>
>> Whilst this still adds 3 seconds to our time, that 3 seconds is half the
>> time that we would have added had we executed the second reduction in
>> serial rather than in parallel.
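(Aside: to see where such a cheap combine step would slot in, here is a rough sketch using reducers' fold, which already separates the per-chunk reduce from the final combine. PersistentHashSet/splice exists only in Jules' fork, so this will not compile against stock Clojure; with stock Clojure the merge arity would have to fall back to (into s1 s2), which is exactly the slow recombination under discussion.)

(require '[clojure.core.reducers :as r])

;; combinef's zero-arg arity supplies the identity value for each chunk;
;; its two-arg arity merges two partial results. reducef (conj) builds
;; each partial result. splice is from Jules' fork, not stock Clojure.
(defn parallel-set [coll]
  (r/fold
   (fn ([] #{})
       ([s1 s2] (clojure.lang.PersistentHashSet/splice s1 s2)))
   conj
   (vec coll)))

(parallel-set (range 10000000))   ; chunks reduced in parallel, spliced at the end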
>>
>> This means that we can now reduce large datasets into sets/maps more
>> quickly in parallel than we can in serial :-) As an added benefit, because
>> splice reuses as much of the internal structure of both inputs as possible,
>> its impact in terms of heap consumption and churn is smaller - although I
>> think that a full implementation might add some Java-side code complexity.
>>
>> If you would like to give 'splice' a try, you will need to clone my
>> fork of clojure on github:
>>
>> https://github.com/JulesGosnell/clojure
>>
>> Please let me know if you try out the code. I would be interested to hear
>> whether people think it is worth pursuing.
>>
>> I was also thinking that it should be possible to use a similar trick to
>> quickly and cheaply split a map/set into [roughly] equal-sized pieces
>> (assuming a good hash distribution). This would enable the use of a
>> map/set as an input sequence to the parallel reduction process outlined
>> above. Currently, I believe that only a vector can be used in this way. It
>> would be harder to arrange for 'count' to be implemented efficiently on
>> these sub-maps/sets, but this is not important for the reduction process.
>>
>> BTW - benchmarks were run on a 3.2GHz Phenom II / clojure/master /
>> openjdk-1.7.0_51 / Fedora 20, with min and max 4GB RAM.
>>
>> regards,
>>
>> Jules
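(Aside, tying back to the core.rrb-vector suggestion at the top of the thread: that library already provides log-time concatenation for RRB-tree-backed vectors via catvec, so a rough, untimed equivalent of the supervec demo - assuming the library is on the classpath - looks like this.)

(require '[clojure.core.rrb-vector :as fv])

(def a (fv/vec (range 10000000)))   ; RRB-tree-backed vector
(def b (fv/catvec a a))             ; concatenation without copying 10M elements
(count b)                           ;=> 20000000
(nth b 10000000)                    ;=> 0 (first element of the second copy)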