Re: fast parallel reduction into hash-set/map

Jules Sat, 15 Mar 2014 15:37:27 -0700

Sorry that it has taken me so long to reply to this - I wanted to make sure 
that I had a look at rrb-vector and properly digest the points that you 
have made.

1. I've dumped the supervector idea :-) I ran some timings and as the 
number of levels of supervector increased, iteration times degraded faster 
than I had expected - exactly as you said.

2. I've had a look at rrb-vector - very fast for catvec - agreed, but there 
does appear to be a performance penalty in terms of conj[!]-ing - I can 
post my results if you  are interested. I read the Bagwell paper on which 
your work was based and came away with the impression that he did not think 
he was introducing significant overhead with his improvements for constant 
tiime concatenation - so maybe this is due to rrb being impled in clojure 
and builting vec in java ? 

3 whilst doing this, I have also been looking at sumatra 
(http://openjdk.java.net/projects/sumatra/) and considering how I would 
leverage all my gpgpu's cores to do this sort of thing efficiently. I came 
up with a very specific solution for mapping a function across a vector and 
into another vector which is extremely lightweight in both sequential and 
parallel modes:

(def a (into [] (range 1000000)))
(time (do (into [] a) nil)) ;; 124 ms
(time (do (mapv identity a) nil)) ;; 220 ms
(time (do (vmap identity a) nil)) ;; 140 ms
(time (do (mapv inc a) nil)) ;; 280 ms
(time (do (vmap inc a) nil)) ;; 240 ms

(def n (/ (count a) (.availableProcessors (Runtime/getRuntime)))) ;; = 
625000 ;; I have 16 :-)

(do (time (r/fold (r/monoid into vector) conj a)) nil) ;; 380 ms
(do (time (r/fold (r/monoid v/catvec v/vector) conj a)) nil) ;; 380 ms

(do (time (r/fold (r/monoid into vector) conj (r/map inc a))) nil) ;; 620 ms
(do (time (r/fold (r/monoid v/catvec v/vector) conj (r/map inc a))) nil) ;; 
680 ms

(do (time (r/fold n (r/monoid into vector) conj (r/map inc a))) nil) ;; 590 
ms
(do (time (r/fold n (r/monoid v/catvec v/vector) conj (r/map inc a))) nil) 
;; 320 - 520 - erratic !

(time (count (vmap inc a))) ;; 230 ms ;; sequential
(time (count (fjvmap inc a))) ;; 40 ms ;; parallel
(= (time (r/fold n (r/monoid v/catvec v/vector) conj (r/map inc a))) (pvmap 
inc a)) ;; -> true

The timings are very rough, but give a reasonable idea of performance 
differences. My code is hidden behind vmap  (sequential) and fjvmap 
(parallel using fork/join).

vmap is very simple - it understands a vector's internal structure - 
basically a tree in which each node holds an array size 32 of branches or 
leaves (but not both). It recurses down to a node containing a source 
array, makes a corresponding target array then applies either itself 
recursively, if it is operating at branch level, or the function to be 
applied, if it operating at leaf level, to each source array item putting 
the result into the same slot in the target array. fjvmap does the same, 
but hands off each of the 32 top-level jobs to an f/j pool.

I came across this approach because it was the way that I expected the gpu 
to want its work laid out for it - an array of inputs and an  array into 
which to place corresponding outputs that could be handed off to a 
workgroup of  e.g. 32 gpu cores - but then I found that it worked very well 
without a gpu, since it bypasses all the infrastructure that makes seqs 
look the same and only deals with a particular  case.

This got me thinking though. core.reducers already makes a special case for 
vectors, so I don't think I am cheating. It seems to me that at an 
application level we should think of seqs being different impls of the same 
abstraction, but as library writers we should be prepared to leverage 
specific knowledge to provide a more performant solution. Seq impls do 
break down in a natural way and therefore efficient way - many into trees 
of arrays of size 32. Taking advantage of this seems to yield better 
performance and less churn so maybe we should pursue it.

I'll be checking the source for vmap and fjmap into seqspert soon if anyone 
is interested.

I am going to spend some time looking at fitting it into the core.reducers 
framework. I've also discovered that the APU (CPU/GPU combo) that I am 
thinking of buying seems to arrange its gpu cores in workgroups of 64, not 
32 :-), so I might have to investigate parameterising the branching factor 
of the builtin vector. I am also very interested in rrb-vector because it 
allows different size nodes. If I was mapping over a clojure hashmap/set 
which are tries of size 32 arrays and wanted to build up a result vector in 
the same way as vmap does, I would need a vector type with variable sized 
nodes - so I will try to find time to look into this as well....

sorry posting is so long - wanted to just put it all out there and see if 
anyone was interested in discussing it further - it is always good to 
bounce stuff of other people.

that's all folks !

Jules

On Friday, 21 February 2014 01:09:26 UTC, Jean Niklas L'orange wrote:
>
> Hi Jules,
>
> On Thursday, February 20, 2014 11:59:03 PM UTC+1, Jules wrote:
>>
>> Subvec provides a view on a subseq of a vector that behaves like a full 
>> vector. Supervec provides a similar view that makes two vectors behave like 
>> a single one
>>
>
> This data structure ("supervec") is usually known as a rope, and your 
> implementation degrades the *~O(1)* to worst case *O(k)* time for 
> lookups, assoc, conj and pop, where *k* is the amount of concatenations 
> performed. A good rope implementation can reduce that factor down to *O(log 
> k)* through balancing, but it will still break the performance guarantees 
> a persistent vector is supposed to have. If you try to use this in the 
> reducers library, the amount of concatenations fork/join performs might 
> actually show notable performance degradation for those operations. 
> Further, passing it around to code which might perform more concatenations 
> will lead to even worse performance, which may be hard to debug for people 
> not knowing Clojure internals.
>
> However, it is a very nice solution if you know there will be limited 
> concatenations, that the concatenations result in a balanced tree, and you 
> don't pass this vector around to other people. :)
>
> For a small discussion on this vs. RRB-trees, see Section 1.1 in 
> http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf.
>
> Hopefully I'm not destroying your fun playing around with these 
> things—that is not the intent at all. I'm just saying that these things are 
> (sadly?) harder than it first looks like.
>
> -- JN
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: fast parallel reduction into hash-set/map

Reply via email to