I was going to spend my evening writing more parallel c, but I decided it would probably be more relaxing (if less exciting) to write some parallel j instead. So here we are: some parallel reimplementations of primitives, in increasing order of interest:

First, some basic definitions to start us off:

0&T.@'' ::] ^:_'' NB.eat all the cores
NPAR=. 1 T.''

Now, let's try to implement rank, the quintessence of parallel iteration in j. Actually, let's just implement monadic "_1, which iterates in parallel over the major cells of y. The obvious formulation is:

peach0=: <@:((t.'')"_1)

but this is almost certainly not what we want: virtual nouns cannot be passed between threads, so this incurs a copy of each cell, which will be expensive unless u itself is expensive and y's major cells are small. We also need y to have a small number of major cells, or else we will cause scheduler contention (or, in the current implementation, we'll block everything on the main thread).

peach0 is important as an archetype of all forms of parallel iteration, because the limitations I mentioned are fundamental to the current implementation. However, it's necessary to layer additional abstractions on top of this primitive to get an actually performant parallel implementation of "_1.

Here's a first testcase:

   a=. i.4 1e8
   timex'+/"_1 a'
0.225933
   timex'+/peach0 a'
0.911939

The solution to this problem that I presented the other day is to pass a task identifier to the task, and use it to construct the virtual noun there instead of in the creating thread, viz:

peach1=: {{ > ([: u {&y) t.''"_1 i.#y }}

Now we see an improvement!

   timex'+/peach1 a'
0.0738621

This is not completely general, though; it starts up a thread for _every_ major cell of y, so if there are many small cells, it will perform quite badly. A better strategy, then, is to start up NPAR tasks, giving each a contiguous slice of y, and to use the primitive u"n to operate on the cells within that slice. Here I use the dyads {. and }. to construct that slice:

prankb=: {{ par=. NPAR <. #y
 c=. (#y) <.@% par          NB.chunk size
 d=. c*i.par                NB.drops, see also +/\par#c
 t=. (c#~<:par) , c+par|#y  NB.takes, last of which may be oversized
 ([: u"n {&t {. {&d }. y"_) t.''"_1 i.par }}
prank=: ;@:prankb
peach2=: prank _1

Another nice thing about this definition is that, since it also uses the primitive ", it is able to take advantage of IRS (integrated rank support) once hardware parallelism runs out. Performance compares favourably on a transposed a:

   a=. i.1e8 4
   timex'+/"_1 a'
0.458253
   timex'+/peach2 a'
0.311498

but it is not quite so impressive as in the previous case. The majority of the time is actually spent in ;, copying result cells out of boxes, because there are so many results, and each is individually so cheap to compute. It's for this reason I also expose the definition of prankb, which leaves its result chunks boxed, as it may be possible to profit by operating on the results in that form. If, for instance, we try to sum up all the elements in the array, the numbers get nicer:

   timex'+/+/"_1 a'
0.529847
   timex'+/+/&>+/prankb _1 a'
0.130087

We'll make this even faster later.

There is more work to be done on prank; aside from the obvious bug and the lack of ambivalence, it assumes there are enough major cells to satisfy all threads. In the event that there are fewer major cells than NPAR, and n is deeper than _1, it could profit by subdividing along other axes. I haven't bothered doing this yet.

I also made a parallel implementation of x u\ y, but it proved not very interesting, so I won't reproduce it here. It is effectively the same as prankb--including the metacircular reversion to the primitive \--except that d and t need to be computed differently. I think the template used there is a general one; it is essentially how you would write it in c.

Next: reductions.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
