(Note: an even smarter parallel / is possible, given the non-homogeneous performance of u, but it needs primitives with which to build a concurrent queue. A primitive 'wait for any' (instead of 'wait for all') would be of use here--I cannot bear to poll--but proper queues would be better.)

On Tue, 12 Apr 2022, Elijah Stone wrote:

Part 2: reduction.

I cannot in good conscience write an adverb which claims to be a parallel analogue to / but which blindly assumes all verbs are associative. In fact, I will require a property stronger than associativity: that verbs reduced in parallel be monoids. This starts with the following definition:

monoid=: 2 :'x u y NB.MONOID'

x u monoid n y behaves exactly the same as x u y, but the former form serves as an assertion that u is associative and that it has an identity of n. We can then write our first substitute for u/, which checks that u takes the form u monoid n before executing in parallel (and falls back to serial u/ otherwise). That looks like this:

slash=: {{ par=. NPAR <. #y
NB. annoying; no late binding for non-verbs
if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
 u/y return. end.
id=. 1{::^:3,>u
u=. (0{1{::,>u`'') `:6  NB. pull out actual verb to take advantage of SC
if. 0-:#y do. id return. end.
c=. (#y) <.@% par
d=. c*i.par
t=. (c#~<:par) , c+par|#y
> u&.>/ ([: u/ {&t {. {&d }. y"_) t.''"_1 i.par }}

We farm out par threads, each of which reduces a chunk, and then reduce the final results in the main thread before returning. Good enough? It looks the part:
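For readers who do not speak J, the scheme slash implements can be sketched in Python (an illustrative analogue, not the J implementation; the names slash, op, identity, and npar are mine):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def slash(op, identity, xs, npar=4):
    """Chunked parallel reduction: split xs into at most npar chunks,
    reduce each chunk in its own task, then fold the per-chunk results
    serially in the main thread.  The (op, identity) pair plays the
    role of u monoid n: op must be associative with that identity."""
    n = len(xs)
    if n == 0:
        return identity
    par = min(npar, n)
    c = n // par                          # base chunk size
    # the first par-1 chunks get c elements; the last absorbs the remainder
    bounds = [i * c for i in range(par)] + [n]
    chunks = [xs[bounds[i]:bounds[i + 1]] for i in range(par)]
    with ThreadPoolExecutor(max_workers=par) as pool:
        partials = list(pool.map(lambda ch: reduce(op, ch), chunks))
    return reduce(op, partials)           # serial combine in the main thread
```

For instance, slash((lambda a, b: a + b), 0, list(range(100))) gives 4950, the same answer as a plain fold.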

  a=. i.1e8
  timex'+/a'
0.0680711
  timex'+ monoid 0 slash a'
0.0139139

A thousand times no! This code is asymptotically suboptimal. This is clearest when NPAR is _, but alas, it is capped at 14 in the present implementation. We can fake it, however, by making u very expensive and y very short. For instance:

  f=. 0:@(6!:3@0.1)
  timex'f monoid 0 slash 16#0'
1.50183

In other words, we pay for 15 serial applications of f, which is far too many; there should be only 4. slash is linear in depth where it should be logarithmic. We need to continue to parallelise even when there are fewer elements of y than there are available threads, viz. the following updated definition:

slash2=: {{ par=. NPAR <. 2 <.@%~ #y
NB. annoying; no late binding for non-verbs
if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
 u/y return. end.
id=. 1{::^:3,>u
u=. (0{1{::,>u`'') `:6  NB. pull out actual verb to take advantage of SC
if. 0-:#y do. id return. end.
c=. (#y) <.@% par
d=. c*i.par
t=. (c#~<:par) , c+par|#y
y=. >([: u/ {&t {. {&d }. y"_) t.''"_1 i.par
u monoid id slash2^:(1~:#) y }}

then:

  timex'f monoid 0 slash2 16#0'
0.400691

Much better!

(Note: in practice, there would be a minimum number of elements processed by a given task, which would in many cases be >2; coming up with this minimum requires a cost model for u. But 2 is asymptotically fine and serves its pedagogical purpose here.)
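The recursive scheme can be sketched in Python (an illustrative analogue, not the J implementation; the names slash2, op, identity, and npar are mine):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def slash2(op, identity, xs, npar=4):
    """Recursive parallel reduction: each round splits xs into at most
    min(npar, len(xs)//2) chunks, reduces the chunks in parallel, then
    recurses on the per-chunk results.  The number of rounds is thus
    logarithmic in len(xs), where a single fan-out/serial-combine pass
    is linear."""
    if len(xs) == 0:
        return identity
    if len(xs) == 1:
        return xs[0]
    par = min(npar, len(xs) // 2)         # every chunk holds >= 2 elements
    c = len(xs) // par
    bounds = [i * c for i in range(par)] + [len(xs)]
    chunks = [xs[bounds[i]:bounds[i + 1]] for i in range(par)]
    with ThreadPoolExecutor(max_workers=par) as pool:
        partials = list(pool.map(lambda ch: reduce(op, ch), chunks))
    return slash2(op, identity, partials, npar)
```

Because each round at least halves the number of values in play, the serial depth is O(log n) rather than the O(n) combine that the first version pays for.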

Another way of seeing the performance difference between slash and slash2 is to display the execution tree, by analogy to https://www.jsoftware.com/help/dictionary/samp19.htm:

  ]r0=. ,&< monoid '' slash 16#0
┌─┬─────────────────────────────────────────────────────────┐
│0│┌─┬─────────────────────────────────────────────────────┐│
│ ││0│┌─┬─────────────────────────────────────────────────┐││
│ ││ ││0│┌─┬─────────────────────────────────────────────┐│││
│ ││ ││ ││0│┌─┬─────────────────────────────────────────┐││││
│ ││ ││ ││ ││0│┌─┬─────────────────────────────────────┐│││││
│ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────────┐││││││
│ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────┐│││││││
│ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────┐││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────┐│││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────┐││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────┐│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────┐││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────┐│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─┐││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│0│││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─┘││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────┘│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────┘││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────┘│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────┘││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────┘│││││││││
│ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────┘││││││││
│ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────────┘│││││││
│ ││ ││ ││ ││ ││ │└─┴─────────────────────────────────┘││││││
│ ││ ││ ││ ││ │└─┴─────────────────────────────────────┘│││││
│ ││ ││ ││ │└─┴─────────────────────────────────────────┘││││
│ ││ ││ │└─┴─────────────────────────────────────────────┘│││
│ ││ │└─┴─────────────────────────────────────────────────┘││
│ │└─┴─────────────────────────────────────────────────────┘│
└─┴─────────────────────────────────────────────────────────┘
  ]r1=. ,&< monoid '' slash2 16#0
┌─────────────────────────────┬─────────────────────────────┐
│┌─────────────┬─────────────┐│┌─────────────┬─────────────┐│
││┌─────┬─────┐│┌─────┬─────┐│││┌─────┬─────┐│┌─────┬─────┐││
│││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││
││││0│0│││0│0│││││0│0│││0│0│││││││0│0│││0│0│││││0│0│││0│0││││
│││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││
││└─────┴─────┘│└─────┴─────┘│││└─────┴─────┘│└─────┴─────┘││
│└─────────────┴─────────────┘│└─────────────┴─────────────┘│
└─────────────────────────────┴─────────────────────────────┘

r1 is completely balanced, but r0 is lopsided. The depths of these trees also agree with the previous run times:

  L.r0
15
  L.r1
4
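The depths themselves follow from simple arithmetic: a left fold nests one application per element after the first, while a balanced combine halves the element count each round. Sketched in Python:

```python
import math

def fold_depth(n):
    # serial u/ : one nested application per element after the first
    return n - 1

def balanced_depth(n):
    # a balanced combine halves the element count each round
    return math.ceil(math.log2(n))
```

fold_depth 16 is 15 and balanced_depth 16 is 4, matching L.r0 and L.r1 above (and, scaled by the 0.1s cost of f, the 1.5s and 0.4s timings).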

Parallel monoid reduction has been studied extensively, e.g. by Fortress. Also well-studied is the parallelisation of monoid/\, particularly +/\ (parallel prefix sum).
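The standard decomposition that makes monoid/\ parallelisable can itself be sketched in Python (a sketch with serial loops standing in for the chunk-parallel passes; the names chunked_scan, op, identity, and npar are mine):

```python
from itertools import accumulate

def chunked_scan(op, identity, xs, npar=4):
    """Two-pass chunked scan: (1) scan each chunk independently,
    (2) exclusive-scan the chunk totals to get per-chunk offsets,
    then apply each offset across its chunk.  Pass 1 and the final
    offsetting are chunk-parallel; associativity of op is what makes
    the regrouping valid."""
    n = len(xs)
    if n == 0:
        return []
    par = min(npar, n)
    c = n // par
    bounds = [i * c for i in range(par)] + [n]
    chunks = [xs[b:e] for b, e in zip(bounds, bounds[1:])]
    scans = [list(accumulate(ch, op)) for ch in chunks]          # pass 1
    totals = [s[-1] for s in scans[:-1]]
    offsets = list(accumulate([identity] + totals, op))          # pass 2
    return [op(off, v) for off, s in zip(offsets, scans) for v in s]
```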
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
