Part 2: reduction.

I cannot in good conscience write an adverb which claims to be a parallel 
analogue to / but which blindly assumes all verbs are associative.  In 
fact, I will require a property stronger than associativity: that verbs 
reduced in parallel form monoids.  This starts with the following 
definition:

monoid=: 2 :'x u y NB.MONOID'

x u monoid n y behaves exactly the same as x u y, but the former form 
serves as an assertion that u is associative and that it has an identity 
of n.  We can then write our first substitute for u/, which ensures that u 
takes the form u monoid n before executing in parallel.  That looks like 
this:
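For example, the derived verb computes exactly what u alone would:

   monoid=: 2 :'x u y NB.MONOID'
   5 (+ monoid 0) 3
8
   (+ monoid 0)/ i.5    NB. same result as +/ i.5
10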

slash=: {{ par=. NPAR <. #y
 NB. annoying; no late binding for non-verbs
 if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
  u/y return. end.
 id=. 1{::^:3,>u
 u=. (0{1{::,>u`'') `:6  NB. pull out actual verb to take advantage of SC
 if. 0-:#y do. id return. end.
 c=. (#y) <.@% par
 d=. c*i.par
 t=. (c#~<:par) , c+par|#y
 > u&.>/ ([: u/ {&t {. {&d }. y"_) t.''"_1 i.par }}

We farm out par threads, each of which reduces a chunk, and then reduce 
the final results in the main thread before returning.  Good enough?  It 
looks the part:
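The chunking arithmetic deserves a quick illustration.  With, say, 10 
elements split over 4 threads, the start offsets d and lengths t come out 
so that the last chunk absorbs the remainder:

   par=. 4 [ y=. i.10
   c=. (#y) <.@% par
   ]d=. c*i.par                NB. where each chunk starts
0 2 4 6
   ]t=. (c#~<:par) , c+par|#y  NB. how long each chunk is
2 2 2 4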

   a=. i.1e8
   timex'+/a'
0.0680711
   timex'+ monoid 0 slash a'
0.0139139

A thousand times no!  This code is asymptotically suboptimal.  This is 
clearest when NPAR is _, but alas, it is capped at 14 in the present 
implementation.  We can fake it, however, by making u very expensive and y 
very short.  For instance:

   f=. 0:@(6!:3@0.1)
   timex'f monoid 0 slash 16#0'
1.50183
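That 1.5s is 15 serial applications of the 0.1s verb: the largest of the 
14 chunks costs 2 applications, and the serial fold over the 14 partial 
results costs another 13.  We can check the arithmetic:

   par=. 14 [ n=. 16          NB. NPAR is 14 here
   c=. n <.@% par
   t=. (c#~<:par) , c+par|n
   (<: >./t) + <: par         NB. deepest chunk + final serial fold
15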

In other words, we have 15 serial applications of f, which is far too 
many; there should only be 4.  slash's critical path is linear in the 
number of chunks, because the final combine is a serial fold.  We need to 
continue to parallelise even when there are fewer elements of y than 
there are available threads, viz the following updated definition:

slash2=: {{ par=. NPAR <. 2 <.@%~ #y
 NB. annoying; no late binding for non-verbs
 if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
  u/y return. end.
 id=. 1{::^:3,>u
 u=. (0{1{::,>u`'') `:6  NB. pull out actual verb to take advantage of SC
 if. 0-:#y do. id return. end.
 c=. (#y) <.@% par
 d=. c*i.par
 t=. (c#~<:par) , c+par|#y
 y=. >([: u/ {&t {. {&d }. y"_) t.''"_1 i.par
 u monoid id slash2^:(1~:#) y }}

then:

   timex'f monoid 0 slash2 16#0'
0.400691

Much better: four parallel rounds of roughly 0.1s each, exactly the depth 
of a balanced reduction of 16 elements.

(Note: in practice, there would be a minimum number of elements processed 
by a given task, which would in many cases be >2; coming up with this 
minimum requires a cost model for u.  But 2 is asymptotically fine and 
serves its pedagogical purpose here.)
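A crude version of such a floor might look like this, where MIN is a 
hypothetical, uncalibrated constant standing in for a real cost model 
(npar likewise stands in for NPAR so as not to clobber it):

   npar=. 8 [ MIN=. 1000    NB. hypothetical per-task minimum
   taskcount=. {{ 1 >. npar <. MIN <.@%~ #y }}
   taskcount 2500#0
2
   taskcount 100#0          NB. too small to be worth splitting
1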

Another way of seeing the performance difference between slash and slash2 
is to display the execution tree, by analogy to 
https://www.jsoftware.com/help/dictionary/samp19.htm:

   ]r0=. ,&< monoid '' slash 16#0
┌─┬─────────────────────────────────────────────────────────┐
│0│┌─┬─────────────────────────────────────────────────────┐│
│ ││0│┌─┬─────────────────────────────────────────────────┐││
│ ││ ││0│┌─┬─────────────────────────────────────────────┐│││
│ ││ ││ ││0│┌─┬─────────────────────────────────────────┐││││
│ ││ ││ ││ ││0│┌─┬─────────────────────────────────────┐│││││
│ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────────┐││││││
│ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────┐│││││││
│ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────┐││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────┐│││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────┐││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────┐│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────┐││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────┐│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─┐││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│0│││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─┘││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────┘│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────┘││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────┘│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────┘││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────┘│││││││││
│ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────┘││││││││
│ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────────┘│││││││
│ ││ ││ ││ ││ ││ │└─┴─────────────────────────────────┘││││││
│ ││ ││ ││ ││ │└─┴─────────────────────────────────────┘│││││
│ ││ ││ ││ │└─┴─────────────────────────────────────────┘││││
│ ││ ││ │└─┴─────────────────────────────────────────────┘│││
│ ││ │└─┴─────────────────────────────────────────────────┘││
│ │└─┴─────────────────────────────────────────────────────┘│
└─┴─────────────────────────────────────────────────────────┘
   ]r1=. ,&< monoid '' slash2 16#0
┌─────────────────────────────┬─────────────────────────────┐
│┌─────────────┬─────────────┐│┌─────────────┬─────────────┐│
││┌─────┬─────┐│┌─────┬─────┐│││┌─────┬─────┐│┌─────┬─────┐││
│││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││
││││0│0│││0│0│││││0│0│││0│0│││││││0│0│││0│0│││││0│0│││0│0││││
│││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││
││└─────┴─────┘│└─────┴─────┘│││└─────┴─────┘│└─────┴─────┘││
│└─────────────┴─────────────┘│└─────────────┴─────────────┘│
└─────────────────────────────┴─────────────────────────────┘

r1 is completely balanced, but r0 is lopsided.  The depths of these trees 
also agree with the previous run times:

   L.r0
15
   L.r1
4
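That depth of 4 is no accident: a balanced binary reduction of n leaves 
has depth >. 2 ^. n:

   >. 2 ^. 16
4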

Parallel monoid reduction has been studied extensively, e.g. by Fortress. 
Also well-studied is the parallelisation of monoid/\, particularly +/\ 
(the prefix sum).
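The usual trick there also leans on the monoid structure: scan each chunk 
independently, then shift each chunk by an exclusive scan of the chunk 
totals.  A serial sketch of that parallel idea, with + for concreteness:

   y=. i.8
   ch=. 2 4 $ y              NB. two chunks; each row scans independently
   off=. 0 , }: +/"1 ch      NB. exclusive scan of the chunk totals
   , off + +/\"1 ch          NB. matches +/\ y
0 1 3 6 10 15 21 28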