(Note: an even smarter parallel / is possible, given non-homogeneous
performance of u, but it needs primitives from which to build a concurrent
queue. A 'wait for any' primitive (instead of 'wait for all') would be of
use here--I cannot bear to poll--but proper queues would be better.)
On Tue, 12 Apr 2022, Elijah Stone wrote:
Part 2: reduction.
I cannot in good conscience write an adverb which claims to be a parallel
analogue to / but which blindly assumes all verbs are associative. In
fact, I will require something stronger than associativity: that verbs
reduced in parallel form monoids. This starts with the following definition:
monoid=: 2 :'x u y NB.MONOID'
x u monoid n y behaves exactly the same as x u y, but the former form
serves as an assertion that u is associative and that it has n as its
identity element.
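For instance (a small check of my own, not from the original post; any
verb with a known identity element will do):
2 (+ monoid 0) 3
5
(* monoid 1)/ 2 3 4
24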
We can then write our first substitute for u/, which checks that u takes
the form u monoid n before executing in parallel (and falls back to serial
u/ otherwise). That looks like this:
slash=: {{ par=. NPAR <. #y  NB. number of tasks: at most NPAR, at most one per item
  NB. annoying; no late binding for non-verbs
  if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
    u/y return. end.
  id=. 1{::^:3,>u
  u=. (0{1{::,>u`'') `:6 NB. pull out actual verb to take advantage of SC
  if. 0-:#y do. id return. end.  NB. an empty reduction yields the identity
  c=. (#y) <.@% par          NB. items per chunk (except the last)
  d=. c*i.par                NB. how many items each task drops
  t=. (c#~<:par) , c+par|#y  NB. how many items each task takes; the last absorbs the remainder
  NB. reduce each chunk in its own task, then fold the partial results in this thread
  > u&.>/ ([: u/ {&t {. {&d }. y"_) t.''"_1 i.par }}
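To make the chunking concrete, here is my own illustration (not from the
original post), pretending par came out to 4 for a 10-item y:
c=. 10 <.@% 4
d=. c*i.4
t=. (c#~<:4) , c+4|10
d;t
┌───────┬───────┐
│0 2 4 6│2 2 2 4│
└───────┴───────┘
t {.&.> d }.&.> <i.10
┌───┬───┬───┬───────┐
│0 1│2 3│4 5│6 7 8 9│
└───┴───┴───┴───────┘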
We farm out par threads, each of which reduces a chunk, and then reduce
the partial results in the main thread before returning. Good enough? It
looks the part:
a=. i.1e8
timex'+/a'
0.0680711
timex'+ monoid 0 slash a'
0.0139139
A thousand times no! This code is asymptotically suboptimal. This is
clearest when NPAR is _, but alas, it is capped at 14 in the present
implementation. We can fake it, however, by making u very expensive and y
very short. For instance:
f=. 0:@(6!:3@0.1)
timex'f monoid 0 slash 16#0'
1.50183
In other words, the critical path contains 15 serial applications of f: the
lone 3-element chunk costs 2 applications, and combining the 14 partial
results in the main thread costs another 13. That is far too many; a
balanced reduction tree over 16 elements is only 4 applications deep.
slash's critical path is linear (in #y, were NPAR unbounded). We need to
continue to parallelise even when there are fewer elements of y than there
are available threads, viz. the following updated definition:
slash2=: {{ par=. NPAR <. 2 <.@%~ #y  NB. number of tasks: at least 2 items per task
  NB. annoying; no late binding for non-verbs
  if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do.
    u/y return. end.
  id=. 1{::^:3,>u
  u=. (0{1{::,>u`'') `:6 NB. pull out actual verb to take advantage of SC
  if. 0-:#y do. id return. end.
  c=. (#y) <.@% par
  d=. c*i.par
  t=. (c#~<:par) , c+par|#y
  y=. >([: u/ {&t {. {&d }. y"_) t.''"_1 i.par  NB. one round of parallel chunk reductions
  NB. recurse on the partial results until a single one remains
  u monoid id slash2^:(1~:#) y }}
then:
timex'f monoid 0 slash2 16#0'
0.400691
much better!
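That is just what a balanced reduction predicts: a balanced binary tree over
16 leaves is 4 levels deep, and each level costs one 0.1s application of f.
2 ^. 16
4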
(Note: in practice, there would be a minimum number of elements processed
by a given task, which would in many cases be >2; coming up with this
minimum requires a cost model for u. But 2 is asymptotically fine and
serves its pedagogical purpose here.)
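A minimal sketch of what such a minimum might look like (my own; mc is a
hypothetical minimum chunk size, and the rest of slash2 is unchanged) would
be to compute par as
par=. 1 >. NPAR <. mc <.@%~ #y
so that no task is created for fewer than mc items of y.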
Another way of seeing the performance difference between slash and slash2
is to display the execution tree, by analogy to
https://www.jsoftware.com/help/dictionary/samp19.htm:
]r0=. ,&< monoid '' slash 16#0
┌─┬─────────────────────────────────────────────────────────┐
│0│┌─┬─────────────────────────────────────────────────────┐│
│ ││0│┌─┬─────────────────────────────────────────────────┐││
│ ││ ││0│┌─┬─────────────────────────────────────────────┐│││
│ ││ ││ ││0│┌─┬─────────────────────────────────────────┐││││
│ ││ ││ ││ ││0│┌─┬─────────────────────────────────────┐│││││
│ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────────┐││││││
│ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────┐│││││││
│ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────┐││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────┐│││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────┐││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────┐│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────┐││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────┐│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─┐││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│0│││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─┘││││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────┘│││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────┘││││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────┘│││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────┘││││││││││
│ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────┘│││││││││
│ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────┘││││││││
│ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────────┘│││││││
│ ││ ││ ││ ││ ││ │└─┴─────────────────────────────────┘││││││
│ ││ ││ ││ ││ │└─┴─────────────────────────────────────┘│││││
│ ││ ││ ││ │└─┴─────────────────────────────────────────┘││││
│ ││ ││ │└─┴─────────────────────────────────────────────┘│││
│ ││ │└─┴─────────────────────────────────────────────────┘││
│ │└─┴─────────────────────────────────────────────────────┘│
└─┴─────────────────────────────────────────────────────────┘
]r1=. ,&< monoid '' slash2 16#0
┌─────────────────────────────┬─────────────────────────────┐
│┌─────────────┬─────────────┐│┌─────────────┬─────────────┐│
││┌─────┬─────┐│┌─────┬─────┐│││┌─────┬─────┐│┌─────┬─────┐││
│││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││
││││0│0│││0│0│││││0│0│││0│0│││││││0│0│││0│0│││││0│0│││0│0││││
│││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││
││└─────┴─────┘│└─────┴─────┘│││└─────┴─────┘│└─────┴─────┘││
│└─────────────┴─────────────┘│└─────────────┴─────────────┘│
└─────────────────────────────┴─────────────────────────────┘
r1 is completely balanced, but r0 is lopsided. The depths of these trees
also agree with the previous run times:
L.r0
15
L.r1
4
Parallel monoid reduction has been studied extensively, e.g. by Fortress.
Also well-studied is the parallelisation of monoid/\, particularly +/\.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm