Part 2: reduction. I cannot in good conscience write an adverb which claims to be a parallel analogue to / but which blindly assumes all verbs are associative. In fact, I will require a stronger than associativity: that verbs reduced in parallel are monoids. This starts with the following definition:
monoid=: 2 :'x u y NB.MONOID' x u monoid n y behaves exactly the same as x u y, but the former form serves as an assertion that u is associative and that it has an identity of n. We can then write our first substitute for u/, which ensures that u takes the form u monoid n before executing in parallel. That looks like this: slash=: {{ par=. NPAR <. #y NB. annoying; no late binding for non-verbs if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do. u/y return. end. id=. 1{::^:3,>u u=. (0{1{::,>u`'') `:6 NB. pull out actual verb to take advantage of SC if. 0-:#y do. id return. end. c=. (#y) <.@% par d=. c*i.par t=. (c#~<:par) , c+par|#y > u&.>/ ([: u/ {&t {. {&d }. y"_) t.''"_1 i.par }} We farm out par threads, each of which reduces a chunk, and then reduce the final results in the main thread before returning. Good enough? It looks the part: a=. i.1e8 timex'+/a' 0.0680711 timex'+ monoid 0 slash a' 0.0139139 A thousand times no! This code is asymptotically suboptimal. This is clearest when NPAR is _, but alas, it is capped at 14 in the present implementation. We can fake it, however, by making u very expensive and y very short. For instance: f=. 0:@(6!:3@0.1) timex'f monoid 0 slash 16#0' 1.50183 In other words, we have 15 serial applications of f, which is far too many; there should only be 4. slash is linear. We need to continue to parallelise even when there are fewer elements of y than there are available threads, viz the following updated definition: slash2=: {{ par=. NPAR <. 2 <.@%~ #y NB. annoying; no late binding for non-verbs if. -. u`'' -:&(0&{@,@>) + monoid 0 `'' do. u/y return. end. id=. 1{::^:3,>u u=. (0{1{::,>u`'') `:6 NB. pull out actual verb to take advantage of SC if. 0-:#y do. id return. end. c=. (#y) <.@% par d=. c*i.par t=. (c#~<:par) , c+par|#y y=. >([: u/ {&t {. {&d }. y"_) t.''"_1 i.par u monoid id slash2^:(1~:#) y }} then: timex'f monoid 0 slash2 16#0' 0.400691 much better! (Note: in practice, there would be a minimum number of elements processed by a given task, which would in many cases be >2; coming up with this minimum requires a cost model for u. But 2 is asymptotically fine and serves its pedagogical purpose here.) Another way of seeing the performance difference between slash and slash2 is to display the execution tree, by analogy to https://www.jsoftware.com/help/dictionary/samp19.htm: ]r0=. ,&< monoid '' slash 16#0 ┌─┬─────────────────────────────────────────────────────────┐ │0│┌─┬─────────────────────────────────────────────────────┐│ │ ││0│┌─┬─────────────────────────────────────────────────┐││ │ ││ ││0│┌─┬─────────────────────────────────────────────┐│││ │ ││ ││ ││0│┌─┬─────────────────────────────────────────┐││││ │ ││ ││ ││ ││0│┌─┬─────────────────────────────────────┐│││││ │ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────────┐││││││ │ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────────┐│││││││ │ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────────┐││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────────┐│││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────────┐││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────────┐│││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────────┐││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─────┐│││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│┌─┬─┐││││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││0│0│││││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─┘││││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────┘│││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────┘││││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────┘│││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────┘││││││││││ │ ││ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────┘│││││││││ │ ││ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────┘││││││││ │ ││ ││ ││ ││ ││ ││ │└─┴─────────────────────────────┘│││││││ │ ││ ││ ││ ││ ││ │└─┴─────────────────────────────────┘││││││ │ ││ ││ ││ ││ │└─┴─────────────────────────────────────┘│││││ │ ││ ││ ││ │└─┴─────────────────────────────────────────┘││││ │ ││ ││ │└─┴─────────────────────────────────────────────┘│││ │ ││ │└─┴─────────────────────────────────────────────────┘││ │ │└─┴─────────────────────────────────────────────────────┘│ └─┴─────────────────────────────────────────────────────────┘ ]r1=. ,&< monoid '' slash2 16#0 ┌─────────────────────────────┬─────────────────────────────┐ │┌─────────────┬─────────────┐│┌─────────────┬─────────────┐│ ││┌─────┬─────┐│┌─────┬─────┐│││┌─────┬─────┐│┌─────┬─────┐││ │││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││││┌─┬─┐│┌─┬─┐│││┌─┬─┐│┌─┬─┐│││ ││││0│0│││0│0│││││0│0│││0│0│││││││0│0│││0│0│││││0│0│││0│0││││ │││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││││└─┴─┘│└─┴─┘│││└─┴─┘│└─┴─┘│││ ││└─────┴─────┘│└─────┴─────┘│││└─────┴─────┘│└─────┴─────┘││ │└─────────────┴─────────────┘│└─────────────┴─────────────┘│ └─────────────────────────────┴─────────────────────────────┘ r1 is completely balanced, but r0 is lopsided. The depths of these trees also agree with the previous run times: L.r0 15 L.r1 4 Parallel monoid reduction has been studied extensively, e.g. by fortress. Also well-studied is parallelization of monoid/\, particularly +/\. ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm