Hi @mratsim —

I may have been unclear in my last response: I didn't mean to imply that there 
are no realistic workloads that would result in recursive tasks or benefit from 
task serialization/squashing/stealing techniques, just that the computations 
and users that have found their way to Chapel typically haven't exercised that 
style, so it hasn't been an area of focus for our team (combined with our 
choice to make Chapel tasks more arbitrary/unpredictable than most task-based 
programming systems).

Also, I should note that it's still possible to write recursive or dynamically 
balanced tasking programs in Chapel; it's just that we typically express those 
patterns within the language or library rather than in the implementation. For 
example, our libraries have iterators that can dynamically distribute 
iterations either locally across a compute node's cores or globally across 
multiple compute nodes and their cores. As another example, if I rewrite my 
naive Fibonacci computation to manually throttle task creation rather than 
blindly spawning exponentially many tasks, I can reduce the fib(40) case to 
just over a second on my 4-core laptop (I'll include my fib codes at the 
bottom of this post for reference).
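For readers unfamiliar with those dynamically distributing iterators, here's a 
minimal sketch using the `dynamic` iterator from Chapel's `DynamicIters` 
module (the range, chunk size, and squaring computation here are just 
placeholders, not from any benchmark):

    use DynamicIters;

    config const n = 1000;

    var results: [1..n] int;

    // Deal out chunks of `chunkSize` iterations to tasks on the local
    // compute node: each task grabs the next available chunk when it
    // finishes its current one, balancing uneven per-iteration costs.
    forall i in dynamic(1..n, chunkSize=4) do
      results[i] = i * i;

    writeln(+ reduce results);

The module also provides `guided` and `adaptive` iterators with different 
chunking strategies, and there are distributed-memory counterparts for 
spreading iterations across multiple compute nodes.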

My colleague pointed out something from your earlier post that I didn't pick up 
on (and I imagine other readers may not have either): the Nemesis and Sherwood 
lines in the graphs you showed above are scheduler options in the Sandia 
Qthreads library that Chapel targets by default, so they are representative of 
what's possible in Chapel today. Qthreads defaults to the Sherwood scheduler, 
but Chapel uses the Nemesis scheduler by default instead (though the user can 
override this choice). We found that while the Sherwood scheduler was highly 
tuned for UTS (Unbalanced Tree Search) and great for highly unbalanced 
workloads with many, many tasks, it had too much overhead for well-balanced 
operations, which are what we tend to care most about. In contrast, the 
Nemesis scheduler avoids those overheads and has much better performance for 
balanced workloads, but gives up work stealing.

Ultimately we hope to have the best of both worlds. We had started some work 
with the Qthreads team on a "distrib" scheduler that would do just that 
(https://github.com/Qthreads/qthreads/issues/39), but the effort is currently 
stalled out.

Wrapping up, here's my naive Fibonacci in Chapel:

    config const n = 10;
    
    writeln("fib(",n,") = ", fib(n));
    
    proc fib(n): int {
      if n < 2 then
        return n;
      else {
        var i, j: int;
        sync {
          begin with (ref i) i = fib(n-1);
          j = fib(n-2);
        }
        return i+j;
      }
    }
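Since `n` is a `config const`, the problem size can be changed from the 
command line without recompiling. A typical build-and-run looks like this 
(the file and binary names are just examples):

    # Compile with optimizations; --fast disables runtime checks and
    # enables backend optimization, which matters for task-heavy code.
    chpl --fast fib.chpl -o fib

    # Override the config const declared in the source (default n = 10).
    ./fib --n=40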

Here's my manually throttled version:

    config const n = 10;
    
    writeln("fib(",n,") = ", fib(n));
    
    proc fib(n): int {
      if n < 2 then
        return n;
      else {
        if (here.runningTasks() >= here.numPUs()) then
          return fib(n-1) + fib(n-2);
        else {
          var i, j: int;
          sync {
            begin with (ref i) i = fib(n-1);
            j = fib(n-2);
          }
          return i+j;
        }
      }
    }

And here's a very simple example of the bounded buffer pattern I was 
mentioning. We've never really treated this as a benchmark; it's more of a 
demonstration of how easy task parallelism can be in Chapel, where the key 
points are (a) task creation is simple and (b) sync variables avoid the need 
to worry about the typical overflow/underflow conditions:

    config const buffsize = 10,
                 n = 101;
    
    var A: [0..#buffsize] sync int;  // the buffer needs only buffsize slots
    
    cobegin {
      producer();
      consumer();
    }
    
    proc producer() {
      for i in 0..#n do          // write n values to the buffer
        A[i%buffsize] = i;
      A[n%buffsize] = -1;   // write a "done" sentinel
    }
    
    proc consumer() {
      var pos = 0;
      do {
        const val = A[pos];     // read value
        if val != -1 then
          writeln("got: ", val);
        
        pos += 1;               // advance cursor
        pos %= buffsize;
      } while (val != -1);
    }

A more full-blown version of this pattern, which we use as a hands-on exercise 
for students (their goal is to add support for multiple producers and 
consumers on the same buffer), can be found here, with sample solutions in a 
subdirectory: 
https://github.com/chapel-lang/chapel/tree/master/test/exercises/boundedBuffer
