Re: [Chapel-developers] Interesting parallel performance of 'cobegin' block

Ferguson, Michael Paul Pratt (Chapel Developer) Mon, 29 Jun 2020 06:59:04 -0700

Hi -

> I would have thought the parallel overhead of a 'cobegin' would have
> been much less than that in a 'forall' loop

I'm not certain about what is happening in your specific case,
but here are some guesses:


* the forall approach might be more cache-friendly,
   because modern systems share cache among cores,
   whereas the cobegin is using different data
   in each task.

* the forall by default will try to detect whether there are
   already enough tasks running to keep the system busy and,
   if so, create fewer new ones. As a result, in a larger
   program, the `forall`s might have really low parallel
   overhead. You can pass `--dataParIgnoreRunningTasks=true`
   to your program to turn this detection off and experiment
   with it. In contrast, the cobegin will always create tasks,
   which can be bad for overhead when the tasks are small.

* the forall lets you use all 6 cores, but the cobegin creates
   only 2 tasks and so can use at most 2

* as far as the Chapel language is concerned, the forall
   indicates to the compiler that the body of the loop
   is order independent. This can aid in vectorization - but
   it should not matter with the C backend today and
   having good vectorization in this case is something I'm
   looking at improving with `--llvm` soon.
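
To make the second and third guesses concrete, here is a minimal
sketch of a forall nested inside a cobegin. It is not your code:
`scale`, `x`, `y` and `n` are made-up names, and I have not timed it.

```chapel
// Illustrative sketch only; 'scale', 'x', 'y' and 'n' are hypothetical names.
config const n = 1600;

proc scale(ref v: [] real, f: real) {
  // The forall decides at run time how many tasks to create. By default
  // it takes already-running tasks into account, so inside the cobegin
  // below it may create fewer tasks than it would at top level.
  forall e in v do
    e *= f;
}

var x: [1..n] real = 1.0;
var y: [1..n] real = 1.0;

// A cobegin always creates one task per statement: two tasks here,
// no matter how many cores are available.
cobegin with (ref x, ref y) {
  scale(x, 2.0);
  scale(y, 3.0);
}

writeln(x[1], " ", y[n]);
```

Running the compiled program with `--dataParIgnoreRunningTasks=true`
should make each forall size itself as if nothing else were running,
and `--dataParTasksPerLocale=<k>` sets the number of tasks a forall
aims for. Comparing timings across those settings is one way to see
which of the guesses above applies.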

Best,

-michael

     
    
    Given an LAPACK-like linear algebra routine to apply a planar rotation to
    two vectors
    
        proc rot(r : complex, ref acj : [?acjD] Real, ref aci : [?aciD] Real)
        {
                for (ai, aj) in zip(aci, acj) do
                {
                        const w = r * (ai, aj):complex;
    
                        (ai, aj) = (w.re, w.im);
                }
        }
    
    I am computing an SVD. Any vector size more than 2000 with my algorithm
    is probably suspect. That is cool because I do not need to go above 1000
    for the types of problem I am attacking. Within the algorithm, there are
    two independent tasks operating on two independent matrices 'v' and 'a'
    
        rot(tv, v[i..j, k - 1], v[i..j, k])
    and
        rot(ta, a[m..n, k - 1], a[m..n, k])
    
    which are the major numerical computations in the workload, a QR iteration.
    Neither of these two computations depends on the other.
    
    Note that the computations being done with 'rot'
    
    a)  use vectors which have stride > 1 as they work on rows of a matrix
        with column major ordering, and
    
    b)  have vectors from size 6 to 1600 for the problems at hand, which are
        largely test cases with well known solutions.
    
    So, using Chapel code like
    
        cobegin
        {
                rot(tv, v[i..j, k - 1], v[i..j, k]);
                rot(ta, a[m..n, k - 1], a[m..n, k]);
        }
    
    seemed like a reasonable idea to maybe halve the computation time,
    if the 2 'tasks' could run independently of each other.
    
    I only have 6 cores to exploit at this development stage. That normally
    gives me a good indication of what will or will not work in parallel.
    
    I left the 'for' loop in 'rot' alone because I did not want to have it
    trying to run in parallel and confusing the scheduler. I thought halving
    the run-time was more than a good performance gain.
    
    But it is slower than serial C for a vector size of 1600.
    
    So, let's throw out the cobegin, run these two calls to 'rot' one after
    another but let the routine 'rot' run its loop as a forall in parallel.
    
    And it is twice as fast as serial C for the same vector size.
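    
    For reference, the forall variant described above might look like the
    sketch below. It assumes 'Real' is an alias for real(64) (the alias is
    not shown in this post) and otherwise keeps the loop body unchanged.
    
    ```chapel
    // Sketch only: 'Real' is assumed to be real(64).
    type Real = real(64);

    proc rot(r : complex, ref acj : [] Real, ref aci : [] Real)
    {
            // Each (ai, aj) pair is rotated independently of the others,
            // so the iterations may run in any order and in parallel.
            forall (ai, aj) in zip(aci, acj) do
            {
                    const w = r * (ai, aj):complex;

                    (ai, aj) = (w.re, w.im);
            }
    }
    ```
    
    The two calls to 'rot' then run one after the other, each spreading
    its own loop over the available cores.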
    
    Nice!
    
    On just 6 cores, with a vector length of 1600, the implementation of SVD
    running in parallel is twice the speed of a serial C implementation with
    very little work. And the algorithm is still very obvious and readable.
    The clean code appears to parallelize nicely, but for reasons I do
    not fully comprehend.
    
    I would have thought the parallel overhead of a 'cobegin' would have
    been much less than that in a 'forall' loop (over what in my cases is
    1600 complex numbers being read out of cache, multiplied together, and
    then copied back into cache memory).
    
    Is my thinking wrong?
    
    My data is still living in cache, but only just.
    
    I am using 1.20.0 for the moment, not the latest.
    
    Thanks - Damian
    
    Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
    Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
    Views & opinions here are mine and not those of any past or present employer
    
    
    _______________________________________________
    Chapel-developers mailing list
    [email protected]
    https://lists.sourceforge.net/lists/listinfo/chapel-developers 
    


