(resending to the mailing list)

On 9/2/20, 2:28 PM, "Albrecht, Ben" <[email protected]> wrote:

    Hi Damian,

    I don’t have much to add on improving this algorithm implementation.
    At a high level, the load imbalance and data dependence across parallel
    iterations makes me wonder if a lower-level recursive task-parallel
    implementation would be more suitable for this algorithm.

    For example, say task 0 processes column pairs (1..N, 2) serially. After
    completing (2,2), task 0 create a new parallel task (task 1) to process the
    pairs (2..N, 3). After task 1 processes (3,3), it creates a new parallel 
task
    to process (3..N, 4), and so on.

    In the task-parallel representation, you are creating the tasks as needed,
    rather than spinning them all up at once and having them wait for work.
    However, this may require some effort to reach desired performance. For
    example, a naive implementation of this representation would create N tasks,
    rather than creating a number tasks appropriate for your hardware like the
    forall approach does.

    See https://chapel-lang.org/docs/master/primers/taskParallel.html and 
https://chapel-lang.org/docs/master/users-guide/index.html#task-parallelism for 
more background on task parallelism in Chapel

    Hope that helps.

    Thanks,
    Ben


    On 8/29/20, 12:10 AM, "Damian McGuckin" <[email protected]> wrote:


        Hi,

        There exists a need to process the N columns of an array 'U' as

                for j in 1..N do
                {
                        // treat column j as the anchor column

                        for k in j+1 .. N do  // Stage 'j' processing
                        {
                                // mess with operations on column 'j' and 'k'
                        }
                }

        The inner loops can be treated as independent of each other subject to 
a 
        constraint, i.e. there is a need to guarantee (or somehow enforce) that

                when stage 'j' wants to process column 'k', it knows (or can
                check) that processing of that same column 'k' by stage(s) 
prior,
                i.e. 'j-1', is completed (or can twiddle its thumbs waiting for
                that to happen before proceeding).

        That last statement is obviously recursive although it does not need to
        be programmed as such.

        One can program this (maybe poorly) as

                var stage : [0..N] atomic int;

                stage[0].write(N);
                // define that the 'ghost' precursor stage
                // has completed processing of all columns
                forall j in 1..N-1 do
                {
                        // define that this stage has completed NO columns
                        stage[j].write(0);
                        // this routine must update stage[j] with 'k' when
                        // it has finished its own processing of column 'k'
                        processStage(j, U, ......);
                }

        and then update stage[j] within

                processStage(j, U, ......)

        to reflect the column 'last processed', say 'k', in that stage. This
        then allows the next stage

                processStage(j+1, U, ......)

        to inspect that variable, i.e. stage[j],  during its own operation to
        ensure that it does not attempt to use any column 'i' where 'i > k'.

        This approach involves waiting which is a big no-no and demands that 
any 
        distributed implementation update what amounts to a variable in the 
        primary locale.  This is less than ideal although the overhead is yet 
to 
        be quantified. There is an upside. Because one would probably process 
        columns in blocks, say of 4 or 8 (or at a pinch 16), the apparent need 
to 
        test the atomic variable every columns drops by that same blocking 
factor. 
        So, while the waiting is not so critical (even if it is detrimental to 
the 
        algorithm's readability), the atomic variables in the primary locale 
are 
        still a worry.

        Are there better ways to attack this problem?

        And yes, if looking at parallel Jacobi SVD sweeps, there are algorithms 
        that try and parallelize that logic very differently. But they do/did 
not 
        have the benefit of Chapel at their disposal at the time there were 
being 
        developed. And besides, they are quite complex, do even naughtier 
things 
        to the readability of the algorithm. Avoiding them, would really, 
really,
        be a desirable thing.

        Thanks - Damian


        _______________________________________________
        Chapel-developers mailing list
        [email protected]
        https://lists.sourceforge.net/lists/listinfo/chapel-developers 



_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to