Hi,

There exists a need to process the N columns of an array 'U' as

        for j in 1..N do
        {
                // treat column j as the anchor column

                for k in j+1 .. N do  // Stage 'j' processing
                {
                        // mess with operations on column 'j' and 'k'
                }
        }

The inner loops can be treated as independent of each other subject to a constraint, i.e. there is a need to guarantee (or somehow enforce) that

        when stage 'j' wants to process column 'k', it knows (or can
        check) that processing of that same column 'k' by stage(s) prior,
        i.e. 'j-1', is completed (or can twiddle its thumbs waiting for
        that to happen before proceeding).

That last statement is obviously recursive although it does not need to
be programmed as such.

One can program this (maybe poorly) as

        var stage : [0..N] atomic int;

        stage[0].write(N);
        // define that the 'ghost' precursor stage
        // has completed processing of all columns
        forall j in 1..N-1 do
        {
                // define that this stage has completed NO columns
                stage[j].write(0);
                // this routine must update stage[j] with 'k' when
                // it has finished its own processing of column 'k'
                processStage(j, U, ......);
        }

and then update stage[j] within

        processStage(j, U, ......)

to reflect the column 'last processed', say 'k', in that stage. This
then allows the next stage

        processStage(j+1, U, ......)

to inspect that variable, i.e. stage[j],  during its own operation to
ensure that it does not attempt to use any column 'i' where 'i > k'.

This approach involves waiting which is a big no-no and demands that any distributed implementation update what amounts to a variable in the primary locale. This is less than ideal although the overhead is yet to be quantified. There is an upside. Because one would probably process columns in blocks, say of 4 or 8 (or at a pinch 16), the apparent need to test the atomic variable every columns drops by that same blocking factor. So, while the waiting is not so critical (even if it is detrimental to the algorithm's readability), the atomic variables in the primary locale are still a worry.

Are there better ways to attack this problem?

And yes, if looking at parallel Jacobi SVD sweeps, there are algorithms that try and parallelize that logic very differently. But they do/did not have the benefit of Chapel at their disposal at the time there were being developed. And besides, they are quite complex, do even naughtier things to the readability of the algorithm. Avoiding them, would really, really,
be a desirable thing.

Thanks - Damian


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to