Hi,
There exists a need to process the N columns of an array 'U' as
for j in 1..N do
{
// treat column j as the anchor column
for k in j+1 .. N do // Stage 'j' processing
{
// mess with operations on column 'j' and 'k'
}
}
The inner loops can be treated as independent of each other subject to a
constraint, i.e. there is a need to guarantee (or somehow enforce) that
when stage 'j' wants to process column 'k', it knows (or can
check) that processing of that same column 'k' by stage(s) prior,
i.e. 'j-1', is completed (or can twiddle its thumbs waiting for
that to happen before proceeding).
That last statement is obviously recursive although it does not need to
be programmed as such.
One can program this (maybe poorly) as
var stage : [0..N] atomic int;
stage[0].write(N);
// define that the 'ghost' precursor stage
// has completed processing of all columns
forall j in 1..N-1 do
{
// define that this stage has completed NO columns
stage[j].write(0);
// this routine must update stage[j] with 'k' when
// it has finished its own processing of column 'k'
processStage(j, U, ......);
}
and then update stage[j] within
processStage(j, U, ......)
to reflect the column 'last processed', say 'k', in that stage. This
then allows the next stage
processStage(j+1, U, ......)
to inspect that variable, i.e. stage[j], during its own operation to
ensure that it does not attempt to use any column 'i' where 'i > k'.
This approach involves waiting which is a big no-no and demands that any
distributed implementation update what amounts to a variable in the
primary locale. This is less than ideal although the overhead is yet to
be quantified. There is an upside. Because one would probably process
columns in blocks, say of 4 or 8 (or at a pinch 16), the apparent need to
test the atomic variable every columns drops by that same blocking factor.
So, while the waiting is not so critical (even if it is detrimental to the
algorithm's readability), the atomic variables in the primary locale are
still a worry.
Are there better ways to attack this problem?
And yes, if looking at parallel Jacobi SVD sweeps, there are algorithms
that try and parallelize that logic very differently. But they do/did not
have the benefit of Chapel at their disposal at the time there were being
developed. And besides, they are quite complex, do even naughtier things
to the readability of the algorithm. Avoiding them, would really, really,
be a desirable thing.
Thanks - Damian
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers