(resending to the mailing list)
On 9/2/20, 2:28 PM, "Albrecht, Ben" <[email protected]> wrote:
Hi Damian,
I don’t have much to add on improving this algorithm implementation.
At a high level, the load imbalance and data dependence across parallel
iterations makes me wonder if a lower-level recursive task-parallel
implementation would be more suitable for this algorithm.
For example, say task 0 processes column pairs (1..N, 2) serially. After
completing (2,2), task 0 create a new parallel task (task 1) to process the
pairs (2..N, 3). After task 1 processes (3,3), it creates a new parallel
task
to process (3..N, 4), and so on.
In the task-parallel representation, you are creating the tasks as needed,
rather than spinning them all up at once and having them wait for work.
However, this may require some effort to reach desired performance. For
example, a naive implementation of this representation would create N tasks,
rather than creating a number tasks appropriate for your hardware like the
forall approach does.
See https://chapel-lang.org/docs/master/primers/taskParallel.html and
https://chapel-lang.org/docs/master/users-guide/index.html#task-parallelism for
more background on task parallelism in Chapel
Hope that helps.
Thanks,
Ben
On 8/29/20, 12:10 AM, "Damian McGuckin" <[email protected]> wrote:
Hi,
There exists a need to process the N columns of an array 'U' as
for j in 1..N do
{
// treat column j as the anchor column
for k in j+1 .. N do // Stage 'j' processing
{
// mess with operations on column 'j' and 'k'
}
}
The inner loops can be treated as independent of each other subject to
a
constraint, i.e. there is a need to guarantee (or somehow enforce) that
when stage 'j' wants to process column 'k', it knows (or can
check) that processing of that same column 'k' by stage(s)
prior,
i.e. 'j-1', is completed (or can twiddle its thumbs waiting for
that to happen before proceeding).
That last statement is obviously recursive although it does not need to
be programmed as such.
One can program this (maybe poorly) as
var stage : [0..N] atomic int;
stage[0].write(N);
// define that the 'ghost' precursor stage
// has completed processing of all columns
forall j in 1..N-1 do
{
// define that this stage has completed NO columns
stage[j].write(0);
// this routine must update stage[j] with 'k' when
// it has finished its own processing of column 'k'
processStage(j, U, ......);
}
and then update stage[j] within
processStage(j, U, ......)
to reflect the column 'last processed', say 'k', in that stage. This
then allows the next stage
processStage(j+1, U, ......)
to inspect that variable, i.e. stage[j], during its own operation to
ensure that it does not attempt to use any column 'i' where 'i > k'.
This approach involves waiting which is a big no-no and demands that
any
distributed implementation update what amounts to a variable in the
primary locale. This is less than ideal although the overhead is yet
to
be quantified. There is an upside. Because one would probably process
columns in blocks, say of 4 or 8 (or at a pinch 16), the apparent need
to
test the atomic variable every columns drops by that same blocking
factor.
So, while the waiting is not so critical (even if it is detrimental to
the
algorithm's readability), the atomic variables in the primary locale
are
still a worry.
Are there better ways to attack this problem?
And yes, if looking at parallel Jacobi SVD sweeps, there are algorithms
that try and parallelize that logic very differently. But they do/did
not
have the benefit of Chapel at their disposal at the time there were
being
developed. And besides, they are quite complex, do even naughtier
things
to the readability of the algorithm. Avoiding them, would really,
really,
be a desirable thing.
Thanks - Damian
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers