Hi Damian —
Sounds like you've largely sorted this out, but just to touch on a few
points:
As:
var y = [1..columns] real = 0:real;
for r in rows do
{
y += x[r] * z[r, ..];
}
where y and x are vectors. It is a direct mapping. What's wrong with that?
Modulo the row- vs. column-major memory issues you eventually arrived at,
Chapel's support for rank-change slicing is currently known to be fairly
heavyweight, so the expression 'z[r, ..]' is, unfortunately, not cheap at
present. I expect that this accounts for a fair amount of the overhead.
Silly me, I decided. Let's use iterators. Now the code is:
y = 0;
for (r, c) in zD do // effectively called with ref z : [?zD] real
{
y[c] += x[r] * z[r, c];
}
This one seems to me to be fairly well-written serial Chapel code in that
the 2D iteration over zD will do it in row-major order, which Chapel
likes, and you'll have good re-use on one of your vectors (x) and multiple
sweeps over your other (y). If you wanted to push this further, it'd be
interesting to break the loop into pure scalar loops (not that we want
people to do this forever) and see whether tucking that repeated read of
x[r] into a scalar helps. That is:
for r in rows {
const xr = x[r];
for c in cols {
y[c] += xr * z[r,c];
}
}
This is the sort of thing that one would really like the Chapel and/or C
compilers to do automatically given the previouos loop, but I don't happen
to know how successful or not we tend to be at optimizing it (and it could
also depend on the choice and version of C compiler, obviously).
Note that none of this is meant to suggest that this is the preferred way
to write it in Chapel; writing it the cleaner ways (which is how we'd like
to see it written) and then embarrassing us into tuning the Chapel code is
completely valid as well. And of course partial reductions ought to end
up being both super-clean and efficient. But I am curious how close we
can get to Fortran.
You mention the vectorization as well, and that's something we're
continuing to look into in the context of the LLVM back-end.
The second is fascinating. I would never have guessed that an O(n) write to
memory would have 20% the impact of O(n^2) floating point operations.
Welcome to the exciting world of modern architectures and cost models! (?)
I will put working partial reductions on my Santa wish-list for 2021 now!
Sounds good.
-Brad
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers