Hi Damian —

Sounds like you've largely sorted this out, but just to touch on a few points:

As:

        var y = [1..columns] real = 0:real;

        for r in rows do
        {
                y += x[r] * z[r, ..];
        }

where y and x are vectors. It is a direct mapping. What's wrong with that?

Modulo the row- vs. column-major memory issues you eventually arrived at, Chapel's support for rank-change slicing is currently known to be fairly heavyweight, so the expression 'z[r, ..]' is, unfortunately, not cheap at present. I expect that this accounts for a fair amount of the overhead.


Silly me, I decided. Let's use iterators. Now the code is:

        y = 0;
        for (r, c) in zD do // effectively called with ref z : [?zD] real
        {
                y[c] += x[r] * z[r, c];
        }

This one seems to me to be fairly well-written serial Chapel code in that the 2D iteration over zD will do it in row-major order, which Chapel likes, and you'll have good re-use on one of your vectors (x) and multiple sweeps over your other (y). If you wanted to push this further, it'd be
interesting to break the loop into pure scalar loops (not that we want
people to do this forever) and see whether tucking that repeated read of
x[r] into a scalar helps.  That is:

        for r in rows {
          const xr = x[r];
          for c in cols {
            y[c] += xr * z[r,c];
          }
        }

This is the sort of thing that one would really like the Chapel and/or C compilers to do automatically given the previouos loop, but I don't happen to know how successful or not we tend to be at optimizing it (and it could also depend on the choice and version of C compiler, obviously).

Note that none of this is meant to suggest that this is the preferred way to write it in Chapel; writing it the cleaner ways (which is how we'd like to see it written) and then embarrassing us into tuning the Chapel code is completely valid as well. And of course partial reductions ought to end up being both super-clean and efficient. But I am curious how close we
can get to Fortran.

You mention the vectorization as well, and that's something we're continuing to look into in the context of the LLVM back-end.


The second is fascinating. I would never have guessed that an O(n) write to memory would have 20% the impact of O(n^2) floating point operations.

Welcome to the exciting world of modern architectures and cost models! (?)


I will put working partial reductions on my Santa wish-list for 2021 now!

Sounds good.

-Brad
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to