Hi Damian —

Sorry not to keep up with this thread lately. Last week was very busy and I'm still catching up.

I was probably unclear. What I meant by "wholesale" is that when the language is told "reduce this 2D array by columns" or "by rows", a partial reduction operator permits the implementation to have each task iterate over its local portion of the 2D array in whatever order it wants (e.g., a cache-friendly one) accumulating values into a local scratch vector before combining them with those of other tasks. So the "wholesale" was meant to reflect the idea that a partial reduction specifies the operation as a whole permitting the implementation to implement it all at once.

By contrast, typical workarounds when partial reductions are not available tend to do things like compute a complete reduction over a row-sized slice of the 2D array, one row at a time, which is a lot more constrained in terms of how it's computed. This is a big part of why we want partial reductions in the language. E.g., imagine something like:

        colVect = + reduce(...something specifying "across rows")
                      (Mat * rowVect);

as opposed to something like:

        forall r in rows do
          colVect[r] = + reduce (Mat[r, ..] * rowVect);

(where both of these are intended as rough sketches). The overheads of this second form for a shared-memory Mat are not great; for a distributed-memory Mat, they're even more embarrassing.

I'm surprised your serial version would still be 60% behind Fortran; it seems like we ought to be able to reach parity without too much trouble. Can you share the code?

-Brad




Brad Chamberlain  Principal Engineer
Cray, a Hewlett Packard Enterprise company
901 Fifth Ave, Suite 1000 | Seattle, WA 98164
+1-206-701-2077  [email protected]  www.cray.com


On Tue, 28 Jul 2020, Damian McGuckin wrote:

On Mon, 6 Jul 2020, Brad Chamberlain wrote:

I want to note that the pattern you're expressing here is what we'd call a partial reduction in Chapel; a long-intended but not-yet-completed feature, the motivation for it being that expressing a row- or column-wise reduction like this can be done much more efficiently when computed wholesale rather than a row or column at a time, particularly for distributed arrays. There's no good excuse for why we haven't implemented it yet other than a lack of users breathing down our neck over it.


I have a version which (in parallel) computes the elements of the result
vector using dot products but it is seriously affected by cache misses.
That great in parallel. But comparing the code in serial mode against a
Fortran program looks pretty bad.

I can replace this reduction by some serial code and I got it to within a 60% penalty over the Fortran. But then that is not parallelizable, so I cannot win.

Let me know when these 'partial reductions' work.  Not urgent.

I cannot figure out how to do that wholesale operation you mention in my case (without consuming inordinate amounts of temporary memory). The
operation is

        Z := (I - lambda * x * y^T) * Z

where Z is a matrix, x and y are vectors, and lambda is a reciprocal of a scalar.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to