Hi Damian —
Sorry not to keep up with this thread lately. Last week was very busy and
I'm still catching up.
I was probably unclear. What I meant by "wholesale" is that when the
language is told "reduce this 2D array by columns" or "by rows", a partial
reduction operator permits the implementation to have each task iterate
over its local portion of the 2D array in whatever order it wants (e.g., a
cache-friendly one) accumulating values into a local scratch vector before
combining them with those of other tasks. So the "wholesale" was meant to
reflect the idea that a partial reduction specifies the operation as a
whole permitting the implementation to implement it all at once.
By contrast, typical workarounds when partial reductions are not available
tend to do things like compute a complete reduction over a row-sized slice
of the 2D array, one row at a time, which is a lot more constrained in
terms of how it's computed. This is a big part of why we want partial
reductions in the language. E.g., imagine something like:
colVect = + reduce(...something specifying "across rows")
(Mat * rowVect);
as opposed to something like:
forall r in rows do
colVect[r] = + reduce (Mat[r, ..] * rowVect);
(where both of these are intended as rough sketches). The overheads of
this second form for a shared-memory Mat are not great; for a
distributed-memory Mat, they're even more embarrassing.
I'm surprised your serial version would still be 60% behind Fortran; it
seems like we ought to be able to reach parity without too much trouble.
Can you share the code?
-Brad
Brad Chamberlain Principal Engineer
Cray, a Hewlett Packard Enterprise company
901 Fifth Ave, Suite 1000 | Seattle, WA 98164
+1-206-701-2077 [email protected] www.cray.com
On Tue, 28 Jul 2020, Damian McGuckin wrote:
On Mon, 6 Jul 2020, Brad Chamberlain wrote:
I want to note that the pattern you're expressing here is what we'd call a
partial reduction in Chapel; a long-intended but not-yet-completed feature,
the motivation for it being that expressing a row- or column-wise reduction
like this can be done much more efficiently when computed wholesale rather
than a row or column at a time, particularly for distributed arrays.
There's no good excuse for why we haven't implemented it yet other than a
lack of users breathing down our neck over it.
I have a version which (in parallel) computes the elements of the result
vector using dot products but it is seriously affected by cache misses.
That great in parallel. But comparing the code in serial mode against a
Fortran program looks pretty bad.
I can replace this reduction by some serial code and I got it to within a 60%
penalty over the Fortran. But then that is not parallelizable, so I cannot
win.
Let me know when these 'partial reductions' work. Not urgent.
I cannot figure out how to do that wholesale operation you mention in my case
(without consuming inordinate amounts of temporary memory). The
operation is
Z := (I - lambda * x * y^T) * Z
where Z is a matrix, x and y are vectors, and lambda is a reciprocal of a
scalar.
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers