On Mon, 6 Jul 2020, Brad Chamberlain wrote:
> I want to note that the pattern you're expressing here is what we'd call
> a partial reduction in Chapel; a long-intended but not-yet-completed
> feature, the motivation for it being that expressing a row- or
> column-wise reduction like this can be done much more efficiently when
> computed wholesale rather than a row or column at a time, particularly
> for distributed arrays. There's no good excuse for why we haven't
> implemented it yet other than a lack of users breathing down our neck
> over it.
I have a version which (in parallel) computes the elements of the result
vector using dot products but it is seriously affected by cache misses.
That is great in parallel. But running the same code in serial mode and
comparing it against a Fortran program looks pretty bad.
I can replace this reduction with some serial code, and that gets it to
within a 60% penalty over the Fortran. But then that is not parallelizable,
so I cannot win.
Let me know when these 'partial reductions' work. Not urgent.
I cannot figure out how to do that wholesale operation you mention in my
case (without consuming inordinate amounts of temporary memory). The
operation is
Z := (I - lambda * x * y^T) * Z
where Z is a matrix, x and y are vectors, and lambda is a reciprocal of a
scalar.
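For concreteness, the operation can be expanded as Z - lambda * x * (y^T Z),
so the only temporary needed is the vector w = y^T Z, whose length is the
number of columns of Z. What follows is a minimal serial sketch of that
expansion in Python rather than Chapel, with Z assumed to be stored as a list
of rows; the function name rank_one_update is hypothetical, and this is not
the partial-reduction feature discussed above.

```python
def rank_one_update(Z, x, y, lam):
    """In place: Z := (I - lam * x * y^T) * Z, with Z stored as a list of rows."""
    m = len(Z)
    n = len(Z[0]) if m else 0

    # w = y^T Z, accumulated one row of Z at a time so memory is
    # walked contiguously in a row-major layout.
    w = [0.0] * n
    for i in range(m):
        yi, row = y[i], Z[i]
        for j in range(n):
            w[j] += yi * row[j]

    # Z := Z - lam * x * w^T, again one row at a time.
    for i in range(m):
        s, row = lam * x[i], Z[i]
        for j in range(n):
            row[j] -= s * w[j]
    return Z
```

Both loops sweep Z row by row. The first one is the column-wise reduction in
question: run in parallel over rows, it would need a per-task copy of w that
is combined at the end.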
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers