Near line 287 in hpl.chpl in the examples/benchmarks directory, we see
this implemented as
[ (i,j) in panel[k+1.., k+1..] ] Ab[i,j] -= Ab[i,k] * Ab[k,j];
This is a prime example of the common linear algebra case of the need to
compute something like the following:
B := (I + u * v^T) * B (A)
where 'B' is a Chapel m*n 2D-array/matrix stored in column-major order,
and 'u' and 'v' are 1D-arrays/vectors of size m and n respectively.
Going back to my case, I was implementing my Eq(A) using basically
const (rows, columns) = B.domain.dims()
// v has dimensions rows and v dimensions columns
[i in rows] B[i, columns] += u[i] * v[columns];
I was doing this because I thought it has best data locality so that from
the perspective of maximizing cache hits, would parallelize best. I am also
living in hope that when vectorization works well, the above will vectorize
and run faster.
I tried
[ (i, j) in B.domain [ B[i, j] += u[i] * v[j];
I notice that this form is 25% faster than the earlier form in serial mode
(which baffles me totally) and is consistently 5-7% faster in parallel
mode. That sort of margin in (at least) the parallelization case will be
lost if my form vectorizes well but it is interesting.
Is there any reason for that? The number of rows and columns is 1600 or
more.
Is there a better way to implement Eq(A) than either of the above.
P.S. Who is the hpl.chpl 'guru'?
Sadly my real underlying problem is a computation that is dominated by
// apply Givens rotations to consecutive columns of a matrix 'x'
// the Givens rotation is stored in 'g'
// R is some real(w) type and C is complex(2*w)
inline proc apply(ref z : [?zD] ?R, const ref g : [?gD] ?C)
{
const (rows, columns) = zD.dims();
const (cFirst, cLast) = (columns.first, columns.last);
const cExceptFirst = cFirst + 1 .. cLast;
forall r in rows do
{
ref zr = z[r, ..];
var zrj = zr[cFirst];
for i in cExceptFirst do // highly recursive
{
(zrj, zr[i - 1]) = (g[i] * cmplx(zr[i], zrj)).tuple;
}
zr[cLast] = zrj;
}
}
I wish I knew how to improve the speed of that.
As I am reliably informated that casts are anti-social, I use
cmplx(a, b)
to make a complex number from two reals instead of
(a, b):complex(numBits(a.type) << 1)
The .tuple method returns the 2 parts of a complex number as a tuple.
Neither of those little ways of programming affect the speed.
Thanks - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers