Consider the following:

        var qr : [1..n] real;
        var u : [1..m, 1..n] real;

Let's assume that each row of 'u' must be processed, either 1 at a time, or in batches of 2 (with an row remaining to be processed alone) or in batches of 4 (with one, two or three rows remaining to be processed by themselves). That just an 'unrolling' where the processing in batches
of 4 will be done in parallel.

An example of such processing is the application of Given's planar rotations to columns j..k of 'u' above. This could be done as:

        // do Givens Rotations to all 'm' rows of the columns j..k, in
        // 'vmRot' doing this processing in blocks of 4 rows at a time

        const jk = j .. k;
        var i = 1;

        // When 'm' is not a multiple of 4, there are 3 (1+2) or 2 or 1
        // rows that remain to be processed separately - do these all up
        // front, i.e. 1..1 & 2..3 or just 1..2 or maybe even just 1..1
        // where this little things are done serially one after another

        for r in 1..2 do if (m & r) != 0 then
        {
                const rows = i .. i + r - 1;

                vmRot(q[jk], u[rows, jk]);
                i += r;
        }

        // do the rest in parallel
        {
                const rows = i .. m;

                [r in rows by 4] vmRot(q[jk], u[r..r+3, jk]);
        }

Why do things with Slices
=========================

The use of array slicing self-documents the code that sub-matrices are
being used.  But sometimes, there is a need for speed and we throw out
the nicer ways of expressing things.

Making things go faster
=======================

It was determined that an alternative solution

        vmRot(jk, q, u[r..r+3, ..]);

is measurably faster than the above but does not convey that vmRot works
on the submatrix [r..r+3, j..k]. Considering the floating point operation
count and number of memory accesses in the benchmark using a matrix with
sometimes 2000 columns, the fact that it is measurable is a bit scary.

Should we give up on slicing altogether and do the above as

        [r in rows by 4] vmRot(r..r+3, jk, q, u);

where vmRot must changed to handle submatrices directly. This avoids any slicing.

But anyway, progressing:

What is the overhead in passing

        u[r..r+3, jk]

and is it the same overhead as

        u[r..r+3, ..]

Curious minds would like to know.

Regards - Damian

P.S. I have a sense of Deja-Vu here. My 2nd post to the list in 2011 was about this topic but in a whole different context,

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to