Hi Brad,

On Tue, 28 Jul 2020, Brad Chamberlain wrote:

I'm surprised your serial version would still be 60% behind Fortran; it seems like we ought to be able to reach parity without too much trouble.

I originally re-wrote the non-functioning partial reduction code:

        const y = + reduce [r in rows] x[r] * z[r, ..];

As:

        var y = [1..columns] real = 0:real;

        for r in rows do
        {
                y += x[r] * z[r, ..];
        }

where y and x are vectors. It is a direct mapping. What's wrong with that?

Silly me, I decided. Let's use iterators. Now the code is:

        y = 0;
        for (r, c) in zD do // effectively called with ref z : [?zD] real
        {
                y[c] += x[r] * z[r, c];
        }

and the elapsed time dropped by 20%.  That is a serious drop.

So my original code has a penalty of 25%, or a performance hit of 125%.

Also, the Fortran version of the target SVD algorithm does things like

        do i=1,n
                a = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
                axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
        end do

To be fair and match what I to do in Chapel, I instead should save the
results of those dot products in an intermediate vector, t(*) as:

        do i=1,n
                t(i) = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
        end do
        do i=1,n
                axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
        end do

This runs a whole 20% Wow. Incredible! I fell off my chair!

So this Fortran technique of having the dot product in the same loop as
the vector addition has a performance gain of 1.0/0.8 = 125%.  Or forcing
the storage of the intermediate results has the penalty.

Anyway, multiply those two performance hits/gains and you have:

        125% * 125% = 156.25%

which, with rounding, is where my 60% extra came from.

This first expression which prefers iterators over indexing is technically a fundamental problem with Chapel but when there is time for the team to look at vectorization, this will disappear. Anyway, I can fix it today with iterators even if they do look a bit ugly. Besides: when partial reductions become available by Christmas 2021, this whole issue is moot. See the last line of my email.

The second is fascinating. I would never have guessed that an O(n) write to memory would have 20% the impact of O(n^2) floating point operations. It is not really resolvable, so I am stuck with 25%. This means something which should take (say) 10 seconds will take 12.5 seconds. But in the SVD algorithm in question, this problematic chunk of codeis always paired with a second operation which also takes 10 seconds or so, the equivalent Fortran of which has no penalty. So, in totality, there is just 12.5% penalty for the whole section of code. Anything under 15% I can handle.

I will put working partial reductions on my Santa wish-list for 2021 now!

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to