Re: [Chapel-developers] Reduce bug in Chapel 1.22

Damian McGuckin Wed, 29 Jul 2020 23:24:17 -0700


Hi Brad,

On Tue, 28 Jul 2020, Brad Chamberlain wrote:

I'm surprised your serial version would still be 60% behind Fortran; itseems like we ought to be able to reach parity without too much trouble.


I originally re-wrote the non-functioning partial reduction code:

        const y = + reduce [r in rows] x[r] * z[r, ..];

As:

        var y = [1..columns] real = 0:real;

        for r in rows do
        {
                y += x[r] * z[r, ..];
        }

where y and x are vectors. It is a direct mapping. What's wrong with that?

Silly me, I decided. Let's use iterators. Now the code is:

        y = 0;
        for (r, c) in zD do // effectively called with ref z : [?zD] real
        {
                y[c] += x[r] * z[r, c];
        }

and the elapsed time dropped by 20%.  That is a serious drop.

So my original code has a penalty of 25%, or a performance hit of 125%.

Also, the Fortran version of the target SVD algorithm does things like

        do i=1,n
                a = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
                axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
        end do

To be fair and match what I to do in Chapel, I instead should save the
results of those dot products in an intermediate vector, t(*) as:

        do i=1,n
                t(i) = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
        end do
        do i=1,n
                axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
        end do

This runs a whole 20% Wow. Incredible! I fell off my chair!

So this Fortran technique of having the dot product in the same loop as
the vector addition has a performance gain of 1.0/0.8 = 125%.  Or forcing
the storage of the intermediate results has the penalty.

Anyway, multiply those two performance hits/gains and you have:

        125% * 125% = 156.25%

which, with rounding, is where my 60% extra came from.

This first expression which prefers iterators over indexing is technicallya fundamental problem with Chapel but when there is time for the team tolook at vectorization, this will disappear. Anyway, I can fix it todaywith iterators even if they do look a bit ugly. Besides: when partialreductions become available by Christmas 2021, this whole issue is moot.See the last line of my email.

The second is fascinating. I would never have guessed that an O(n) writeto memory would have 20% the impact of O(n^2) floating point operations.It is not really resolvable, so I am stuck with 25%. This means somethingwhich should take (say) 10 seconds will take 12.5 seconds. But in the SVDalgorithm in question, this problematic chunk of codeis always paired witha second operation which also takes 10 seconds or so, the equivalentFortran of which has no penalty. So, in totality, there is just 12.5%penalty for the whole section of code. Anything under 15% I can handle.


I will put working partial reductions on my Santa wish-list for 2021 now!

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Re: [Chapel-developers] Reduce bug in Chapel 1.22

Reply via email to