Hi Brad,
On Tue, 28 Jul 2020, Brad Chamberlain wrote:
I'm surprised your serial version would still be 60% behind Fortran; it
seems like we ought to be able to reach parity without too much trouble.
I originally re-wrote the non-functioning partial reduction code:
const y = + reduce [r in rows] x[r] * z[r, ..];
As:
var y = [1..columns] real = 0:real;
for r in rows do
{
y += x[r] * z[r, ..];
}
where y and x are vectors. It is a direct mapping. What's wrong with that?
Silly me, I decided. Let's use iterators. Now the code is:
y = 0;
for (r, c) in zD do // effectively called with ref z : [?zD] real
{
y[c] += x[r] * z[r, c];
}
and the elapsed time dropped by 20%. That is a serious drop.
So my original code has a penalty of 25%, or a performance hit of 125%.
Also, the Fortran version of the target SVD algorithm does things like
do i=1,n
a = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
end do
To be fair and match what I to do in Chapel, I instead should save the
results of those dot products in an intermediate vector, t(*) as:
do i=1,n
t(i) = ddot(n, x, z(1, i)) // i.e. a = x.(column j of z(*,*))
end do
do i=1,n
axpy(n, a, x, z(1, i)) // (column j of z(*, *)) += a * x
end do
This runs a whole 20% Wow. Incredible! I fell off my chair!
So this Fortran technique of having the dot product in the same loop as
the vector addition has a performance gain of 1.0/0.8 = 125%. Or forcing
the storage of the intermediate results has the penalty.
Anyway, multiply those two performance hits/gains and you have:
125% * 125% = 156.25%
which, with rounding, is where my 60% extra came from.
This first expression which prefers iterators over indexing is technically
a fundamental problem with Chapel but when there is time for the team to
look at vectorization, this will disappear. Anyway, I can fix it today
with iterators even if they do look a bit ugly. Besides: when partial
reductions become available by Christmas 2021, this whole issue is moot.
See the last line of my email.
The second is fascinating. I would never have guessed that an O(n) write
to memory would have 20% the impact of O(n^2) floating point operations.
It is not really resolvable, so I am stuck with 25%. This means something
which should take (say) 10 seconds will take 12.5 seconds. But in the SVD
algorithm in question, this problematic chunk of codeis always paired with
a second operation which also takes 10 seconds or so, the equivalent
Fortran of which has no penalty. So, in totality, there is just 12.5%
penalty for the whole section of code. Anything under 15% I can handle.
I will put working partial reductions on my Santa wish-list for 2021 now!
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers