Brad,
Do not put this high on your agenda. Thanks for raising the issue on
GitHub.
I can attack the problem the current way. For my problems, the impact is
probably the theoretically expected 10%-15% performance penalty caused by
cache misses, because the algorithm reads and writes a matrix by columns
(rather than by rows).
I will document the issue in my report on this project, note a projected
10%-15% performance improvement for the future, express lots of hope, and
move on.
I can show that I get a 10% performance improvement on a 6-core machine
with Chapel 1.20 if I avoid the reduce for the O(n) operation,
z^T = ( ( b^T * B ) / g ) / h
and then access memory by rows in the second loop for the O(n^2) operation,
B += b * z^T
So Chapel does give me the speed gain I predict.
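A minimal sketch of those two loops, written in Python for illustration and assuming B is an n*n matrix stored flat in row-major order (element (i,j) at index i*n + j), with b, g and h as in the formulas above. The names compute_z and rank1_update are hypothetical. Instead of one reduce down each column of B, z accumulates b[i] times row i, so both loops walk along rows. Python will not reproduce the cache effect itself; the sketch only shows the loop order.

```python
def compute_z(B, b, n, g, h):
    """Return z with z^T = ((b^T * B) / g) / h, walking B by rows.

    Assumption: B is a flat row-major list, element (i, j) is B[i * n + j].
    Rather than one reduction per column j (sum over i of b[i] * B[i][j],
    which strides down a column), add b[i] * (row i of B) into z for each i.
    """
    z = [0.0] * n
    for i in range(n):
        bi = b[i]
        base = i * n
        for j in range(n):          # contiguous walk along row i
            z[j] += bi * B[base + j]
    return [(zj / g) / h for zj in z]


def rank1_update(B, b, z, n):
    """In-place rank-1 update B += b * z^T, visiting B row by row."""
    for i in range(n):
        bi = b[i]
        base = i * n
        for j in range(n):          # contiguous walk along row i
            B[base + j] += bi * z[j]


if __name__ == "__main__":
    n = 2
    B = [1.0, 2.0,
         3.0, 4.0]
    b = [1.0, 1.0]
    z = compute_z(B, b, n, g=1.0, h=2.0)
    rank1_update(B, b, z, n)
    print(z, B)  # prints [2.0, 3.0] [3.0, 5.0, 5.0, 7.0]
```

In a compiled language with row-major arrays, the inner j loop in both functions touches consecutive memory, which is the access pattern the cache-miss argument above relies on.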
The worry is that I get a 1000% performance drop on a 36-core machine with
1.22. The test case is a Singular Value Decomposition of a 3200*3200 matrix.
I will document it all more fully at the end of next week, but I do not
really have the insight or experience in Chapel (or the mix of systems
with old and new releases to do testing) to assist much with solving the
problem.
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer