Hi Damian, Optimizing LinearAlgebra `dot`(specifically matrix-matrix multiplication) was part of my GSOC project this year. Engin had previously done some work in this direction for the Intel PRK test suite for Chapel and we based our implementation on it. The new `dot` has been merged with master and you can have a look at the merged PR here<https://github.com/chapel-lang/chapel/pull/15976>.
There's also a performance test<https://chapel-lang.org/perf/1-node-xc/?startdate=2020/07/24&enddate=2020/08/27&graphs=linearalgebradotperformance20482048matrix> for the native implementation of `dot` which shows similar performance against CrayBLAS gemm. Although, we haven't done a comparison for much larger matrices. Regards, Rahul -- Rahul Ghangas Advanced Computing (R&d) (Honours) The Australian National University Ph- +61 0426126128 Email - [email protected] , [email protected] ________________________________ From: Damian McGuckin <[email protected]> Sent: Friday, August 28, 2020 2:42 PM To: Chapel-Sourceforge <[email protected]> Subject: [Chapel-developers] Slicing is an optimal approach in Matrix Multiplication - GEMM Hi, Consider the follow trying to do matrix multiplication proc gemm2(t : [?tD] ?R, u : [?uD], v : [?vD] R, unroll : int) { const (rows, columns) = tD.dims(); const (_, common) = uD.dims(); const n = columns.last; const S = (1 << unroll) - 1; for c in columns by (S+1) do { const cfinal = if c + S > n then n else c + S; const cslice = c .. cfinal; var vslab : [cslice, common] R; // transpose those nasty columns into rows! [j in common] vslab[cslice, j] = v[j, cslice]; // rip through every single row // should this be blocked better? forall r in rows do { const ref ur = u[r, ..]; for j in cslice do { t[r, j] = vmDot(common, ur, vslab[j, ..]); } } } } where vmDot(r, x, y) returns the dot product of two 1D-arrays 'x' & 'y' across the index range 'r'. It works: But, on changing the last loop, this forall r in rows do { const ref ur = u[r, ..]; const x = [j in cslice] vmDot(common, ur, vslab[j, ..]); t[r, cslice] = x; } slows down slightly, about 5+%. On the other hand, the more concise forall r in rows do { const x = [j in cslice] vmDot(common, u[r, ..], vslab[j, ..]); t[r, cslice] = x; } slows down seriously, about 25+%. I am curious on what is going on here just so I know what is an optimal approach. Let me know if you want the test code. Has any research work been done on an optimal GEMM in Chapel and if so, is it been documented anyway. Regards - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer _______________________________________________ Chapel-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/chapel-developers
_______________________________________________ Chapel-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/chapel-developers
