Hi Damian,

Optimizing LinearAlgebra `dot`(specifically matrix-matrix multiplication) was 
part of my GSOC project this year. Engin had previously done some work in this 
direction for the Intel PRK test suite for Chapel and we based our 
implementation on it. The new `dot` has been merged with master and you can 
have a look at the merged PR 
here<https://github.com/chapel-lang/chapel/pull/15976>.

There's also a performance 
test<https://chapel-lang.org/perf/1-node-xc/?startdate=2020/07/24&enddate=2020/08/27&graphs=linearalgebradotperformance20482048matrix>
 for the native implementation of `dot` which shows similar performance against 
CrayBLAS gemm. Although, we haven't done a comparison for much larger matrices.

Regards,
Rahul

--

Rahul Ghangas

Advanced Computing (R&d) (Honours)

The Australian National University

Ph- +61 0426126128

Email - [email protected] , [email protected]

________________________________
From: Damian McGuckin <[email protected]>
Sent: Friday, August 28, 2020 2:42 PM
To: Chapel-Sourceforge <[email protected]>
Subject: [Chapel-developers] Slicing is an optimal approach in Matrix 
Multiplication - GEMM


Hi,

Consider the follow trying to do matrix multiplication

proc gemm2(t : [?tD] ?R, u : [?uD], v : [?vD] R, unroll : int)
{
     const (rows, columns) = tD.dims();
     const (_, common) = uD.dims();
     const n = columns.last;
     const S = (1 << unroll) - 1;

     for c in columns by (S+1) do
     {
         const cfinal = if c + S > n then n else c + S;
         const cslice = c .. cfinal;
         var vslab : [cslice, common] R;

         // transpose those nasty columns into rows!

         [j in common] vslab[cslice, j] = v[j, cslice];

         // rip through every single row
         // should this be blocked better?

         forall r in rows do
         {
             const ref ur = u[r, ..];

             for j in cslice do
             {
                 t[r, j] = vmDot(common, ur, vslab[j, ..]);
             }
         }
     } }

where vmDot(r, x, y) returns the dot product of two 1D-arrays 'x' & 'y'
across the index range 'r'.

It works:

But, on changing the last loop, this

         forall r in rows do
         {
             const ref ur = u[r, ..];
             const x = [j in cslice] vmDot(common, ur, vslab[j, ..]);

             t[r, cslice] = x;
         }

slows down slightly, about 5+%. On the other hand, the more concise

         forall r in rows do
         {
             const x = [j in cslice] vmDot(common, u[r, ..], vslab[j, ..]);

             t[r, cslice] = x;
         }

slows down seriously, about 25+%.

I am curious on what is going on here just so I know what is an optimal
approach.

Let me know if you want the test code.

Has any research work been done on an optimal GEMM in Chapel and if so, is
it been documented anyway.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to