Mark Adams <mfad...@lbl.gov> writes:

> On Mon, Dec 24, 2018 at 4:56 PM Jed Brown <j...@jedbrown.org> wrote:
>
>> Mark Adams via petsc-users <petsc-users@mcs.anl.gov> writes:
>>
>> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private
>> > in the attachment; NB, this is code that I wrote in grad school). It
>> > is memory efficient and simple: just four nested loops i,j,I,J,
>> > computing C(I,J) = P(i,I)*A(i,j)*P(j,J). Eyeballing the numbers, and
>> > from new data that I am getting from my bone modeling colleagues, who
>> > use this old code on Stampede now, the times look reasonable compared
>> > to GAMG. This is optimized for elasticity, where I unroll loops (so
>> > it is really six nested loops).
>>
>> Is the A above meant to include some ghosted rows?
>
> You could, but I was thinking of having i in the outer loop. In C(I,J) =
> P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only cover the local
> rows of A (and of the left term P).
Okay, so you need to gather those rows of P referenced by the off-diagonal
parts of A. Once you have them, do

    for i:
      v[:] = 0  # sparse vector
      for j:
        v[:] += A[i,j] * P[j,:]
      for I:
        C[I,:] += P[i,I] * v[:]

One inefficiency is that you don't actually get "hits" on all the entries
of C[I,:], but that much remains no matter how you reorder the loops
(unless you make I the outermost).

>> > In thinking about this now, I think you want to make a local copy of
>> > P with rows (j) for every column in A that you have locally, then
>> > transpose this local thing for the P(j,J) term. A sparse AXPY on j.
>> > (My code uses a special tree data structure, but a matrix is simpler.)
>>
>> Why transpose for P(j,J)?
>
> (Premature) optimization. I was thinking of 'j' being in the inner loop
> and doing a sparse inner product, but now that I think about it, there
> are other options.

Sparse inner products tend to be quite inefficient. Explicit blocking
helps some, but I would try to avoid it.
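
For concreteness, here is a rough serial sketch of that accumulator loop
in Python (my illustration, not the code from Mark's paper; scipy.sparse
stands in for PETSc matrices, and the parallel gather of off-process rows
of P is elided):

    import scipy.sparse as sp

    def ptap(A, P):
        """C = P^T * A * P, one local row i of A at a time."""
        A, P = A.tocsr(), P.tocsr()
        nc = P.shape[1]               # coarse dimension
        C = sp.lil_matrix((nc, nc))   # mutable storage for the result
        for i in range(A.shape[0]):
            v = {}  # sparse accumulator: v[:] = sum_j A[i,j] * P[j,:]
            for jj in range(A.indptr[i], A.indptr[i+1]):
                j, a_ij = A.indices[jj], A.data[jj]
                for kk in range(P.indptr[j], P.indptr[j+1]):  # sparse AXPY
                    J = P.indices[kk]
                    v[J] = v.get(J, 0.0) + a_ij * P.data[kk]
            # scatter: C[I,:] += P[i,I] * v[:] over the pattern of P[i,:]
            for kk in range(P.indptr[i], P.indptr[i+1]):
                I, p_iI = P.indices[kk], P.data[kk]
                for J, val in v.items():
                    C[I, J] += p_iI * val
        return C.tocsr()

A quick check against the library product:

    A = sp.random(200, 200, density=0.02, format="csr")
    P = sp.random(200, 40, density=0.1, format="csr")
    assert abs(ptap(A, P) - P.T @ A @ P).max() < 1e-12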
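
And to make that last point concrete: with 'j' in the innermost loop,
each C(I,J) entry becomes a dot product of two sparse vectors, and the
pattern-intersection merge burns time even on misses. A minimal sketch
of that merge (a hypothetical helper, just to show where the work goes):

    def sparse_dot(xi, xv, yi, yv):
        """Dot product of two sparse vectors, each given as sorted
        index and value arrays."""
        s, a, b = 0.0, 0, 0
        while a < len(xi) and b < len(yi):
            if xi[a] == yi[b]:        # hit: indices match
                s += xv[a] * yv[b]
                a += 1
                b += 1
            elif xi[a] < yi[b]:
                a += 1                # miss: comparison work, no payoff
            else:
                b += 1                # miss
        return s

The AXPY formulation above only ever touches stored entries that
contribute to the result, which is why I would prefer it.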