On Mon, Dec 24, 2018 at 4:56 PM Jed Brown <j...@jedbrown.org> wrote:

> Mark Adams via petsc-users <petsc-users@mcs.anl.gov> writes:
>
> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
> > the attached; NB, this is code that I wrote in grad school). It is
> > memory efficient and simple, just four nested loops i, j, I, J:
> > C(I,J) = P(i,I)*A(i,j)*P(j,J). Eyeballing the numbers, and from new
> > data that I am getting from my bone-modeling colleagues, who use this
> > old code on Stampede now, the times look reasonable compared to GAMG.
> > This is optimized for elasticity, where I unroll loops (so it is really
> > six nested loops).
>
> Is the A above meant to include some ghosted rows?
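The four-nested-loop product quoted above can be sketched densely; this is only an illustration of the index pattern (function and variable names are mine, and a real implementation would of course use sparse storage), not the SC 2004 code:

```c
#include <assert.h>

/* Dense sketch of the Galerkin triple product C = P^T A P, written as the
   four nested loops from the thread: C(I,J) += P(i,I) * A(i,j) * P(j,J).
   A is n x n (fine grid), P is n x m (prolongator), C is m x m (coarse). */
enum { n = 3, m = 2 };

static void triple_product(const double A[n][n], const double P[n][m],
                           double C[m][m])
{
    for (int I = 0; I < m; I++)
        for (int J = 0; J < m; J++)
            C[I][J] = 0.0;
    for (int i = 0; i < n; i++)          /* rows of A (and of P) */
        for (int j = 0; j < n; j++)      /* columns of A */
            for (int I = 0; I < m; I++)  /* columns of P, left term */
                for (int J = 0; J < m; J++)  /* columns of P, right term */
                    C[I][J] += P[i][I] * A[i][j] * P[j][J];
}
```

With A the identity this reduces to C = P^T P, which makes the result easy to check by hand.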
You could, but I was thinking of having i in the outer loop. In C(I,J) =
P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only cover the local rows
of A (and of the left term P).

> > In thinking about this now, I think you want to make a local copy of P
> > with rows (j) for every column in A that you have locally, then
> > transpose this local copy for the P(j,J) term. A sparse AXPY on j. (My
> > code uses a special tree data structure, but a matrix is simpler.)
>
> Why transpose for P(j,J)?

(Premature) optimization. I was thinking of 'j' being in the inner loop and
doing a sparse inner product, but now that I think about it there are other
options.
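For concreteness, the i-outer sparse variant discussed above might look like the sketch below, with both A and P in CSR form (struct and function names are my own, not PETSc's). It assumes row j of P is available locally for every column j of A, i.e. the "local copy of P" step has already been done; the coarse matrix C is kept dense only to keep the sketch short:

```c
#include <assert.h>

/* Minimal CSR view: row pointers, column indices, values. */
typedef struct {
    int nrows;
    const int *rowptr, *col;
    const double *val;
} CSR;

/* i-outer Galerkin product: for each local row i of A, scatter
   P(i,I)*A(i,j)*P(j,J) into C (m x m, row-major dense). */
static void rap_i_outer(const CSR *A, const CSR *P, int m, double *C)
{
    for (int k = 0; k < m * m; k++) C[k] = 0.0;
    for (int i = 0; i < A->nrows; i++)                              /* local rows */
        for (int ii = P->rowptr[i]; ii < P->rowptr[i+1]; ii++) {    /* P(i,I) */
            int I = P->col[ii];
            for (int jj = A->rowptr[i]; jj < A->rowptr[i+1]; jj++) {    /* A(i,j) */
                int j = A->col[jj];
                for (int kk = P->rowptr[j]; kk < P->rowptr[j+1]; kk++) { /* P(j,J) */
                    int J = P->col[kk];
                    C[I*m + J] += P->val[ii] * A->val[jj] * P->val[kk];
                }
            }
        }
}
```

Note that the inner pass over row j of P is a sparse AXPY into row I of C; nothing here needs the transpose of P, which is the point of the exchange above.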