This code looks funny in MatSOR_SeqAIJ:

   x[i] = (1-omega)*x[i] + sum*idiag[i];

shouldn't this be:

   x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
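For reference, the textbook SOR update is x_i <- (1-omega)*x_i + omega*(b_i - sum_{j != i} a_ij*x_j)/a_ii, so whether an omega is really missing depends on what idiag[] holds: if idiag[i] is 1/a_ii then yes, but if idiag[i] already carries the omega (i.e. omega/a_ii) the line is consistent as written. The nonzero-initial-guess branch quoted further down, (sum + mdiag[i]*x[i])*idiag[i], has the same property. A minimal sketch of the textbook forward sweep on plain CSR, with made-up array names (ai/aj/aa, diag_idx), just to fix notation -- this is not PETSc code:

static void sor_forward_sweep_ref(int m, const int *ai, const int *aj,
                                  const double *aa, const int *diag_idx,
                                  const double *b, double omega, double *x)
{
  /* Textbook forward SOR sweep on a row-sorted CSR matrix; reference only.
     diag_idx[i] is the position of a_ii inside aa/aj. */
  for (int i = 0; i < m; i++) {
    double sum = b[i];
    for (int k = ai[i]; k < ai[i+1]; k++) {
      if (aj[k] != i) sum -= aa[k]*x[aj[k]];   /* b_i - sum_{j != i} a_ij x_j */
    }
    /* the relaxation factor multiplies the whole Gauss-Seidel correction */
    x[i] = (1.0 - omega)*x[i] + omega*sum/aa[diag_idx[i]];
  }
}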
I've also done the first optimization (for the non-blocked version) and get this (running each many times because my Mac is a little noisy). Flop rates go down a bit and total flops go down a lot (perhaps I have a flop counting bug?). 6 old runs, then 7 new ones:

12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1105
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1097
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1109
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 15 34 0 0 0 1095
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29 0 0 0 14 29 0 0 0 910
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 908
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 911
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 886
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 909
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 901
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
MatSOR 60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 902

On Aug 20, 2013, at 6:40 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:

>
> On Aug 20, 2013, at 4:39 PM, Jed Brown <jedbr...@mcs.anl.gov> wrote:
>
>> Hmm, less clear flops benefit. Half of these are nonzero initial guess, for
>> which triangular storage would be worse
>
>    Is it clear that triangular storage would be particularly worse? I think
> only a small percentage worse for a multiply. The problem with CSR for
> triangular solves is that for each new row one needs to jump some amount in
> the index and double arrays, wasting cache lines and limiting streaming. With
> a multiply using triangular storage there is no jumping, just streaming two
> different index and double arrays, and current CPUs have no problem managing
> 4 streams.
>
>    In other words, CSR is good for multiply, bad for triangular solve;
> triangular storage is good for triangular solve and pretty good for multiply.
>
>    Barry
>
>    Barry
>
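To make the storage argument concrete, here is one way "triangular storage" could look: keep the strictly lower and strictly upper triangles as two separate CSR-like streams plus the diagonal. This is just an illustration with made-up names (TriMat, li/lj/lv, ui/uj/uv, d), not an existing PETSc matrix type. A lower solve streams only the lower arrays from start to finish, and a multiply streams both, so neither has to skip over part of each row the way a triangular solve on plain AIJ does.

/* Hypothetical split ("triangular") storage: strictly lower part, diagonal,
   strictly upper part, each stored contiguously.  Illustration only. */
typedef struct {
  int     m;                   /* number of rows                  */
  int    *li, *lj; double *lv; /* CSR of the strictly lower triangle */
  int    *ui, *uj; double *uv; /* CSR of the strictly upper triangle */
  double *d;                   /* diagonal                        */
} TriMat;

/* y = A*x: stream the lower and upper arrays once each, no skipping */
static void trimat_mult(const TriMat *A, const double *x, double *y)
{
  for (int i = 0; i < A->m; i++) {
    double sum = A->d[i]*x[i];
    for (int k = A->li[i]; k < A->li[i+1]; k++) sum += A->lv[k]*x[A->lj[k]];
    for (int k = A->ui[i]; k < A->ui[i+1]; k++) sum += A->uv[k]*x[A->uj[k]];
    y[i] = sum;
  }
}

/* Solve (D + L) x = b: only the lower arrays are touched, as one contiguous stream */
static void trimat_lower_solve(const TriMat *A, const double *b, double *x)
{
  for (int i = 0; i < A->m; i++) {
    double sum = b[i];
    for (int k = A->li[i]; k < A->li[i+1]; k++) sum -= A->lv[k]*x[A->lj[k]];
    x[i] = sum/A->d[i];
  }
}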
>> (unless you add an Eisenstat optimization).
>>
>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <mfad...@lbl.gov> wrote:
>> Barry,
>>
>> These are tests using the well load balanced problems in KSP on my Mac:
>>
>> 3D Vector (ex56)
>>
>> 8 proc:
>>
>> rich/ssor(1)
>>> KSPSolve 1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16 8 100100100100100 2047
>>
>> cheb/jac(2)
>>> KSPSolve 1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19 8 100100100100100 2053
>>
>> 1 proc:
>>
>> rich/ssor(1)
>>> KSPSolve 1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22 0 0 0 100100 0 0 0 1727
>>
>> cheb/jac(2)
>>> KSPSolve 1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24 0 0 0 100100 0 0 0 1870
>>
>> 2D Scalar (ex54) 4 procs
>>
>> cheb/jac(2)
>>> KSPSolve 1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100 592
>>
>> rich/ssor(1)
>>> KSPSolve 1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100 453
>>
>> I'm not sure if this matters but I wanted to get off of the funny test
>> problem that I was using.
>>
>> Mark
>>
>> On Aug 17, 2013, at 5:30 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:
>>
>>>
>>> Mark,
>>>
>>> Why rich/eisenstat(2)? With Cheb/eisenstat you get the acceleration of
>>> Cheby on top of the SOR, but rich/eisenstat(2) just seems like some strange
>>> implementation of sor?
>>>
>>> I am guessing you are getting your 15% potential improvement from
>>>
>>>>>> 7.63/11.3
>>> 0.6752212389380531
>>>>>> .6752212389380531*5.9800e+00
>>> 4.037823008849558
>>>>>> .63/4.63
>>> 0.13606911447084233
>>>
>>> Anyways I would focus on rich/ssor() smoothing.
>>>
>>> Before thinking about implementing a new data structure I think it is
>>> relatively easy to further reduce the total number of flops done with the
>>> ssor(1) smoother [and despite the current conventional wisdom that doing
>>> extra flops is a good thing in HPC :-)].
>>>
>>> Presmoothing: Note that since you are running ssor(1) with a zero initial
>>> guess the current implementation is already good in terms of reducing total
>>> flops; it does one down solve (saving the intermediate values) and then one
>>> up solve (using those saved values).
>>>
>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess, and
>>> the current implementation in MatSOR_SeqAIJ() is bad because it applies the
>>> entire matrix in both the down solve and the up solve:
>>>
>>> while (its--) {
>>>   if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>     for (i=0; i<m; i++) {
>>>       n    = a->i[i+1] - a->i[i];
>>>       idx  = a->j + a->i[i];
>>>       v    = a->a + a->i[i];
>>>       sum  = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>>   if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>     for (i=m-1; i>=0; i--) {
>>>       n    = a->i[i+1] - a->i[i];
>>>       idx  = a->j + a->i[i];
>>>       v    = a->a + a->i[i];
>>>       sum  = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>> }
>>>
>>> (Note also that this partially explains why the flop rate in the table for
>>> MatSOR() is so good; two of the three ssor triangular solves are actually
>>> full matrix-vector products.) What it should do is apply the entire matrix
>>> in the down solve (but save the partial sums of the lower triangular part)
>>> and then it can do the upper solve applying only 1/2 the matrix and using
>>> the accumulated values. So currently the combined pre and post smooth has 3
>>> complete applications of the matrix (.5 and .5 from the pre smooth and 1
>>> and 1 from the post smooth). Adding the optimization will reduce it to 2.5
>>> applications of the matrix, giving a reduction of 1/6 = 17 percent of the
>>> smooth flops, or
>>>
>>>>>> 17.*14/33
>>> 7.212121212121212
>>>
>>> a 7 percent decrease in the total KSPSolve flops.
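If I read this right, the optimized nonzero-initial-guess SSOR iteration would look something like the sketch below: the forward sweep still touches every entry, but it stashes the strictly-lower partial sums (which are computed with already-updated x values and do not change again before the back sweep needs them), and the backward sweep then only traverses the strictly-upper part of each row. Plain CSR with made-up names (ai/aj/aa, diag[] giving the position of a_ii in each row, lsum scratch of length m), not actual PETSc code, just to check the bookkeeping:

/* Sketch only: one nonzero-initial-guess SSOR iteration (forward then backward
   sweep) that saves the strictly-lower partial sums in the forward sweep so the
   backward sweep only traverses the strictly-upper triangle. */
static void ssor_nonzero_guess(int m, const int *ai, const int *aj, const double *aa,
                               const int *diag, const double *b, double omega,
                               double *x, double *lsum)
{
  /* forward sweep: full rows, but remember the lower sums */
  for (int i = 0; i < m; i++) {
    double lo = 0.0, up = 0.0;
    for (int k = ai[i]; k < diag[i]; k++)     lo += aa[k]*x[aj[k]]; /* j < i, already updated */
    for (int k = diag[i]+1; k < ai[i+1]; k++) up += aa[k]*x[aj[k]]; /* j > i, old values */
    lsum[i] = lo;  /* x_j for j < i will not change again before the back sweep visits row i */
    x[i]    = (1.0 - omega)*x[i] + omega*(b[i] - lo - up)/aa[diag[i]];
  }
  /* backward sweep: only the upper triangle is traversed, the lower sums are reused */
  for (int i = m-1; i >= 0; i--) {
    double up = 0.0;
    for (int k = diag[i]+1; k < ai[i+1]; k++) up += aa[k]*x[aj[k]]; /* j > i, already final */
    x[i] = (1.0 - omega)*x[i] + omega*(b[i] - lsum[i] - up)/aa[diag[i]];
  }
}

The forward sweep is still a full application of the matrix, but the back sweep is only half of one, which is where the 3 -> 2.5 count above comes from.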
>>> Next we can eliminate 1/2 of the application of the matrix in the residual
>>> computation after the smoothing if we have saved the partial sums of the
>>> upper triangular solve in the smoother. We can estimate the flops of the
>>> matrix multiply in the residual computation as one application of the
>>> matrix and hence equal to 1/3 of the flops in the ssor computations.
>>> Eliminating 1/2 of that means eliminating the equivalent of 1/6 of the
>>> flops in the MatSOR computation, which is
>>>
>>>>>> 14./(33*6)
>>> 0.0707070707070707
>>>
>>> (actually the same computation as above :-() so another 7 percent of the
>>> total flops in the KSPSolve. So in total we could save 14 percent of the
>>> flops (and lots of memory access) in the KSPSolve at the cost of editing a
>>> couple of functions.
>>>
>>> At that point we could compute the flop rate of all the SSOR triangular
>>> solves and the percent of the KSPSolve() time spent in them to determine if
>>> a new data structure is warranted. Note that in reducing the total number
>>> of flops we are replacing two full matrix-vector products with 2 triangular
>>> solves, hence the flop rate will be lower, indicating that using a new data
>>> structure will actually help more.
>>>
>>> Notes on the implementation:
>>>
>>> * Modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is
>>> straightforward; edit one function.
>>>
>>> * Modifying the residual computation in the multigrid to reuse the upper
>>> triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ()
>>> would have an internal pointer to the accumulated sums for the upper
>>> triangular part and we would need to either (1) have a MatMult_SeqAIJ()
>>> implementation that would retrieve those values and use them appropriately
>>> (how would it know when the values are valid?) or (2) use PCMGSetResidual()
>>> to provide a custom residual routine that again pulled out the accumulated
>>> sums and used them. Of course we will eventually figure out how to organize
>>> it cleanly for users.
>>>
>>> Note also that even if a new custom data structure is introduced the work
>>> outlined above is not lost, since the new data structure still needs those
>>> flop reduction optimizations. So step 1, fix MatSOR_SeqAIJ() for the
>>> nonzero initial guess and get a roughly 6 percent improvement; step 2,
>>> optimize the residual computation and get another 6 percent improvement;
>>> step 3, introduce a new data structure and get another roughly 5 percent
>>> improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
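A rough sketch of the residual piece, under the same made-up CSR naming as above and not a proposal for the actual MatMult_SeqAIJ()/PCMGSetResidual() plumbing: assume the last backward sweep stored usum[i] = sum_{j>i} a_ij*x_j (in a back sweep those x_j are already final when row i is visited, and they are exactly the upper sums the sweep computes anyway, the "up" variable in the earlier sketch). Then the residual after the smoother only needs the strictly-lower half of each row:

/* Sketch only: r = b - A*x using cached upper-triangular sums usum[i] saved by
   the preceding backward sweep.  Only the strictly-lower part of each row is
   traversed here, i.e. about half the matrix. */
static void residual_with_cached_upper(int m, const int *ai, const int *aj,
                                       const double *aa, const int *diag,
                                       const double *b, const double *x,
                                       const double *usum, double *r)
{
  for (int i = 0; i < m; i++) {
    double lo = 0.0;
    for (int k = ai[i]; k < diag[i]; k++) lo += aa[k]*x[aj[k]]; /* j < i, final values */
    r[i] = b[i] - lo - aa[diag[i]]*x[i] - usum[i];
  }
}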
>>>
>>> I would be very interested in seeing new numbers.
>>>
>>> Final note: the very bad load balance of this problem will limit how much
>>> these optimizations will help. For a perfectly load balanced problem I
>>> think these optimizations would be a lot higher in percentage (maybe twice
>>> as much? more?).
>>>
>>> Barry
>>>
>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <mfad...@lbl.gov> wrote:
>>>
>>>> I would like to get an idea of how much benefit there would be with using
>>>> a special matrix type for SOR. Here is an experiment on 128 cores of
>>>> Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look like
>>>> higher order stencils at BCs. 32^3 subdomain on each core:
>>>>
>>>> cheb/jacobi(2)
>>>> KSPSolve 15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02 7 29 67 46 7 26100100100 76 18765
>>>>
>>>> rich/eisenstat(2)
>>>> KSPSolve 15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44 7 38100100100 76 11659
>>>>
>>>> rich/sor
>>>> KSPSolve 15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31 8 33100100100 77 15708
>>>>
>>>> Complete log files attached. The "projection" solve is the solver of
>>>> interest.
>>>>
>>>> I have 2 Jacobi so that it has about the same amount of work as one
>>>> (s)sor. There are voids in the domain which I believe account for the
>>>> large differences in the number of flops per core. These were run with the
>>>> same processor group (i.e., all runs done in the same qsub script).
>>>>
>>>> This shows only about a 15% potential gain. Should we conclude that there
>>>> is not much to gain from an optimized data structure?
>>>>
>>>> Mark
>>>> <log_eis><log_jac><log_sor>
>>>>
>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <jedbr...@mcs.anl.gov> wrote:
>>>>
>>>>> "Mark F. Adams" <mfad...@lbl.gov> writes:
>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>> problems but for me robustness trumps this for default solver
>>>>>> parameters in PETSc.
>>>>>
>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>> local work, low/no setup cost, and reliable convergence. Cheby/Jacobi
>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>> anything on a CPU.
>>>>>
>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>> but needs a specialized matrix to get good flop rates. Something to
>>>>>> think about doing …
>>>>>
>>>>> Mine was too sloppy, Barry got it right. Eisenstat is for Cheby/SOR,
>>>>> however, and doesn't do anything for Richardson.
>>>>>
>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>> that the back-sweep can use it, and vice-versa. With full caching and
>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>> (nonzero initial guess). (This is a strong smoother and frequently, one
>>>>> SSOR would be enough.)
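One more note for the record, since Eisenstat's trick comes up a couple of times above. The classical identity behind it (Eisenstat, 1981), written for A = L + D + U with the SSOR split preconditioner, is

  (D/omega + L)^{-1} A (D/omega + U)^{-1}
      = (D/omega + U)^{-1} + (D/omega + L)^{-1} [ I + (1 - 2/omega) D (D/omega + U)^{-1} ]

so applying the split-preconditioned operator to a vector v costs one upper solve t = (D/omega + U)^{-1} v, one lower solve s = (D/omega + L)^{-1} (v + (1 - 2/omega) D t), and the sum t + s: two triangular solves plus a diagonal scaling, with no separate multiply by A. That is where the "almost 50%" work saving mentioned above comes from, and, as Jed says, it only pays off under an outer Krylov/Chebyshev iteration on top of SOR, not under plain Richardson. This is the textbook form only, not a description of how the existing PCEISENSTAT code is organized.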