This code looks funny in MatSOR_SeqAIJ:

          x[i] = (1-omega)*x[i] + sum*idiag[i];
 
shouldn't this be:

          x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
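
For reference, here is the textbook pointwise SOR update written out as a tiny
standalone CSR sketch.  This is not the PETSc code; the array names and the
little test matrix are made up, and it just shows that in the textbook form
omega scales the whole Gauss-Seidel correction, as in the second line above:

#include <stdio.h>

/* One forward SOR sweep over a CSR matrix.  ai/aj/aa are the usual CSR row
   pointers, column indices and values; idiag[i] = 1/a(i,i). */
static void sor_forward(int m, const int *ai, const int *aj, const double *aa,
                        const double *idiag, const double *b, double omega,
                        double *x)
{
  for (int i = 0; i < m; i++) {
    double sum = b[i];
    for (int k = ai[i]; k < ai[i+1]; k++) {
      if (aj[k] != i) sum -= aa[k]*x[aj[k]];  /* b[i] - (off-diagonal row i).x */
    }
    /* omega multiplies the whole correction term */
    x[i] = (1.0 - omega)*x[i] + omega*sum*idiag[i];
  }
}

int main(void)
{
  /* 3x3 tridiagonal test matrix [2 -1 0; -1 2 -1; 0 -1 2], b = (1,0,1) */
  int    ai[]    = {0,2,5,7};
  int    aj[]    = {0,1, 0,1,2, 1,2};
  double aa[]    = {2,-1, -1,2,-1, -1,2};
  double idiag[] = {0.5,0.5,0.5};
  double b[]     = {1,0,1}, x[] = {0,0,0};
  for (int s = 0; s < 20; s++) sor_forward(3, ai, aj, aa, idiag, b, 1.5, x);
  printf("x = %g %g %g\n", x[0], x[1], x[2]);  /* approaches the exact (1,1,1) */
  return 0;
}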
 
I've also done the first optimization (for the non-blocked version) and get the
results below (running each case many times because my Mac is a little noisy).
Flop rates go down a bit and total flops go down a lot (perhaps I have a flop
counting bug?).  6 old runs, then 7 new ones:

12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  14 34  0  0  0  1105
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  14 34  0  0  0  1097
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  14 34  0  0  0  1109
12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  15 34  0  0  0  1095
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f make-local 
runex54 | grep MatSOR
MatSOR                60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 11 29  0  0  0  14 29  0  0  0   910
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   908
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   911
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   886
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   909
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   901
12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$  make -f 
make-local runex54 | grep MatSOR
MatSOR                60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 29  0  0  0  14 29  0  0  0   902


On Aug 20, 2013, at 6:40 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:

> 
> On Aug 20, 2013, at 4:39 PM, Jed Brown <jedbr...@mcs.anl.gov> wrote:
> 
>> Hmm, the flops benefit is less clear. Half of these are nonzero initial 
>> guess, for which triangular storage would be worse
> 
>   Is it clear that triangular storage would be particularly worse? I think it 
> is only a small percentage worse for a multiply. The problem with CSR for 
> triangular solves is that for each new row one needs to jump some amount in 
> the index and double arrays, wasting cache lines and limiting streaming. With 
> a multiply using triangular storage there is no jumping, just streaming two 
> different index and double arrays, and current CPUs have no problem managing 
> 4 streams. 
> 
>   In other words, CSR is good for multiply but bad for triangular solve; 
> triangular storage is good for triangular solve and pretty good for multiply.
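> 
> A minimal sketch of the kind of multiply meant here (illustrative only, not a 
> PETSc data structure, and the names are made up): keep the strictly lower and 
> strictly upper triangles as two separate CSR-like streams plus the diagonal, 
> so the multiply just walks the arrays front to back with no jumping.
> 
> /* y = (L + D + U) x with the strict lower and upper triangles stored as
>    separate CSR triangles (li,lj,la) and (ui,uj,ua); diag holds a(i,i). */
> static void mult_split_triangular(int m,
>                const int *li, const int *lj, const double *la,  /* strict lower */
>                const int *ui, const int *uj, const double *ua,  /* strict upper */
>                const double *diag, const double *x, double *y)
> {
>   for (int i = 0; i < m; i++) {
>     double sum = diag[i]*x[i];
>     /* both triangles are streamed sequentially: two index and two value arrays */
>     for (int k = li[i]; k < li[i+1]; k++) sum += la[k]*x[lj[k]];
>     for (int k = ui[i]; k < ui[i+1]; k++) sum += ua[k]*x[uj[k]];
>     y[i] = sum;
>   }
> }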
> 
>   Barry
> 
> 
> 
>> (unless you add an Eisenstat optimization).
>> 
>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <mfad...@lbl.gov> wrote:
>> Barry,
>> 
>> These are tests using the well load balanced problems in KSP on my Mac:
>> 
>> 3D Vector (ex56)
>> 
>> 8 proc:
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 
>>> 8.5e+01 19 26 26 16  8 100100100100100  2047
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 
>>> 8.8e+01 17 25 27 19  8 100100100100100  2053
>> 
>> 1 proc:
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 10 22  0  0  0 100100  0  0  0  1727
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 12 24  0  0  0 100100  0  0  0  1870
>> 
>> 2D Scalar (ex54) 4 procs
>> 
>> cheb/jac(2)
>>> KSPSolve               1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 
>>> 7.8e+02 91100 98 93 95 100100100100100   592
>> 
>> rich/ssor(1)
>>> KSPSolve               1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 
>>> 7.1e+02 89100 98 90 95 100100100100100   453
>> 
>> I'm not sure if this matters but I wanted to get off of the funny test 
>> problem that I was using.
>> 
>> Mark
>> 
>> On Aug 17, 2013, at 5:30 PM, Barry Smith <bsm...@mcs.anl.gov> wrote:
>> 
>>> 
>>>  Mark,
>>> 
>>>   Why rich/eisenstat(2)?  With Cheb/eisenstat you get the acceleration of 
>>> Cheby on top of the SOR, but rich/eisenstat(2) just seems like some strange 
>>> implementation of SOR.
>>> 
>>>   I am guessing you are getting your 15% potential improvement from
>>> 
>>>>>> 7.63/11.3
>>> 0.6752212389380531
>>>>>> .6752212389380531*5.9800e+00
>>> 4.037823008849558
>>>>>> .63/4.63
>>> 0.13606911447084233
>>> 
>>> Anyway, I would focus on rich/ssor() smoothing.
>>> 
>>> Before thinking about implementing a new data structure I think it is 
>>> relatively easy to further reduce the total number of flops done by the 
>>> ssor(1) smoother [and this despite the current conventional wisdom that 
>>> doing extra flops is a good thing in HPC :-)].
>>> 
>>> Presmoothing: Note that since you are running ssor(1) with a zero initial 
>>> guess, the current implementation is already good in terms of reducing 
>>> total flops; it does one down solve (saving the intermediate values) and 
>>> then one up solve (using those saved values), as in the sketch below.
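>>> 
>>> A sketch of that zero-initial-guess case, just to fix notation for what 
>>> follows (made-up names, not the actual PETSc code): the down sweep touches 
>>> only the strict lower triangle and caches its row sums in t[], and the up 
>>> sweep then touches only the strict upper triangle, so one SSOR(1) costs 
>>> about one full application of the matrix.
>>> 
>>> /* SSOR(1), zero initial guess.  Strict lower triangle in (li,lj,la), strict
>>>    upper in (ui,uj,ua), idiag[i] = 1/a(i,i), t[] is scratch of length m. */
>>> static void ssor_zero_guess(int m,
>>>                const int *li, const int *lj, const double *la,
>>>                const int *ui, const int *uj, const double *ua,
>>>                const double *idiag, const double *b,
>>>                double omega, double *x, double *t)
>>> {
>>>   for (int i = 0; i < m; i++) {          /* down sweep: lower triangle only */
>>>     double low = 0.0;
>>>     for (int k = li[i]; k < li[i+1]; k++) low += la[k]*x[lj[k]];
>>>     t[i] = low;                          /* save for the up sweep */
>>>     x[i] = omega*(b[i] - low)*idiag[i];  /* zero guess: upper part is zero */
>>>   }
>>>   for (int i = m-1; i >= 0; i--) {       /* up sweep: upper triangle only */
>>>     double up = 0.0;
>>>     for (int k = ui[i]; k < ui[i+1]; k++) up += ua[k]*x[uj[k]];
>>>     x[i] = (1.0 - omega)*x[i] + omega*(b[i] - t[i] - up)*idiag[i];
>>>   }
>>> }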
>>> 
>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess, and 
>>> the current implementation in MatSOR_SeqAIJ() is bad because it applies the 
>>> entire matrix in both the down solve and the up solve:
>>> 
>>> while (its--) {
>>>   if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>     /* forward sweep: each row is dotted against all of x, a full MatMult's worth of work */
>>>     for (i=0; i<m; i++) {
>>>       n   = a->i[i+1] - a->i[i];
>>>       idx = a->j + a->i[i];
>>>       v   = a->a + a->i[i];
>>>       sum = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>>   if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>     /* backward sweep: again a full-row dot product, nothing from the forward sweep is reused */
>>>     for (i=m-1; i>=0; i--) {
>>>       n   = a->i[i+1] - a->i[i];
>>>       idx = a->j + a->i[i];
>>>       v   = a->a + a->i[i];
>>>       sum = b[i];
>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>     }
>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   }
>>> }
>>> 
>>> (Note also that this partially explains why the flop rate in the table for 
>>> MatSOR() is so good; two of the three ssor triangular solves are actually 
>>> full matrix-vector products.)  What it should do is apply the entire matrix 
>>> in the down solve (but save the partial sums of the lower triangular part) 
>>> and then do the up solve applying only 1/2 the matrix and using the 
>>> accumulated values.  So currently the combined pre and post smooth has 3 
>>> complete applications of the matrix (.5 and .5 from the pre smooth and 1 
>>> and 1 from the post smooth).  Adding the optimization will reduce it to 2.5 
>>> applications of the matrix, giving a reduction of 1/6 = 17 percent of the 
>>> smooth flops, or
>>> 
>>>>>> 17.*14/33
>>> 7.212121212121212
>>> 
>>> 7 percent decrease in the total KSPSolve flops
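>>> 
>>> Something along those lines could look like the sketch below (made-up names 
>>> again, same split-triangle arrays as before, not the actual code): the down 
>>> sweep applies the whole row but caches the lower-triangular dot products in 
>>> t[], so the up sweep only has to stream the upper triangle; the up sweep in 
>>> turn caches its upper sums in u[] for the residual reuse discussed next.
>>> 
>>> /* SSOR(1), nonzero initial guess, with cached partial sums. */
>>> static void ssor_nonzero_guess(int m,
>>>                const int *li, const int *lj, const double *la,
>>>                const int *ui, const int *uj, const double *ua,
>>>                const double *idiag, const double *b,
>>>                double omega, double *x, double *t, double *u)
>>> {
>>>   for (int i = 0; i < m; i++) {          /* down sweep: full row, cache lower sums */
>>>     double low = 0.0, up = 0.0;
>>>     for (int k = li[i]; k < li[i+1]; k++) low += la[k]*x[lj[k]];
>>>     for (int k = ui[i]; k < ui[i+1]; k++) up  += ua[k]*x[uj[k]];
>>>     t[i] = low;
>>>     x[i] = (1.0 - omega)*x[i] + omega*(b[i] - low - up)*idiag[i];
>>>   }
>>>   for (int i = m-1; i >= 0; i--) {       /* up sweep: upper triangle only */
>>>     double up = 0.0;
>>>     for (int k = ui[i]; k < ui[i+1]; k++) up += ua[k]*x[uj[k]];
>>>     u[i] = up;                           /* upper sums against the final x */
>>>     x[i] = (1.0 - omega)*x[i] + omega*(b[i] - t[i] - up)*idiag[i];
>>>   }
>>> }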
>>> 
>>> Next we can eliminate 1/2 of the application of the matrix in the residual 
>>> computation after the smoothing if we have saved the partial sums of the 
>>> upper triangular solve in the smoother. We can estimate the flops of the 
>>> matrix multiply in the residual computation as one application of the 
>>> matrix and hence equal to 1/3 of the flops in the ssor computations. 
>>> Eliminating 1/2 of that means eliminating the equivalent of 1/6 of the 
>>> flops in the MatSOR computation, which is
>>> 
>>>>>> 14./(33*6)
>>> 0.0707070707070707  (actually the same computation as above :-( )
>>> 
>>> so another 7 percent of the total flops in the KSPSolve.  So in total we 
>>> could save 14 percent of the flops (and lots of memory access) in the 
>>> KSPSolve at the cost of editing a couple of functions.
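>>> 
>>> For instance, reusing the u[] cached by the up sweep in the sketch above 
>>> (again illustrative only; mdiag[] here holds the diagonal entries), the 
>>> residual only needs to recompute the lower triangle plus the diagonal:
>>> 
>>> /* r = b - A x after the up sweep; u[i] already holds the strict upper
>>>    triangular dot product of row i with the final x. */
>>> static void residual_after_ssor(int m,
>>>                const int *li, const int *lj, const double *la,
>>>                const double *mdiag, const double *u,
>>>                const double *b, const double *x, double *r)
>>> {
>>>   for (int i = 0; i < m; i++) {
>>>     double low = mdiag[i]*x[i];
>>>     for (int k = li[i]; k < li[i+1]; k++) low += la[k]*x[lj[k]];
>>>     r[i] = b[i] - low - u[i];
>>>   }
>>> }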
>>> 
>>> At that point we could compute the flop rate of all the SSOR triangular 
>>> solves and the percent of the KSPSolve() time spent in them to determine 
>>> whether a new data structure is warranted. Note that in reducing the total 
>>> number of flops we are replacing two full matrix-vector products with 2 
>>> triangular solves, hence the flop rate will be lower, indicating that a new 
>>> data structure would actually help even more.
>>> 
>>> Notes on the implementation:
>>> 
>>> * modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is 
>>> straightforward; edit one function
>>> 
>>> * modifying the residual computation in the multigrid to reuse the upper 
>>> triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ() 
>>> would have an internal pointer to the accumulated solves for the upper 
>>> triangular part and we would need to either (1) have a MatMult_SeqAIJ() 
>>> implementation that would retrieve those values and use them appropriately 
>>> (how would it know when the values are valid?) or (2) use the 
>>> PCMGSetResidual() to provide a custom residual routine that again pulled 
>>> out the accumulated sums and used them. Of course we will eventually figure 
>>> out how to organize it cleanly for users.
>>> 
>>> 
>>> Note also that even if a new custom data structure is introduced the work 
>>> outlined above is not lost, since the new data structure still needs those 
>>> flop reduction optimizations. So step one, fix MatSOR_SeqAIJ() for nonzero 
>>> initial guess and get a roughly 6 percent improvement; step two, optimize 
>>> the residual computation and get another 6 percent improvement; step three, 
>>> introduce a new data structure and get another roughly 5 percent 
>>> improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
>>> 
>>> I would be very interested in seeing new numbers.
>>> 
>>> Final note: the very poor load balance of this problem will limit how much 
>>> these optimizations will help. For a perfectly load balanced problem I 
>>> think these optimizations would give a lot higher percentage improvement 
>>> (maybe twice as much? more?).
>>> 
>>> 
>>>  Barry
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <mfad...@lbl.gov> wrote:
>>> 
>>>> I would like to get an idea of how much benefit there would be with using 
>>>> a special matrix type for SOR.  Here is an experiment on 128 cores of 
>>>> Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look like 
>>>> higher order stencils at BCs.  32^3 subdomain on each core:
>>>> 
>>>> cheb/jacobi(2)
>>>> KSPSolve              15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 
>>>> 2.8e+02  7 29 67 46  7  26100100100 76 18765
>>>> 
>>>> 
>>>> rich/eisenstat(2)
>>>> KSPSolve              15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 
>>>> 2.8e+02 12 32 66 44  7  38100100100 76 11659
>>>> 
>>>> 
>>>> rich/sor
>>>> KSPSolve              15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 
>>>> 3.1e+02 10 21 57 31  8  33100100100 77 15708
>>>> 
>>>> 
>>>> Complete log files attached.  The "projection" solve is the solver of 
>>>> interest.
>>>> 
>>>> I use 2 Jacobi iterations so that it has about the same amount of work as 
>>>> one (s)sor.  There are voids in the domain, which I believe accounts for 
>>>> the large differences in the number of flops per core.  These were run 
>>>> with the same processor group (i.e., all runs done in the same qsub 
>>>> script).
>>>> 
>>>> This shows only about 15% potential gain.  Should we conclude that there 
>>>> is not much to gain from an optimized data structure?
>>>> 
>>>> Mark
>>>> <log_eis><log_jac><log_sor>
>>>> 
>>>> 
>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <jedbr...@mcs.anl.gov> wrote:
>>>> 
>>>>> "Mark F. Adams" <mfad...@lbl.gov> writes:
>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>> problems but for me robustness trumps this for default solver
>>>>>> parameters in PETSc.
>>>>> 
>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>> local work, low/no setup cost, and reliable convergence.  Cheby/Jacobi
>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>> anything on a CPU.
>>>>> 
>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>> but needs a specialized matrix to get good flop rates.  Something to
>>>>>> think about doing …
>>>>> 
>>>>> Mine was too sloppy, Barry got it right.  Eisenstat is for Cheby/SOR,
>>>>> however, and doesn't do anything for Richardson.
>>>>> 
>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>> that the back-sweep can use it, and vice-versa.  With full caching and
>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>> (nonzero initial guess).  (This is a strong smoother and frequently, one
>>>>> SSOR would be enough.)
>>>> 
>>> 
>> 
> 
