> On Feb 5, 2020, at 4:36 AM, Hao DONG <dong-...@outlook.com> wrote:
>
> Thanks a lot for your suggestions, Hong and Barry -
>
> As you suggested, I first tried the LU direct solvers (built-in and MUMPS) out this morning, which work perfectly, albeit slowly. As my problem itself is part of a PDE-based optimization, an exact solution in the search procedure is not necessary (I often set a relative tolerance of 1E-7/1E-8, etc. for the Krylov subspace methods). I also tried BJACOBI with exact LU; the KSP converges in one or two iterations, as expected.
>
> I added the -ksp_view option for the D-ILU test (still with block Jacobi as the preconditioner and bcgs as the KSP solver). The -ksp_view output from one of the examples in a toy model looks like:
>
> KSP Object: 1 MPI processes
>   type: bcgs
>   maximum iterations=120, nonzero initial guess
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object: 1 MPI processes
>   type: bjacobi
>     number of blocks = 3
>     Local solve is same for all blocks, in the following KSP and PC objects:
>   KSP Object: (sub_) 1 MPI processes
>     type: preonly
>     maximum iterations=10000, initial guess is zero
>     tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>     left preconditioning
>     using NONE norm type for convergence test
>   PC Object: (sub_) 1 MPI processes
>     type: ilu
>       out-of-place factorization
>       0 levels of fill
>       tolerance for zero pivot 2.22045e-14
>       matrix ordering: natural
>       factor fill ratio given 1., needed 1.
>         Factored matrix follows:
>           Mat Object: 1 MPI processes
>             type: seqaij
>             rows=11294, cols=11294
>             package used to perform factorization: petsc
>             total: nonzeros=76008, allocated nonzeros=76008
>             total number of mallocs used during MatSetValues calls=0
>               not using I-node routines
>     linear system matrix = precond matrix:
>     Mat Object: 1 MPI processes
>       type: seqaij
>       rows=11294, cols=11294
>       total: nonzeros=76008, allocated nonzeros=76008
>       total number of mallocs used during MatSetValues calls=0
>         not using I-node routines
>   linear system matrix = precond matrix:
>   Mat Object: 1 MPI processes
>     type: mpiaij
>     rows=33880, cols=33880
>     total: nonzeros=436968, allocated nonzeros=436968
>     total number of mallocs used during MatSetValues calls=0
>       not using I-node (on process 0) routines
>
> Do you see something wrong with my setup?
>
> I also tried a quick performance test with a small 278906 by 278906 matrix (3850990 nnz) with the following parameters:
>
>    -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_view
>
> Reducing the relative residual to 1E-7:
>
> Took 4.08s with 41 bcgs iterations.
>
> Merely changing -pc_bjacobi_local_blocks to 6:
>
> Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with 101 bcgs iterations.
   This is normal; more blocks means slower convergence.

> As a reference, my home-brew Fortran code solves the same problem (3-block D-ILU0) in
>
> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
>

   Run the PETSc code with an optimized build (./configure --with-debugging=0) and run the code with -log_view; this will show where the PETSc code is spending the time (send it to us).

>
> Well, by saying “use explicit L/U matrix as preconditioner”, I wonder if a user is allowed to provide his own (separate) L and U Mat for preconditioning (like how it works in the Matlab solvers), e.g.
>
>    x = qmr(A,b,Tol,MaxIter,L,U,x)
>
> As I have already explicitly constructed the L and U matrices in Fortran, it is not hard to convert them to Mat in PETSc to test PETSc and my Fortran code head-to-head. In that case the A, b, x, and L/U are all identical, so it would be easier to see where the problem comes from.
>

   No, we don't provide this kind of support.

>
> BTW, there is another thing I wondered about - is there a way to output the residual in the unpreconditioned norm? I tried
>
>    call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>
> but I always get an error that the current KSP does not support unpreconditioned in LEFT/RIGHT (either way). Is it possible to do that (output the unpreconditioned residual) in PETSc at all?

   -ksp_monitor_true_residual

   You can also run GMRES (and some other methods) with right preconditioning, -ksp_pc_side right; then the residual computed by the algorithm is the unpreconditioned residual. KSPSetNormType sets the norm used in the algorithm; it generally has to be the left- or right-preconditioned norm, and only a couple of algorithms support the unpreconditioned norm directly.

   Barry

>
> Cheers,
> Hao
>
>
> From: Smith, Barry F. <bsm...@mcs.anl.gov>
> Sent: Tuesday, February 4, 2020 8:27 PM
> To: Hao DONG <dong-...@outlook.com>
> Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> Subject: Re: [petsc-users] What is the right way to implement a (block) Diagonal ILU as PC?
>
>
> On Feb 4, 2020, at 12:41 PM, Hao DONG <dong-...@outlook.com> wrote:
> >
> > Dear all,
> >
> > I have a few questions about the implementation of a diagonal ILU PC in PETSc. I want to solve a very simple system with KSP (in parallel); the nature of the system (finite-difference time-harmonic Maxwell) is probably not important to the question itself. Long story short, I just need to solve a set of equations Ax = b with a block-diagonal system matrix, like (not sure if the mono font works):
> >
> >         | X     |
> >     A = |   Y   |
> >         |     Z |
> >
> > Note that A is not really block-diagonal; it's just a multi-diagonal matrix determined by the FD mesh, where most elements are close to the diagonal. So instead of a full ILU decomposition, a D-ILU is a good approximation as a preconditioner, and the number of blocks should not matter too much:
> >
> >         | Lx     |                | Ux     |
> >     L = |    Ly  |    and     U = |    Uy  |
> >         |      Lz|                |      Uz|
> >
> > where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N blocks) is quite efficient with Krylov subspace methods like BiCGstab or QMR in my sequential Fortran/Matlab code.
> >
> > So, like most users, I am looking for a parallel implementation of this in PETSc. After looking through the manual, it seems that the most straightforward way to do it is through PCBJACOBI. Not sure I understand it right, but I just set up a 3-block PCBJACOBI and gave each of the blocks a KSP with PCILU. Is this supposed to be equivalent to my D-ILU preconditioner?
> > My little block of Fortran code would look like:
> >
> >    ...
> >    call PCBJacobiSetTotalBlocks(pc_local, Ntotal, isubs, ierr)
> >    call PCBJacobiSetLocalBlocks(pc_local, Nsub, isubs(istart:iend), ierr)
> >    ! set up the block jacobi structure
> >    call KSPSetUp(ksp_local, ierr)
> >    ! allocate sub ksps
> >    allocate(ksp_sub(Nsub))
> >    call PCBJacobiGetSubKSP(pc_local, Nsub, istart, ksp_sub, ierr)
> >    do i = 1, Nsub
> >       call KSPGetPC(ksp_sub(i), pc_sub, ierr)
> >       ! ILU preconditioner on each block
> >       call PCSetType(pc_sub, ptype, ierr)
> >       call PCFactorSetLevels(pc_sub, 1, ierr)   ! use ILU(1) here
> >       call KSPSetType(ksp_sub(i), KSPPREONLY, ierr)
> >    end do
> >    call KSPSetTolerances(ksp_local, KSSiter%tol, PETSC_DEFAULT_REAL,   &
> >   &                      PETSC_DEFAULT_REAL, KSSiter%maxit, ierr)
> >    ...
>
>    This code looks essentially right. You should run with -ksp_view to see exactly what PETSc is using for a solver.
>
> > I understand that the parallel performance may not be comparable, so I first set up a one-process test (with MPIAIJ, but all the rows are local since there is only one process). The system is solved without any problem (identical results within error). But the performance is actually a lot worse (code built without debugging flags in the performance tests) than my own home-brew implementation in Fortran (I wrote my own ILU0 in CSR sparse matrix format), which is hard to believe. I suspect the difference comes from the PC, as the PETSc version took many more BiCGstab iterations (60-ish vs 100-ish) to converge to the same relative tolerance.
>
>    PETSc uses GMRES by default with a restart of 30 and left preconditioning. Different exact choices in the solver (which is why -ksp_view is so useful) could explain the differences between the runs of your code and PETSc's.
>
> > This is further confirmed when I change the setup of the D-ILU (using 6 or 9 blocks instead of 3). While my Fortran/Matlab codes see minimal performance difference (<5%) when I play with the D-ILU setup, increasing the number of D-ILU blocks from 3 to 9 caused the KSP setup with PCBJACOBI to suffer a performance decrease of more than 50% in the sequential test.
>
>    This is odd; the more blocks, the smaller each block, so the quicker the ILU setup should be. You can run the various cases with -log_view and send us the output to see how much time each part of the computation takes.
>
> > So my implementation IS somewhat different in PETSc. Did I miss something in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of how PCBJACOBI works in PETSc?
>
>    Probably not.
>
> > If this is not the right way to implement a block-diagonal ILU as a (parallel) PC, please kindly point me in the right direction. I searched through the mailing list to find some answers, only to find a couple of similar questions... An example would be nice.
>
>    Your approach seems fundamentally right, but I cannot be sure of possible bugs.
>
> > On the other hand, does PETSc support a simple way to use an explicit L/U matrix as a preconditioner? I can import the D-ILU matrix (I already converted my A matrix into a Mat) constructed in my Fortran code to make a better comparison. Or do I have to construct my own PC using PCSHELL? If so, is there a good tutorial/example for learning how to use PCSHELL (in a more advanced sense, like how to set up the PC side and transpose)?
>
>    Not sure what you mean by an explicit L/U matrix as a preconditioner.
>    As Hong said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or PaStiX as the solver. You do not need any shell code. You simply need to set the PCType to lu.
>
>    You can also set all of these options from the command line and don't need to write them into the code specifically. So call KSPSetFromOptions() and then, for example,
>
>       -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu   (this last option is applied to each block, so you could use -sub_pc_type lu and it would use LU on each block)
>
>       -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps   (do a parallel LU with MUMPS)
>
>    By not hardwiring choices in the code and just using options, you can test out different cases much more quickly.
>
>    Use -ksp_view to make sure it is using the solver the way you expect.
>
>    Barry
>
> >
> > Thanks in advance,
> >
> > Hao
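
For reference, here is a minimal Fortran sketch of the options-driven setup described in the reply above. It is only a sketch: it assumes PETSc's Fortran modules (petsc/finclude/petscksp.h with use petscksp), an already assembled Mat A and Vecs b and x, and that PetscInitialize has been called elsewhere; the subroutine and argument names are illustrative and not part of the thread. With this, the solver is chosen entirely on the command line, e.g. -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu, or -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps.

      subroutine solve_from_options(A, b, x, rtol, ierr)
#include <petsc/finclude/petscksp.h>
      use petscksp
      implicit none
      Mat            A        ! assembled system matrix (also used to build the PC)
      Vec            b, x     ! right-hand side and solution
      PetscReal      rtol     ! relative tolerance, e.g. 1.0d-7
      PetscErrorCode ierr
      KSP            ksp

      call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
      call KSPSetOperators(ksp, A, A, ierr)
      call KSPSetTolerances(ksp, rtol, PETSC_DEFAULT_REAL,               &
     &                      PETSC_DEFAULT_REAL, PETSC_DEFAULT_INTEGER, ierr)
      ! KSP type, PC type, number of bjacobi blocks, sub-PC, etc. are all
      ! taken from the command line (-ksp_*, -pc_*, -sub_*) at this point
      call KSPSetFromOptions(ksp, ierr)
      call KSPSolve(ksp, b, x, ierr)
      call KSPDestroy(ksp, ierr)
      end subroutine solve_from_options

Running the resulting executable with -ksp_view then confirms which KSP/PC combination was actually used, and -log_view (on a --with-debugging=0 build) shows where the time is spent, as suggested in the thread.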