> On Feb 5, 2020, at 4:36 AM, Hao DONG <dong-...@outlook.com> wrote:
> 
> Thanks a lot for your suggestions, Hong and Barry - 
> 
> As you suggested, I tried the direct LU solvers (built-in and MUMPS) this 
> morning, and they work perfectly, albeit slowly. As my problem is part of a 
> PDE-based optimization, an exact solution during the search procedure is not 
> necessary (I usually set a relative tolerance of 1E-7/1E-8 or so for Krylov 
> subspace methods). I also tried BJACOBI with exact LU on each block; the KSP 
> converges in one or two iterations, as expected. 
> 
> I added the -ksp_view option for the D-ILU test (still with block Jacobi as 
> the preconditioner and bcgs as the KSP solver). The -ksp_view output from one 
> of the examples on a toy model looks like: 
> 
> KSP Object: 1 MPI processes
>    type: bcgs
>    maximum iterations=120, nonzero initial guess
>    tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>    left preconditioning
>    using PRECONDITIONED norm type for convergence test
>  PC Object: 1 MPI processes
>    type: bjacobi
>      number of blocks = 3
>      Local solve is same for all blocks, in the following KSP and PC objects:
>      KSP Object: (sub_) 1 MPI processes
>        type: preonly
>        maximum iterations=10000, initial guess is zero
>        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>        left preconditioning
>        using NONE norm type for convergence test
>      PC Object: (sub_) 1 MPI processes
>        type: ilu
>          out-of-place factorization
>          0 levels of fill
>          tolerance for zero pivot 2.22045e-14
>          matrix ordering: natural
>          factor fill ratio given 1., needed 1.
>            Factored matrix follows:
>              Mat Object: 1 MPI processes
>                type: seqaij
>                rows=11294, cols=11294
>                package used to perform factorization: petsc
>                total: nonzeros=76008, allocated nonzeros=76008
>                total number of mallocs used during MatSetValues calls=0
>                  not using I-node routines
>        linear system matrix = precond matrix:
>        Mat Object: 1 MPI processes
>          type: seqaij
>          rows=11294, cols=11294
>          total: nonzeros=76008, allocated nonzeros=76008
>          total number of mallocs used during MatSetValues calls=0
>            not using I-node routines
>    linear system matrix = precond matrix:
>    Mat Object: 1 MPI processes
>      type: mpiaij
>      rows=33880, cols=33880
>      total: nonzeros=436968, allocated nonzeros=436968
>      total number of mallocs used during MatSetValues calls=0
>        not using I-node (on process 0) routines
> 
> Do you see anything wrong with my setup?
> 
> I also ran a quick performance test with a small 278906 x 278906 matrix 
> (3850990 nonzeros) with the following parameters: 
> 
> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu 
> -ksp_view
> 
> Reducing the relative residual to 1E-7 took 4.08s with 41 bcgs iterations. 
> 
> Merely changing -pc_bjacobi_local_blocks to 6 took 7.02s with 73 bcgs 
> iterations, and 9 blocks took 9.45s with 101 bcgs iterations.

    This is normal: more blocks means slower convergence, since each block's 
ILU ignores the coupling between blocks.
> 
> As a reference, my home-brew Fortran code solves the same problem (3-block 
> D-ILU0) in 
> 
> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
> 
    Build PETSc with optimization (./configure --with-debugging=0) and run the 
code with -log_view; this will show where the PETSc code is spending its time 
(send the output to us).


> 
> 
> Well, by “use an explicit L/U matrix as the preconditioner” I was wondering 
> whether a user is allowed to provide their own (separate) L and U Mats for 
> preconditioning (as in MATLAB's solvers), e.g.
> 
> x = qmr(A,b,Tol,MaxIter,L,U,x)
>  
> As I already explicitly construct the L and U matrices in Fortran, it is not 
> hard to convert them to Mats in PETSc so that I can test PETSc and my Fortran 
> code head-to-head. In that case A, b, x, and L/U would all be identical, so 
> it would be easier to see where the problem comes from. 
> 
> 
     No, we don't provide this kind of support.
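
     (If you want to reuse the factors you build in your own code, the closest 
thing is a PCSHELL whose apply routine calls your own forward/backward 
substitution. A very rough sketch follows; my_dilu_solve is your own routine, 
not anything PETSc provides, and error checking is omitted.)

       ! in your setup code
       external MyDILUApply
       call PCSetType(pc_local, PCSHELL, ierr)
       call PCShellSetApply(pc_local, MyDILUApply, ierr)

       ! the apply routine: y = M^{-1} x using your own factors
       subroutine MyDILUApply(pc, x, y, ierr)
#include <petsc/finclude/petscksp.h>
       use petscksp
       implicit none
       PC              pc
       Vec             x, y
       PetscErrorCode  ierr
       PetscScalar, pointer :: xv(:), yv(:)
       call VecGetArrayReadF90(x, xv, ierr)
       call VecGetArrayF90(y, yv, ierr)
       call my_dilu_solve(xv, yv, size(xv))   ! your forward/backward solves
       call VecRestoreArrayReadF90(x, xv, ierr)
       call VecRestoreArrayF90(y, yv, ierr)
       end subroutine MyDILUApply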
     

> 
> BTW, there is another thing I have been wondering about - is there a way to 
> output the residual in the unpreconditioned norm? I tried
> 
> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
> 
> But I always get an error that the current KSP does not support the 
> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
> possible to output the unpreconditioned residual in PETSc at all?

   -ksp_monitor_true_residual.   You can also run GMRES (and some other 
methods) with right preconditioning, -ksp_pc_side right; then the residual 
computed by the algorithm is the unpreconditioned residual.

   KSPSetNormType sets the norm used in the algorithm; it generally has to be 
the left- or right-preconditioned norm, and only a couple of algorithms support 
the unpreconditioned norm directly.
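
   For example, as command-line options

       -ksp_type gmres -ksp_pc_side right -ksp_monitor_true_residual

   or, for the PC side, the equivalent call from Fortran (ksp_local being your 
KSP):

       call KSPSetPCSide(ksp_local, PC_RIGHT, ierr)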

   Barry


> 
> Cheers, 
> Hao
> 
> 
> From: Smith, Barry F. <bsm...@mcs.anl.gov>
> Sent: Tuesday, February 4, 2020 8:27 PM
> To: Hao DONG <dong-...@outlook.com>
> Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> Subject: Re: [petsc-users] What is the right way to implement a (block) 
> Diagonal ILU as PC?
>  
> 
> 
> > On Feb 4, 2020, at 12:41 PM, Hao DONG <dong-...@outlook.com> wrote:
> > 
> > Dear all, 
> > 
> > 
> > I have a few questions about the implementation of a diagonal ILU PC in 
> > PETSc. I want to solve a very simple system with KSP (in parallel); the 
> > nature of the system (finite-difference time-harmonic Maxwell) is probably 
> > not important to the question itself. Long story short, I just need to 
> > solve a set of equations Ax = b with a block-diagonal system matrix, 
> > like (not sure if the mono font works): 
> > 
> >    |X    |  
> > A =|  Y  |  
> >    |    Z| 
> > 
> > Note that A is not really block-diagonal; it's just a multi-diagonal matrix 
> > determined by the FD mesh, where most elements are close to the diagonal. 
> > So instead of a full ILU decomposition, a D-ILU is a good approximation as 
> > a preconditioner, and the number of blocks should not matter too much: 
> > 
> >     |Lx      |         |Ux      |
> > L = |   Ly   | and U = |   Uy   |
> >     |      Lz|         |      Uz|
> > 
> > Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
> > blocks) is quite efficient with Krylov subspace methods like BiCGstab or 
> > QMR in my sequential Fortran/Matlab code. 
> > 
> > So like most users, I am looking for a parallel implementation of this in 
> > PETSc. After looking through the manual, it seems that the most 
> > straightforward way to do it is through PCBJACOBI. I'm not sure I 
> > understand it correctly, but I set up a 3-block PCBJACOBI and gave each 
> > block a KSP with PCILU. Is this supposed to be equivalent to my D-ILU 
> > preconditioner? My little block of Fortran code looks like: 
> > ...
> >       call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
> >      &     isubs,ierr)
> >       call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
> >      &    isubs(istart:iend),ierr)
> >       ! set up the block jacobi structure
> >       call KSPSetup(ksp_local,ierr)
> >       ! allocate sub ksps
> >       allocate(ksp_sub(Nsub))
> >       call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
> >      &     ksp_sub,ierr)
> >       do i=1,Nsub
> >           call KSPGetPC(ksp_sub(i),pc_sub,ierr)
> >           !ILU preconditioner
> >           call PCSetType(pc_sub,ptype,ierr)
> >           call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
> >           call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
> >       end do
> >       call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
> >      &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
> > … 
> 
>      This code looks essentially right. You should run with -ksp_view to see 
> exactly what PETSc is using for the solver. 
> 
> > 
> > I understand that the parallel performance may not be comparable, so I 
> > first set up a one-process test (with MPIAIJ, but all the rows are local 
> > since there is only one process). The system is solved without any problem 
> > (identical results to within error). But the performance is actually a lot 
> > worse (code built without debugging flags for the performance tests) than 
> > my own home-brew implementation in Fortran (I wrote my own ILU0 in CSR 
> > sparse matrix format), which is hard to believe. I suspect the difference 
> > comes from the PC, as the PETSc version took many more BiCGstab iterations 
> > (60-ish vs 100-ish) to converge to the same relative tolerance. 
> 
>    PETSc uses GMRES by default with a restart of 30 and left preconditioning. 
> Different exact choices in the solver (which is why -ksp_view is so useful) 
> could explain the differences in the runs between your code and PETSc's.
> > 
> > This is further confirmed when I change the setup of the D-ILU (using 6 or 
> > 9 blocks instead of 3). While my Fortran/Matlab codes see a minimal 
> > performance difference (<5%) when I play with the D-ILU setup, increasing 
> > the number of D-ILU blocks from 3 to 9 caused the KSP configured with 
> > PCBJACOBI to suffer a performance decrease of more than 50% in the 
> > sequential test.
> 
>    This is odd; the more blocks, the smaller each block, so the quicker the 
> ILU setup should be. You can run the various cases with -log_view and send us 
> the output so we can see where the time is spent in each part of the 
> computation.
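> 
> For example (./your_code is just a placeholder for your executable):
> 
>    ./your_code -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 \
>        -sub_pc_type ilu -log_view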
>  
> > So my implementation IS somewhat different in PETSc. Did I miss something 
> > in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of 
> > how PCBJACOBI works in PETSc? 
> 
>    Probably not.
> > 
> > If this is not the right way to implement a block-diagonal ILU as a 
> > (parallel) PC, please kindly point me in the right direction. I searched 
> > through the mailing list for answers, only to find a couple of similar 
> > questions... An example would be nice.
> 
>    Your approach seems fundamentally right, but I cannot rule out possible 
> bugs.
> > 
> > On the other hand, does PETSc support a simple way to use an explicit L/U 
> > matrix as a preconditioner? I could import the D-ILU matrix constructed in 
> > my Fortran code (I already converted my A matrix into a Mat) to make a 
> > better comparison. Or do I have to construct my own PC using PCSHELL? If 
> > so, is there a good tutorial/example for learning how to use PCSHELL (in a 
> > more advanced sense, like how to set up the PC side and transpose)? 
> 
>    Not sure what you mean by an explicit L/U matrix as a preconditioner. As 
> Hong said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or PaStiX 
> as the solver. You do not need any shell code. You simply need to set the PC 
> type to lu. 
> 
>    You can also set all these options from the command line and don't need to 
> write the code specifically. So call KSPSetFromOptions() and then, for example,
> 
>     -pc_type bjacobi  -pc_bjacobi_local_blocks 3 -sub_pc_type ilu   (the last 
> option is applied to each block, so you could use -sub_pc_type lu and it 
> would use LU on each block.) 
> 
>    -ksp_type preonly  -pc_type lu -pc_factor_mat_solver_type mumps   (do 
> parallel LU with MUMPS)
> 
> By not hardwiring choices in the code and just using options you can test out 
> different cases much more quickly.
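> 
> A rough sketch of where KSPSetFromOptions() goes (ksp, A, b, x being whatever 
> you already have in your code):
> 
>       call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
>       call KSPSetOperators(ksp, A, A, ierr)
>       call KSPSetFromOptions(ksp, ierr)   ! picks up -ksp_type, -pc_type, etc. at runtime
>       call KSPSolve(ksp, b, x, ierr)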
> 
> Use -ksp_view to make sure it is using the solver the way you expect.
> 
> Barry
> 
> 
> > 
> > Thanks in advance, 
> > 
> > Hao
