Thanks for all the replies. I really appreciate it.
Bests,
Jinlei

On Tue, Aug 9, 2016 at 11:48 AM, Matthew Knepley <[email protected]> wrote:

> On Tue, Aug 9, 2016 at 9:52 AM, Jinlei Shen <[email protected]> wrote:
>
>> Hi Barry,
>>
>> Thanks for your answer.
>>
>> Logically, for a large problem we always expect the parallel program to perform better in both speed and memory, since each process deals independently with its own submatrix, especially for an iterative solver; that is what CG+BJ shows. What I don't understand is why, with CG+ASM and SUPER_LU, two processes are the least efficient of all the cases. If this is due to communication cost relative to a single process, why does the speed keep dropping for three and more processes? I'm new to parallelism; could you speculate on possible reasons for this behavior?
>
> As Barry noted, there are two reasons that you get slowdown:
>
> 1) Worse performance of the existing algorithm. This comes from things like communication. This is usually a small contributor to slowdown.
>
> 2) A different parallel algorithm. This is most likely what is causing your slowdown.
>
>    Matt
>
>> Great thanks
>>
>> On Fri, Aug 5, 2016 at 10:09 PM, Barry Smith <[email protected]> wrote:
>>
>>> > On Aug 5, 2016, at 5:58 PM, Jinlei Shen <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Thanks for your answers.
>>> >
>>> > I just figured out the issue, which is mainly due to the ill-conditioning of my matrix. I found that the condition number blows up when the beam is discretized into a large number of elements.
>>> >
>>> > Now I am using a 1D bar model to solve the same problem. The good news is that the solution is always accurate and stable, even when I discretize into 10 million elements.
>>> >
>>> > When I run the model in parallel with both iterative solvers (CG+BJACOBI/ASM) and a direct solver (SUPER_LU), I get the following solve times:
>>> >
>>> > Mesh size: 1 million unknowns
>>> >
>>> >   Processes        1     2     4      6     8     10    12     16     20
>>> >   CG+BJ           0.36  0.22  0.15   0.12  0.11  0.1   0.096  0.097  0.099
>>> >   CG+ASM          0.47  0.46  0.267  0.2   0.17  0.15  0.145  0.16   0.15
>>> >   SUPER_LU_DIST   4.73  5.4   4.69   4.58  4.38  4.2   4.27   4.28   4.38
>>> >
>>> > It seems CG+BJ behaves as expected: the time decreases quickly with the first few processes and levels off with many more cores.
>>> >
>>> > However, I have some concerns about CG+ASM and SUPER_LU_DIST. The time for both methods goes up when I use two processes compared with one process.
>>>
>>>    This is actually not surprising at all, but since the mantra is "parallelism will always make things faster" it can confuse people. When run with one process, ASM and SuperLU_DIST use essentially sequential algorithms; when run with two processes they "switch" to parallel algorithms, which are simply not as good as the essentially sequential algorithm used on one process, hence they run slower. This is just life; there really isn't anything one can do about it, except perhaps to use a poorer-quality algorithm on one process so that two processes look better. But the goal of PETSc is not to make parallelism look good, it is to provide efficient solvers (as best we can) for one and for multiple processes.
>>>
>>>    Barry
>>>
>>> > The tendency is more obvious when I use a larger mesh.
>>> > I especially doubt the SUPER_LU_DIST results in parallel, since the overall speedup is very small, which is not what I expected.
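A quick way to check where the SuperLU_DIST runs actually spend their time is PETSc's built-in profiling. A minimal sketch, reusing the placeholder executable ./xxxx and the SuperLU_DIST options quoted below; -log_view is the profiling option in 2016-era PETSc (older releases spell it -log_summary):

  mpiexec -n 2 ./xxxx -ksp_type preonly -pc_type lu \
      -pc_factor_mat_solver_package superlu_dist -log_view

The event lines of the report (e.g. MatLUFactorSym, MatLUFactorNum, MatSolve) show whether the time is going into the symbolic/numeric factorization or elsewhere, which helps explain the flat timings in the table above.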
>>> > The runtime options I use for the ASM preconditioner and the SUPER_LU_DIST solver are shown below:
>>> > ASM preconditioner: -pc_type asm -pc_asm_type basic
>>> > SUPER_LU_DIST solver: -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist
>>> >
>>> > I use the same mpiexec -n np ./xxxx for all solvers.
>>> >
>>> > Am I using them correctly? If so, is there any way to speed up the computation further, especially for SUPER_LU_DIST?
>>> >
>>> > Thank you very much!
>>> >
>>> > Bests,
>>> > Jinlei
>>> >
>>> > On Mon, Aug 1, 2016 at 2:10 PM, Matthew Knepley <[email protected]> wrote:
>>> > On Mon, Aug 1, 2016 at 12:52 PM, Jinlei Shen <[email protected]> wrote:
>>> > Hi Barry,
>>> >
>>> > Thanks for your reply.
>>> >
>>> > First, as you suggested, I checked my program under valgrind. The results for both the sequential and parallel cases show no memory errors.
>>> >
>>> > Second, I coded a sequential program without PETSc to generate the global matrix for a small mesh of the same problem. I then compared the matrix from PETSc (sequential and parallel) with the one from the serial code, and they are the same.
>>> > The way I assemble the global matrix in parallel is to first distribute the nodes and elements among the processes, then loop over the elements on the calling process to add the element stiffness into the global matrix. Since the nodes and elements of the cantilever beam are numbered consecutively, the connectivity is simple; I didn't use any partitioning tool to optimize the mesh. It is also easy to determine the preallocation d_nnz and o_nnz, since each node connects only to its left and right neighbors (except at the two ends), so the maximum number of nonzeros in each row is 6. The MatSetValues loop is as follows:
>>> >
>>> >   do iEL = idElStart, idElEnd
>>> >      g_EL = (/2*iEL-1-1, 2*iEL-1, 2*iEL+1-1, 2*iEL+2-1/)
>>> >      call MatSetValues(SG, 4, g_EL, 4, g_EL, SE, ADD_VALUES, ierr)
>>> >   end do
>>> >
>>> > where idElStart and idElEnd are the global numbers of the first and last elements that the process owns, g_EL holds the global indices of the DOFs of element iEL, and SE is the element stiffness matrix, which is the same for all elements.
>>> > With this assembly, most elements are assembled within their own process, while a few elements cross two processes.
>>> >
>>> > The BC for my problem (a cantilever under an end point load) is to fix the first two DOFs, so I call MatZeroRowsColumns to zero the first two rows and columns with the diagonal set to one, without changing the RHS.
>>> >
>>> > Now some new issues show up:
>>> >
>>> > When I run with -ksp_monitor_true_residual and -ksp_converged_reason, the monitor shows two different residuals: one is the residual norm I can select (preconditioned, unpreconditioned, natural), and the other is called the true residual.
>>> >
>>> > I initially thought the true residual was the same as the unpreconditioned one, based on its definition, but that seems not to be the case. Is it the norm of the residual (b - Ax) between the computed RHS and the true RHS? But then how should I understand the unpreconditioned residual, since its definition is b - Ax as well?
>>> >
>>> > It is the unpreconditioned residual. You must be misinterpreting. And we could determine exactly what is going on if you sent the output with the suggested options.
>>> >
>>> > Can I set the true residual as my convergence criterion?
>>> >
>>> > Use right preconditioning.
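As a concrete illustration of the "right preconditioning" suggestion, here is a command-line sketch only, again using the placeholder ./xxxx: -ksp_pc_side right makes the convergence test use the true residual b - Ax for methods that support right preconditioning, such as GMRES (CG in PETSc is normally run with left preconditioning, so switching the Krylov method is part of this sketch):

  mpiexec -n 4 ./xxxx -ksp_type gmres -ksp_pc_side right \
      -pc_type asm -pc_asm_type basic -sub_pc_type lu \
      -ksp_monitor_true_residual -ksp_converged_reason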
>>> > I found that the accuracy for a large mesh in my problem does not necessarily depend on the tolerance I set, whether preconditioned or unpreconditioned; sometimes the solver reports convergence while the solution is not correct. But the true residual seems to reflect the true convergence very well: if the true residual is diverging, no matter what the first residual says, the results are bad!
>>> >
>>> > Yes, your preconditioner looks singular. Note that BJACOBI has an inner solver, and by default it is GMRES/ILU(0). I think ILU(0) is really ill-conditioned for your problem.
>>> >
>>> > Regarding the preconditioner, I actually used BJACOBI before I sent the first email, since JACOBI or PBJACOBI didn't even converge when the size was large.
>>> > But BJACOBI also didn't perform well in parallel for a large mesh, as described in my last email, while it is fine for small sizes (below 10k elements).
>>> >
>>> > Yesterday, I tried ASM with CG using the runtime options -pc_type asm -pc_asm_type basic -sub_pc_type lu (the default is ilu).
>>> > For the 15k-element mesh, I am now able to get the correct answer with 1-3, 6, and more processes, using either -sub_pc_type lu or ilu.
>>> >
>>> > Yes, LU works for your subdomain solver.
>>> >
>>> > Based on all the results I have, the solution varies a lot with different PCs, and ASM seems better for large problems.
>>> >
>>> > It's not ASM so much as the LU subsolver that is better.
>>> >
>>> > But what is the major factor producing such differences between different PCs, since it is not just a matter of computational efficiency but also of accuracy?
>>> > Also, I noticed that for a large mesh the solution is unstable with a small number of processes: for the 15k case the solution is not correct with 4 and 5 processes, but it is always correct with more than 6 processes. For the 50k mesh, more processes are required for stability.
>>> >
>>> > Yes, partitioning is very important here. Since you do not have a good partition, you can get these wild variations.
>>> >
>>> > Thanks,
>>> >
>>> >    Matt
>>> >
>>> > What do you think about this? Anything wrong?
>>> > Since the iterative solver in parallel first computes locally (if that is correct), could it be that there are 'good' and 'bad' local blocks when dividing the global matrix, and the result from a 'bad' local block contaminates the global result? With more processes, such a risk would be reduced.
>>> >
>>> > It would be highly appreciated if you could give me some guidance on the questions above.
>>> >
>>> > Thank you very much.
>>> >
>>> > Bests,
>>> > Jinlei
>>> >
>>> > On Fri, Jul 29, 2016 at 2:09 PM, Barry Smith <[email protected]> wrote:
>>> >
>>> >    First run all the cases under valgrind to make sure there is no use of uninitialized data or overwriting of data. Go to http://www.mcs.anl.gov/petsc, follow the link to FAQ, and search for valgrind (the web server seems to be broken at the moment).
>>> >
>>> >    Second, it is possible that the code that assembles the matrices and vectors is not assembling them correctly for either the sequential or the parallel case. Hence a different number of processes could be generating a different linear system, and hence inconsistent results. How are you handling the parallelism? How do you know the matrix generated in parallel is identical to the one generated sequentially?
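One concrete way to answer that last question is to dump the assembled matrix from each run in PETSc binary format and compare the dumps in a small serial check. This is only a sketch: the file names are invented, SG is the global matrix from this thread, and A, B (Mat), viewer (PetscViewer), flg (PetscBool), and ierr (PetscErrorCode) are assumed to be declared with the other PETSc variables.

  ! In the FEM code, right after the final MatAssemblyEnd on SG:
  ! write the assembled matrix to a binary file (use a different
  ! file name for each process count, e.g. SG_np1.dat, SG_np2.dat).
  call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'SG_np2.dat', FILE_MODE_WRITE, viewer, ierr)
  call MatView(SG, viewer, ierr)
  call PetscViewerDestroy(viewer, ierr)

  ! In a separate check run on one process: load two dumps and compare them.
  call MatCreate(PETSC_COMM_SELF, A, ierr)
  call MatSetType(A, MATSEQAIJ, ierr)
  call PetscViewerBinaryOpen(PETSC_COMM_SELF, 'SG_np1.dat', FILE_MODE_READ, viewer, ierr)
  call MatLoad(A, viewer, ierr)
  call PetscViewerDestroy(viewer, ierr)

  call MatCreate(PETSC_COMM_SELF, B, ierr)
  call MatSetType(B, MATSEQAIJ, ierr)
  call PetscViewerBinaryOpen(PETSC_COMM_SELF, 'SG_np2.dat', FILE_MODE_READ, viewer, ierr)
  call MatLoad(B, viewer, ierr)
  call PetscViewerDestroy(viewer, ierr)

  call MatEqual(A, B, flg, ierr)   ! flg is PETSC_TRUE only if the matrices agree entry by entry

If small roundoff differences in the assembled entries are expected, MatAXPY (with a factor of -1) followed by MatNorm gives a tolerance-based comparison instead of this exact test.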
>>> >    Simple preconditioners such as pbjacobi will converge more and more slowly as the number of elements grows.
>>> >
>>> >    Note that you should run with -ksp_monitor_true_residual and -ksp_converged_reason to make sure the iterative solver is even converging. By default, PETSc KSP solvers do not stop with a big error message if they fail to converge, so you need to make sure they are always converging.
>>> >
>>> >    Barry
>>> >
>>> > > On Jul 29, 2016, at 11:46 AM, Jinlei Shen <[email protected]> wrote:
>>> > >
>>> > > Dear PETSc developers,
>>> > >
>>> > > Thank you for developing such a powerful tool for scientific computation.
>>> > >
>>> > > I'm currently running a simple cantilever-beam FEM to test the scalability of PETSc on multiple processors. I also want to determine whether an iterative solver or a direct solver is more efficient for large parallel FEM problems.
>>> > >
>>> > > Problem description: an Euler cantilever beam with a point load at the free end in the -y direction. Each node has 2 DOFs (deflection and rotation). MPIBAIJ is used with bs = 2; dnnz and onnz are determined from the connectivity. I loop over the elements on each process to assemble the global matrix with the same element stiffness matrix. The boundary condition is set using
>>> > >
>>> > >   call MatZeroRowsColumns(SG, 2, g_BC, one, PETSC_NULL_OBJECT, PETSC_NULL_OBJECT, ierr)
>>> > >
>>> > > Based on what I have done so far, the computations work well, i.e. the results agree with the theoretical solution, for small mesh sizes (smaller than 5000 elements) using both solvers with different numbers of processes.
>>> > >
>>> > > However, several confusing issues appear when I increase the mesh size to 10000 or more elements with the iterative solver (CG + PCBJACOBI):
>>> > >
>>> > > 1. For 10k elements, I get an accurate solution with the iterative solver on a single process. However, when I use 2-8 processes, the linear solver reports convergence with different iteration counts, but the results differ between runs and are erroneous. The weird thing is that with more than 9 processes the results are correct again. I am really confused by this. Could you explain why? If my parallelization were incorrect, why would it work for the small cases? I also checked the global matrix and RHS vector and didn't see any mallocs during assembly.
>>> > >
>>> > > 2. For 30k elements on one process, it reports: Linear solve did not converge due to DIVERGED_INDEFINITE_PC. Does this commonly happen for large sparse matrices? If so, is there a stable solver or PC for large problems?
>>> > >
>>> > > For parallel computing with the direct solver (SUPERLU_DIST + PCLU), I only get accurate results when the number of elements is below 5000. There must be something wrong. The way I use the superlu_dist solver is to first convert the MatType to AIJ, then call PCFactorSetMatSolverPackage, and change the PC to PCLU. Am I missing anything else needed to run SUPER_LU correctly?
>>> > >
>>> > > I also used SUPER_LU and the iterative solver (CG+PCBJACOBI) to solve the sequential version of the same problem. The results show that the iterative solver works well for <50k elements, while SUPER_LU only gets the right solution below 5k elements. Can I conclude that the iterative solver is better than SUPER_LU for large problems? How can I improve the solver to cope with very large problems, such as million by million?
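For reference, a minimal sketch of the SuperLU_DIST setup described above, written with 2016-era (PETSc 3.7) names: ksp, pc, SG, b, and x follow the variables mentioned in this thread, while SG_aij is an invented name for the converted AIJ matrix.

  ! SuperLU_DIST works with AIJ matrices, so convert the assembled BAIJ matrix first.
  call MatConvert(SG, MATAIJ, MAT_INITIAL_MATRIX, SG_aij, ierr)

  ! A "preconditioner only" KSP whose LU factorization is done by SuperLU_DIST.
  call KSPSetOperators(ksp, SG_aij, SG_aij, ierr)
  call KSPSetType(ksp, KSPPREONLY, ierr)
  call KSPGetPC(ksp, pc, ierr)
  call PCSetType(pc, PCLU, ierr)
  call PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST, ierr)
  call KSPSetFromOptions(ksp, ierr)
  call KSPSolve(ksp, b, x, ierr)

The same setup is what the options quoted earlier in the thread select at run time: -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist.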
>>> > > Another thing: the performance of SUPER_LU still seems questionable to me.
>>> > >
>>> > > For the inaccuracy issue, do you think it may be due to memory? However, no memory error shows up during execution.
>>> > >
>>> > > I would really appreciate it if someone could resolve the puzzles above for me. My goal is to replace the current SUPER_LU solver in my parallel CPFEM main program with an iterative solver using PETSc.
>>> > >
>>> > > Please let me know if you would like to see my code in detail.
>>> > >
>>> > > Thank you very much.
>>> > >
>>> > > Bests,
>>> > > Jinlei
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
