On Wed, Mar 2, 2016 at 7:15 PM, Justin Chang <[email protected]> wrote:

> Barry,
>
> Attached are the log_summary outputs for each preconditioner.
>

MatPtAP takes all the time. It looks like there is no coarsening at all at
the first level. Mark, can you see what is going on here?

   Matt

> Thanks,
> Justin
>
>
> On Wednesday, March 2, 2016, Barry Smith <[email protected]> wrote:
>
>>
>>   Justin,
>>
>>     Do you have the -log_summary output for these runs?
>>
>>   Barry
>>
>> > On Mar 2, 2016, at 4:28 PM, Justin Chang <[email protected]> wrote:
>> >
>> > Dear all,
>> >
>> > Using the firedrake project, I am solving this simple mixed Poisson problem:
>> >
>> > mesh = UnitCubeMesh(40,40,40)
>> > V = FunctionSpace(mesh,"RT",1)
>> > Q = FunctionSpace(mesh,"DG",0)
>> > W = V*Q
>> >
>> > v, p = TrialFunctions(W)
>> > w, q = TestFunctions(W)
>> >
>> > f = Function(Q)
>> > f.interpolate(Expression("12*pi*pi*sin(pi*x[0]*2)*sin(pi*x[1]*2)*sin(2*pi*x[2])"))
>> >
>> > a = dot(v,w)*dx - p*div(w)*dx + div(v)*q*dx
>> > L = f*q*dx
>> >
>> > u = Function(W)
>> > solve(a==L,u,solver_parameters={...})
>> >
>> > This problem has 1161600 degrees of freedom. The solver_parameters are:
>> >
>> > -ksp_type gmres
>> > -pc_type fieldsplit
>> > -pc_fieldsplit_type schur
>> > -pc_fieldsplit_schur_fact_type upper
>> > -pc_fieldsplit_schur_precondition selfp
>> > -fieldsplit_0_ksp_type preonly
>> > -fieldsplit_0_pc_type bjacobi
>> > -fieldsplit_1_ksp_type preonly
>> > -fieldsplit_1_pc_type hypre/ml/gamg
>> >
>> > For the last option, I compared the wall-clock timings for hypre, ml, and gamg. Here are the strong-scaling results (from 1 up to 64 cores, with 8 cores per Intel Xeon E5-2670 node):
>> >
>> > hypre:
>> > 1 core:   47.5 s, 12 solver iters
>> > 2 cores:  34.1 s, 15 solver iters
>> > 4 cores:  21.5 s, 15 solver iters
>> > 8 cores:  16.6 s, 15 solver iters
>> > 16 cores: 10.2 s, 15 solver iters
>> > 24 cores: 7.66 s, 15 solver iters
>> > 32 cores: 6.31 s, 15 solver iters
>> > 40 cores: 5.68 s, 15 solver iters
>> > 48 cores: 5.36 s, 16 solver iters
>> > 56 cores: 5.12 s, 16 solver iters
>> > 64 cores: 4.99 s, 16 solver iters
>> >
>> > ml:
>> > 1 core:   4.44 s, 14 solver iters
>> > 2 cores:  2.85 s, 16 solver iters
>> > 4 cores:  1.6 s, 17 solver iters
>> > 8 cores:  0.966 s, 17 solver iters
>> > 16 cores: 0.585 s, 18 solver iters
>> > 24 cores: 0.440 s, 18 solver iters
>> > 32 cores: 0.375 s, 18 solver iters
>> > 40 cores: 0.332 s, 18 solver iters
>> > 48 cores: 0.307 s, 17 solver iters
>> > 56 cores: 0.290 s, 18 solver iters
>> > 64 cores: 0.281 s, 18 solver iters
>> >
>> > gamg:
>> > 1 core:   613 s, 12 solver iters
>> > 2 cores:  204 s, 15 solver iters
>> > 4 cores:  77.1 s, 15 solver iters
>> > 8 cores:  38.1 s, 15 solver iters
>> > 16 cores: 15.9 s, 16 solver iters
>> > 24 cores: 9.24 s, 16 solver iters
>> > 32 cores: 5.92 s, 16 solver iters
>> > 40 cores: 4.72 s, 16 solver iters
>> > 48 cores: 3.89 s, 16 solver iters
>> > 56 cores: 3.65 s, 16 solver iters
>> > 64 cores: 3.46 s, 16 solver iters
>> >
>> > The performance difference between ML and HYPRE makes sense to me, but what I am really confused about is GAMG. It seems GAMG is really slow on a single core, but something internally is causing it to speed up super-linearly as I increase the number of MPI processes. Shouldn't ML and GAMG have the same performance?
>> > I am not sure what log outputs to give you guys, but for starters, below is -ksp_view for the single core case with GAMG
>> >
>> > KSP Object:(solver_) 1 MPI processes
>> >   type: gmres
>> >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>> >     GMRES: happy breakdown tolerance 1e-30
>> >   maximum iterations=10000, initial guess is zero
>> >   tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
>> >   left preconditioning
>> >   using PRECONDITIONED norm type for convergence test
>> > PC Object:(solver_) 1 MPI processes
>> >   type: fieldsplit
>> >     FieldSplit with Schur preconditioner, factorization UPPER
>> >     Preconditioner for the Schur complement formed from Sp, an assembled approximation to S, which uses (lumped, if requested) A00's diagonal's inverse
>> >     Split info:
>> >     Split number 0 Defined by IS
>> >     Split number 1 Defined by IS
>> >     KSP solver for A00 block
>> >       KSP Object: (solver_fieldsplit_0_) 1 MPI processes
>> >         type: preonly
>> >         maximum iterations=10000, initial guess is zero
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_0_) 1 MPI processes
>> >         type: bjacobi
>> >           block Jacobi: number of blocks = 1
>> >           Local solve is same for all blocks, in the following KSP and PC objects:
>> >           KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >             type: preonly
>> >             maximum iterations=10000, initial guess is zero
>> >             tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using NONE norm type for convergence test
>> >           PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >             type: ilu
>> >               ILU: out-of-place factorization
>> >               0 levels of fill
>> >               tolerance for zero pivot 2.22045e-14
>> >               matrix ordering: natural
>> >               factor fill ratio given 1., needed 1.
>> >                 Factored matrix follows:
>> >                   Mat Object: 1 MPI processes
>> >                     type: seqaij
>> >                     rows=777600, cols=777600
>> >                     package used to perform factorization: petsc
>> >                     total: nonzeros=5385600, allocated nonzeros=5385600
>> >                     total number of mallocs used during MatSetValues calls =0
>> >                       not using I-node routines
>> >             linear system matrix = precond matrix:
>> >             Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >               type: seqaij
>> >               rows=777600, cols=777600
>> >               total: nonzeros=5385600, allocated nonzeros=5385600
>> >               total number of mallocs used during MatSetValues calls =0
>> >                 not using I-node routines
>> >         linear system matrix = precond matrix:
>> >         Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >           type: seqaij
>> >           rows=777600, cols=777600
>> >           total: nonzeros=5385600, allocated nonzeros=5385600
>> >           total number of mallocs used during MatSetValues calls =0
>> >             not using I-node routines
>> >     KSP solver for S = A11 - A10 inv(A00) A01
>> >       KSP Object: (solver_fieldsplit_1_) 1 MPI processes
>> >         type: preonly
>> >         maximum iterations=10000, initial guess is zero
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_) 1 MPI processes
>> >         type: gamg
>> >           MG: type is MULTIPLICATIVE, levels=5 cycles=v
>> >             Cycles per PCApply=1
>> >             Using Galerkin computed coarse grid matrices
>> >             GAMG specific options
>> >               Threshold for dropping small values from graph 0.
>> >               AGG specific options
>> >                 Symmetric graph false
>> >     Coarse grid solver -- level -------------------------------
>> >       KSP Object: (solver_fieldsplit_1_mg_coarse_) 1 MPI processes
>> >         type: preonly
>> >         maximum iterations=1, initial guess is zero
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_mg_coarse_) 1 MPI processes
>> >         type: bjacobi
>> >           block Jacobi: number of blocks = 1
>> >           Local solve is same for all blocks, in the following KSP and PC objects:
>> >           KSP Object: (solver_fieldsplit_1_mg_coarse_sub_) 1 MPI processes
>> >             type: preonly
>> >             maximum iterations=1, initial guess is zero
>> >             tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using NONE norm type for convergence test
>> >           PC Object: (solver_fieldsplit_1_mg_coarse_sub_) 1 MPI processes
>> >             type: lu
>> >               LU: out-of-place factorization
>> >               tolerance for zero pivot 2.22045e-14
>> >               using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
>> >               matrix ordering: nd
>> >               factor fill ratio given 5., needed 1.
>> >                 Factored matrix follows:
>> >                   Mat Object: 1 MPI processes
>> >                     type: seqaij
>> >                     rows=9, cols=9
>> >                     package used to perform factorization: petsc
>> >                     total: nonzeros=81, allocated nonzeros=81
>> >                     total number of mallocs used during MatSetValues calls =0
>> >                       using I-node routines: found 2 nodes, limit used is 5
>> >             linear system matrix = precond matrix:
>> >             Mat Object: 1 MPI processes
>> >               type: seqaij
>> >               rows=9, cols=9
>> >               total: nonzeros=81, allocated nonzeros=81
>> >               total number of mallocs used during MatSetValues calls =0
>> >                 using I-node routines: found 2 nodes, limit used is 5
>> >         linear system matrix = precond matrix:
>> >         Mat Object: 1 MPI processes
>> >           type: seqaij
>> >           rows=9, cols=9
>> >           total: nonzeros=81, allocated nonzeros=81
>> >           total number of mallocs used during MatSetValues calls =0
>> >             using I-node routines: found 2 nodes, limit used is 5
>> >     Down solver (pre-smoother) on level 1 -------------------------------
>> >       KSP Object: (solver_fieldsplit_1_mg_levels_1_) 1 MPI processes
>> >         type: chebyshev
>> >           Chebyshev: eigenvalue estimates: min = 0.0999525, max = 1.09948
>> >           Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
>> >           KSP Object: (solver_fieldsplit_1_mg_levels_1_esteig_) 1 MPI processes
>> >             type: gmres
>> >               GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>> >               GMRES: happy breakdown tolerance 1e-30
>> >             maximum iterations=10, initial guess is zero
>> >             tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using PRECONDITIONED norm type for convergence test
>> >         maximum iterations=2
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using nonzero initial guess
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_mg_levels_1_) 1 MPI processes
>> >         type: sor
>> >           SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
>> >         linear system matrix = precond matrix:
>> >         Mat Object: 1 MPI processes
>> >           type: seqaij
>> >           rows=207, cols=207
>> >           total: nonzeros=42849, allocated nonzeros=42849
>> >           total number of mallocs used during MatSetValues calls =0
>> >             using I-node routines: found 42 nodes, limit used is 5
>> >     Up solver (post-smoother) same as down solver (pre-smoother)
>> >     Down solver (pre-smoother) on level 2 -------------------------------
>> >       KSP Object: (solver_fieldsplit_1_mg_levels_2_) 1 MPI processes
>> >         type: chebyshev
>> >           Chebyshev: eigenvalue estimates: min = 0.0996628, max = 1.09629
>> >           Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
>> >           KSP Object: (solver_fieldsplit_1_mg_levels_2_esteig_) 1 MPI processes
>> >             type: gmres
>> >               GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>> >               GMRES: happy breakdown tolerance 1e-30
>> >             maximum iterations=10, initial guess is zero
>> >             tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using PRECONDITIONED norm type for convergence test
>> >         maximum iterations=2
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using nonzero initial guess
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_mg_levels_2_) 1 MPI processes
>> >         type: sor
>> >           SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
>> >         linear system matrix = precond matrix:
>> >         Mat Object: 1 MPI processes
>> >           type: seqaij
>> >           rows=5373, cols=5373
>> >           total: nonzeros=28852043, allocated nonzeros=28852043
>> >           total number of mallocs used during MatSetValues calls =0
>> >             using I-node routines: found 1481 nodes, limit used is 5
>> >     Up solver (post-smoother) same as down solver (pre-smoother)
>> >     Down solver (pre-smoother) on level 3 -------------------------------
>> >       KSP Object: (solver_fieldsplit_1_mg_levels_3_) 1 MPI processes
>> >         type: chebyshev
>> >           Chebyshev: eigenvalue estimates: min = 0.0994294, max = 1.09372
>> >           Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
>> >           KSP Object: (solver_fieldsplit_1_mg_levels_3_esteig_) 1 MPI processes
>> >             type: gmres
>> >               GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>> >               GMRES: happy breakdown tolerance 1e-30
>> >             maximum iterations=10, initial guess is zero
>> >             tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using PRECONDITIONED norm type for convergence test
>> >         maximum iterations=2
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using nonzero initial guess
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_mg_levels_3_) 1 MPI processes
>> >         type: sor
>> >           SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
>> >         linear system matrix = precond matrix:
>> >         Mat Object: 1 MPI processes
>> >           type: seqaij
>> >           rows=52147, cols=52147
>> >           total: nonzeros=38604909, allocated nonzeros=38604909
>> >           total number of mallocs used during MatSetValues calls =2
>> >             not using I-node routines
>> >     Up solver (post-smoother) same as down solver (pre-smoother)
>> >     Down solver (pre-smoother) on level 4 -------------------------------
>> >       KSP Object: (solver_fieldsplit_1_mg_levels_4_) 1 MPI processes
>> >         type: chebyshev
>> >           Chebyshev: eigenvalue estimates: min = 0.158979, max = 1.74876
>> >           Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
>> >           KSP Object: (solver_fieldsplit_1_mg_levels_4_esteig_) 1 MPI processes
>> >             type: gmres
>> >               GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>> >               GMRES: happy breakdown tolerance 1e-30
>> >             maximum iterations=10, initial guess is zero
>> >             tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using PRECONDITIONED norm type for convergence test
>> >         maximum iterations=2
>> >         tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >         left preconditioning
>> >         using nonzero initial guess
>> >         using NONE norm type for convergence test
>> >       PC Object: (solver_fieldsplit_1_mg_levels_4_) 1 MPI processes
>> >         type: sor
>> >           SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
>> >         linear system matrix followed by preconditioner matrix:
>> >         Mat Object: (solver_fieldsplit_1_) 1 MPI processes
>> >           type: schurcomplement
>> >           rows=384000, cols=384000
>> >             Schur complement A11 - A10 inv(A00) A01
>> >             A11
>> >               Mat Object: (solver_fieldsplit_1_) 1 MPI processes
>> >                 type: seqaij
>> >                 rows=384000, cols=384000
>> >                 total: nonzeros=384000, allocated nonzeros=384000
>> >                 total number of mallocs used during MatSetValues calls =0
>> >                   not using I-node routines
>> >             A10
>> >               Mat Object: 1 MPI processes
>> >                 type: seqaij
>> >                 rows=384000, cols=777600
>> >                 total: nonzeros=1919999, allocated nonzeros=1919999
>> >                 total number of mallocs used during MatSetValues calls =0
>> >                   not using I-node routines
>> >             KSP of A00
>> >               KSP Object: (solver_fieldsplit_0_) 1 MPI processes
>> >                 type: preonly
>> >                 maximum iterations=10000, initial guess is zero
>> >                 tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >                 left preconditioning
>> >                 using NONE norm type for convergence test
>> >               PC Object: (solver_fieldsplit_0_) 1 MPI processes
>> >                 type: bjacobi
>> >                   block Jacobi: number of blocks = 1
>> >                   Local solve is same for all blocks, in the following KSP and PC objects:
>> >                   KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >                     type: preonly
>> >                     maximum iterations=10000, initial guess is zero
>> >                     tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >                     left preconditioning
>> >                     using NONE norm type for convergence test
>> >                   PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >                     type: ilu
>> >                       ILU: out-of-place factorization
>> >                       0 levels of fill
>> >                       tolerance for zero pivot 2.22045e-14
>> >                       matrix ordering: natural
>> >                       factor fill ratio given 1., needed 1.
>> >                         Factored matrix follows:
>> >                           Mat Object: 1 MPI processes
>> >                             type: seqaij
>> >                             rows=777600, cols=777600
>> >                             package used to perform factorization: petsc
>> >                             total: nonzeros=5385600, allocated nonzeros=5385600
>> >                             total number of mallocs used during MatSetValues calls =0
>> >                               not using I-node routines
>> >                     linear system matrix = precond matrix:
>> >                     Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >                       type: seqaij
>> >                       rows=777600, cols=777600
>> >                       total: nonzeros=5385600, allocated nonzeros=5385600
>> >                       total number of mallocs used during MatSetValues calls =0
>> >                         not using I-node routines
>> >                 linear system matrix = precond matrix:
>> >                 Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >                   type: seqaij
>> >                   rows=777600, cols=777600
>> >                   total: nonzeros=5385600, allocated nonzeros=5385600
>> >                   total number of mallocs used during MatSetValues calls =0
>> >                     not using I-node routines
>> >             A01
>> >               Mat Object: 1 MPI processes
>> >                 type: seqaij
>> >                 rows=777600, cols=384000
>> >                 total: nonzeros=1919999, allocated nonzeros=1919999
>> >                 total number of mallocs used during MatSetValues calls =0
>> >                   not using I-node routines
>> >         Mat Object: 1 MPI processes
>> >           type: seqaij
>> >           rows=384000, cols=384000
>> >           total: nonzeros=3416452, allocated nonzeros=3416452
>> >           total number of mallocs used during MatSetValues calls =0
>> >             not using I-node routines
>> >     Up solver (post-smoother) same as down solver (pre-smoother)
>> >     linear system matrix followed by preconditioner matrix:
>> >     Mat Object: (solver_fieldsplit_1_) 1 MPI processes
>> >       type: schurcomplement
>> >       rows=384000, cols=384000
>> >         Schur complement A11 - A10 inv(A00) A01
>> >         A11
>> >           Mat Object: (solver_fieldsplit_1_) 1 MPI processes
>> >             type: seqaij
>> >             rows=384000, cols=384000
>> >             total: nonzeros=384000, allocated nonzeros=384000
>> >             total number of mallocs used during MatSetValues calls =0
>> >               not using I-node routines
>> >         A10
>> >           Mat Object: 1 MPI processes
>> >             type: seqaij
>> >             rows=384000, cols=777600
>> >             total: nonzeros=1919999, allocated nonzeros=1919999
>> >             total number of mallocs used during MatSetValues calls =0
>> >               not using I-node routines
>> >         KSP of A00
>> >           KSP Object: (solver_fieldsplit_0_) 1 MPI processes
>> >             type: preonly
>> >             maximum iterations=10000, initial guess is zero
>> >             tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >             left preconditioning
>> >             using NONE norm type for convergence test
>> >           PC Object: (solver_fieldsplit_0_) 1 MPI processes
>> >             type: bjacobi
>> >               block Jacobi: number of blocks = 1
>> >               Local solve is same for all blocks, in the following KSP and PC objects:
>> >               KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >                 type: preonly
>> >                 maximum iterations=10000, initial guess is zero
>> >                 tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>> >                 left preconditioning
>> >                 using NONE norm type for convergence test
>> >               PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
>> >                 type: ilu
>> >                   ILU: out-of-place factorization
>> >                   0 levels of fill
>> >                   tolerance for zero pivot 2.22045e-14
>> >                   matrix ordering: natural
>> >                   factor fill ratio given 1., needed 1.
>> >                     Factored matrix follows:
>> >                       Mat Object: 1 MPI processes
>> >                         type: seqaij
>> >                         rows=777600, cols=777600
>> >                         package used to perform factorization: petsc
>> >                         total: nonzeros=5385600, allocated nonzeros=5385600
>> >                         total number of mallocs used during MatSetValues calls =0
>> >                           not using I-node routines
>> >                 linear system matrix = precond matrix:
>> >                 Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >                   type: seqaij
>> >                   rows=777600, cols=777600
>> >                   total: nonzeros=5385600, allocated nonzeros=5385600
>> >                   total number of mallocs used during MatSetValues calls =0
>> >                     not using I-node routines
>> >             linear system matrix = precond matrix:
>> >             Mat Object: (solver_fieldsplit_0_) 1 MPI processes
>> >               type: seqaij
>> >               rows=777600, cols=777600
>> >               total: nonzeros=5385600, allocated nonzeros=5385600
>> >               total number of mallocs used during MatSetValues calls =0
>> >                 not using I-node routines
>> >         A01
>> >           Mat Object: 1 MPI processes
>> >             type: seqaij
>> >             rows=777600, cols=384000
>> >             total: nonzeros=1919999, allocated nonzeros=1919999
>> >             total number of mallocs used during MatSetValues calls =0
>> >               not using I-node routines
>> >     Mat Object: 1 MPI processes
>> >       type: seqaij
>> >       rows=384000, cols=384000
>> >       total: nonzeros=3416452, allocated nonzeros=3416452
>> >       total number of mallocs used during MatSetValues calls =0
>> >         not using I-node routines
>> >   linear system matrix = precond matrix:
>> >   Mat Object: 1 MPI processes
>> >     type: nest
>> >     rows=1161600, cols=1161600
>> >       Matrix object:
>> >         type=nest, rows=2, cols=2
>> >         MatNest structure:
>> >         (0,0) : prefix="solver_fieldsplit_0_", type=seqaij, rows=777600, cols=777600
>> >         (0,1) : type=seqaij, rows=777600, cols=384000
>> >         (1,0) : type=seqaij, rows=384000, cols=777600
>> >         (1,1) : prefix="solver_fieldsplit_1_", type=seqaij, rows=384000, cols=384000
>> >
>> > Any insight as to what's happening? Btw, this firedrake/petsc-mapdes is from way back in October 2015 (yes, much has changed since, but reinstalling/updating firedrake and petsc on LANL's firewalled HPC machines is a big pain in the ass).
>> >
>> > Thanks,
>> > Justin
>>
>>

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
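
For anyone trying to reproduce the setup: the dict that is elided as solver_parameters={...} in the quoted script would, for the options Justin lists, look roughly like the sketch below. This is only an illustration assembled from the quoted options, not taken from the original script; the variable name "parameters" is made up here, and the solver_ prefix that appears in the -ksp_view output comes from whatever options prefix the solver was created with, so it does not show up in the dict keys.

# Sketch of the elided solver_parameters dict, assembled from the options
# quoted above (gamg shown for the Schur block; swap in "hypre" or "ml"
# for the other two runs).
parameters = {
    "ksp_type": "gmres",
    "pc_type": "fieldsplit",
    "pc_fieldsplit_type": "schur",
    "pc_fieldsplit_schur_fact_type": "upper",
    "pc_fieldsplit_schur_precondition": "selfp",
    "fieldsplit_0_ksp_type": "preonly",
    "fieldsplit_0_pc_type": "bjacobi",
    "fieldsplit_1_ksp_type": "preonly",
    "fieldsplit_1_pc_type": "gamg",
}

# a, L, u are the forms and Function from the quoted script.
solve(a == L, u, solver_parameters=parameters)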

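A quick back-of-the-envelope check on the "super-linear" observation, using only the 1-core and 64-core timings quoted in Justin's message (illustrative arithmetic, nothing more):

# Strong-scaling speedup = 1-core time / 64-core time, from the quoted timings.
timings = {
    "hypre": (47.5, 4.99),
    "ml": (4.44, 0.281),
    "gamg": (613.0, 3.46),
}
for pc, (t1, t64) in timings.items():
    print(f"{pc}: {t1 / t64:.1f}x speedup on 64 cores")
# hypre ~9.5x, ml ~15.8x, gamg ~177x: GAMG's apparently super-linear scaling
# comes from its unusually slow 1-core run (the MatPtAP setup cost noted
# above), not from better parallel efficiency at 64 cores.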