Dear Mark,
Thanks for your quick and comprehensive reply.
Before moving to the results of the experiments that you suggested, let me
clarify two points about my original e-mail and your answer:
(1) The raw timings and #iters. provided in my first e-mail were actually
obtained with "-pc_gamg_square_graph 1" (and not 0); sorry about
that, my mistake.
(The logs, though, were consistent with the solver configuration
provided.)
The raw figures with "-pc_gamg_square_graph 0" are actually as
follows (preconditioner set-up times, PCG stage times, and number of
PCG iterations, respectively):
(load3): [0.25074561, 0.3650926566, 0.6251466936, 0.8709517661, 15.52180776]
(load3): [0.148803731, 0.325266364, 0.5538515123, 0.7537377281, 1.475100923]
(load3): [8, 9, 11, 12, 12]
Bottom line: a significant improvement in absolute times for the
first four problems, and a marginal improvement for
the largest problem (compared to "-pc_gamg_square_graph 1").
(2) You wrote: "The PC setup times are large (I see 48 seconds at 16K but you
report 16). -pc_gamg_square_graph 10 should help that."
This disagreement is explained by the following note in my
original e-mail: "Please note that within each run, I execute these two
stages up to three times, and this influences the absolute timings given
in -log_view."
I tried new configurations based on your suggestions; please find the
results attached (legends indicate changes with respect to the solver
configuration provided in my first e-mail).
Bottom lines: (1) the configuration provided in my original e-mail leads
to the fastest execution and the fewest iterations for the first four
problems; (2) the (new) parameter-value combinations suggested seem to
have almost no impact on the preconditioner set-up time of the largest
problem.
I also tried HYPRE BoomerAMG, as suggested, with two different
configurations:
*** SYMMETRIC CONFIGURATION ***
-ksp_type cg
-ksp_monitor
-ksp_rtol 1.0e-6
-ksp_converged_reason
-ksp_max_it 500
-ksp_norm_type unpreconditioned
-ksp_view
-log_view
-pc_type hypre
-pc_hypre_type boomeramg
-pc_hypre_boomeramg_print_statistics 1
-pc_hypre_boomeramg_strong_threshold 0.25
-pc_hypre_boomeramg_coarsen_type HMIS
-pc_hypre_boomeramg_relax_type_down symmetric-SOR/Jacobi
-pc_hypre_boomeramg_relax_type_up symmetric-SOR/Jacobi
-pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination
*** UNSYMMETRIC CONFIGURATION ***
-ksp_type gmres
-ksp_gmres_restart 500
-ksp_monitor
-ksp_rtol 1.0e-6
-ksp_converged_reason
-ksp_max_it 500
-ksp_pc_side right
-ksp_norm_type unpreconditioned
-pc_type hypre
-pc_hypre_type boomeramg
-pc_hypre_boomeramg_print_statistics 1
-pc_hypre_boomeramg_strong_threshold 0.25
-pc_hypre_boomeramg_coarsen_type HMIS
-pc_hypre_boomeramg_relax_type_down SOR/Jacobi
-pc_hypre_boomeramg_relax_type_up SOR/Jacobi
-pc_hypre_boomeramg_relax_type_coarse Gaussian-elimination
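Both configurations above are plain runtime options, so they share the same
driver code. For reference, a minimal sketch (assuming a preassembled matrix
A and vectors b, x; the function name is illustrative) of how such an
options-driven solve looks in C:

#include <petscksp.h>

/* Solve A x = b with whatever -ksp_ and -pc_ options were
   passed on the command line (e.g., either configuration above). */
PetscErrorCode solve_with_options(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* reads the runtime options */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}

The same sketch applies to the GAMG configuration quoted further below.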
The raw results (again: set-up times, PCG stage times, and iterations) were:
*** SYMMETRIC CONFIGURATION ***
(load3): [0.1828534687, 0.3055133289, 0.3582984209, 0.4280304033, 1.343549139]
(load3): [0.2102472978, 0.4572948301, 0.7153297188, 0.9989531627, N/A]
(load3): [19, 23, 26, 28, DIVERGED_INDEFINITE_PC]
*** UNSYMMETRIC CONFIGURATION ***
(load3): [0.1841227429, 0.3082743008, 0.3652294828, 0.4654760892, 1.331299786]
(load3): [0.1194557019, 0.2830136018, 0.5046830242, 1.363314636, N/A]
(load3): [15, 19, 24, 48, DIVERGED_ITS]
Thus, the largest problem also seems to cause (even more severe) issues
for HYPRE: in particular, an INDEFINITE PRECONDITIONER with CG, and no
convergence within 500 iterations with GMRES.
The preconditioner set-up stage time, though, scales reasonably well
with the same data distribution that we used to feed GAMG (although the
preconditioner computed for the largest problem seems to be totally
useless).
I have logs for all these results if required.
Thanks for your help!
Best regards,
Alberto.
On 07/11/18 19:46, Mark Adams wrote:
First I would add -gamg_est_ksp_type cg
You seem to be converging well so I assume you are setting the null
space for GAMG.
Note, you should test hypre also.
You probably want a bigger "-pc_gamg_process_eq_limit" than 50: 200 at
least, but test your machine with a range of values on the largest problem.
This is a parameter for reducing the number of active processors (on
coarse grids).
I would only worry about "load3". This has 16K equations per process,
which is where you start noticing "strong scaling" problems, depending
on the machine.
An important parameter is "-pc_gamg_square_graph 0". I would probably
start with infinity (e.g., 10).
Now, I'm not sure about your domain, problem sizes, and thus the weak
scaling design. You seem to be scaling on the background mesh, but
that may not be a good proxy for complexity.
You can look at the number of flops and scale it appropriately by the
number of solver iterations to get a relative size of the problem. I
would recommend scaling the number of processors with this. For
instance, here are the MatMult lines for the 4 proc and 16K proc runs:
------------------------------------------------------------------------------------------------------------------------
Event    Count      Time (sec)          Flop                                --- Global ---   --- Stage ---    Total
           Max Ratio  Max      Ratio     Max   Ratio  Mess    Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
MatMult   636 1.0  1.9035e-01    1.0  3.12e+08 1.1  7.6e+03 3.0e+03 0.0e+00   0 47 62 44  0   0 47 62 44  0     6275  [2 procs]
MatMult  1416 1.0  1.9601e+00 2744.6  4.82e+08 0.0  4.3e+08 7.2e+02 0.0e+00   0 48 50 48  0   0 48 50 48  0  2757975  [16K procs]
Now, you have empty processors. See the massive load imbalance on time
and the zero on Flops. The "Ratio" is max/min, and clearly min=0, so
PETSc reports a ratio of 0 (it is really infinity).
Also, weak scaling on a thin body (I don't know your domain) is a
little funny because, as the problem scales up, the mesh becomes more 3D
and this causes the cost per equation to go up. That is why I prefer
to use the number of non-zeros as the processor scaling function, but
the number of equations is easier ...
The PC setup times are large (I see 48 seconds at 16K but you report
16). -pc_gamg_square_graph 10 should help that.
The max number of flops per processor in MatMult goes up by 50%, the
max time goes up by 10x, and the number of iterations goes up by
13/8. If I put all of this together, I get that about 75% of the
MatMult time at 16K processes is in communication. I think that, and
the absolute time, can be improved some by optimizing parameters as I've
suggested.
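A rough back-of-envelope reconstruction of that estimate from the MatMult
rows above (assuming the small run is compute-bound, so that a purely
compute-bound 16K run would take time proportional to its max flops):

  expected max time at 16K ~ 1.9035e-01 s * (4.82e+08 / 3.12e+08) ~ 0.29 s
  observed max time at 16K = 1.9601e+00 s
  non-compute fraction     ~ 1 - 0.29/1.96 ~ 0.85

i.e., something like 75-85% of the MatMult time at 16K is communication and
load imbalance rather than flops, depending on how one normalizes.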
Mark
On Wed, Nov 7, 2018 at 11:03 AM "Alberto F. Martín" via petsc-users
<petsc-users@mcs.anl.gov> wrote:
Dear All,
we are performing a weak scaling test of the PETSc (v3.9.0) GAMG
preconditioner when applied to the linear system arising
from the conforming unfitted FE discretization (using Q1
Lagrangian FEs) of a 3D Poisson problem, where
the boundary of the domain (a popcorn flake) is described as a
zero level set embedded within a uniform background
(Cartesian-like) hexahedral mesh. Details of the FEM
formulation can be made available on demand if you
believe they might be helpful, but let me just point out that
it is designed to address the well-known
ill-conditioning issues of unfitted FE discretizations due to the
small cut cell problem.
The weak scaling test is set up as follows. We start from a single
cube background mesh, and refine it uniformly several
times, until we have approximately either 10**3 (load1), 20**3
(load2), or 40**3 (load3) hexahedra per MPI task when
distributing it over 4 MPI tasks. The benchmark is scaled such
that the next larger problem to be tested is obtained
by uniformly refining the mesh from the previous scale and running
it on 8x the number of MPI tasks used at the previous scale
(uniform refinement multiplies the number of hexahedra by 8, so the
load per task stays roughly fixed). As a result, we obtain three
weak scaling curves, one for each of the three fixed loads per MPI
task above, on the following total numbers of MPI tasks: 4, 32, 262,
2097, 16777. The underlying mesh is not partitioned among
MPI tasks using ParMETIS (unstructured multilevel graph
partitioning), nor optimally by hand, but following the so-called
z-shape space-filling curves provided by an underlying octree-like
mesh handler (i.e., the p4est library).
I configured the preconditioned linear solver as follows:
-ksp_type cg
-ksp_monitor
-ksp_rtol 1.0e-6
-ksp_converged_reason
-ksp_max_it 500
-ksp_norm_type unpreconditioned
-ksp_view
-log_view
-pc_type gamg
-pc_gamg_type agg
-mg_levels_esteig_ksp_type cg
-mg_coarse_sub_pc_type cholesky
-mg_coarse_sub_pc_factor_mat_ordering_type nd
-pc_gamg_process_eq_limit 50
-pc_gamg_square_graph 0
-pc_gamg_agg_nsmooths 1
Raw timings (in seconds) of the preconditioner set up and PCG
iterative solution stage, and number of iterations are as follows:
**preconditioner set up**
(load1): [0.02542160451, 0.05169247743, 0.09266782179,
0.2426272957, 13.64161944]
(load2): [0.1239175797 , 0.1885528499 , 0.2719282564 ,
0.4783878336, 13.37947339]
(load3): [0.6565349903 , 0.9435049873 , 1.299908397 ,
1.916243652 , 16.02904088]
**PCG stage**
(load1): [0.003287350759, 0.008163803257, 0.03565631993,
0.08343045413, 0.6937994603]
(load2): [0.0205939794 , 0.03594723623 , 0.07593298424,
0.1212046621 , 0.6780373845]
(load3): [0.1310882876 , 0.3214917686 , 0.5532023879 ,
0.766881627 , 1.485446003]
**number of PCG iterations**
(load1): [5, 8, 11, 13, 13]
(load2): [7, 10, 12, 13, 13]
(load3): [8, 10, 12, 13, 13]
It can be observed that both the number of linear solver
iterations and the PCG stage timings (weakly)
scale remarkably well, but there is a significant time increase in
the preconditioner set-up stage when scaling the problem from 2097
to 16777 MPI tasks (e.g., 1.916243652 vs. 16.02904088 sec. with
40**3 cells per MPI task).
I gathered the combined output of -ksp_view and -log_view (only)
for all the points of the load3 weak scaling
test (find them attached to this message). Please note that within
each run, I execute these two stages up to
three times, and this influences the absolute timings given in -log_view.
Looking at the output of -log_view, it is very strange to me,
e.g., that the stage labelled "Graph"
does not scale properly, as it is just a call to MatDuplicate if
the block size of the matrix is 1 (our case), and
I guess that it is just a local operation that does not require
any communication.
What am I missing here? The load does not seem to be unbalanced,
judging from the "Ratio" column.
I wonder whether the observed behaviour is as expected, or whether
this is a misconfiguration of the solver on our side.
I played (quite a lot) with several parameter-value combinations,
and the configuration above is the one that led to the fastest
execution (among those tested, a list that might be incomplete; I
can also provide further details if helpful).
Any feedback from your experience that helps us find the
cause(s) of this issue and a way to mitigate it
will be highly appreciated.
Thanks very much in advance!
Best regards,
Alberto.
--
Alberto F. Martín-Huertas
Senior Researcher, PhD. Computational Science
Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)
Parc Mediterrani de la Tecnologia, UPC
Esteve Terradas 5, Building C3, Office 215,
08860 Castelldefels (Barcelona, Spain)
Tel.: (+34) 9341 34223
e-mail: amar...@cimne.upc.edu
FEMPAR project co-founder
web: http://www.fempar.org
--
Alberto F. Martín-Huertas
Senior Researcher, PhD. Computational Science
Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE)
Parc Mediterrani de la Tecnologia, UPC
Esteve Terradas 5, Building C3, Office 215,
08860 Castelldefels (Barcelona, Spain)
Tel.: (+34) 9341 34223
e-mail: amar...@cimne.upc.edu
FEMPAR project co-founder
web: http://www.fempar.org
________________
IMPORTANT NOTICE
All personal data contained on this mail will be processed confidentially and
registered in a file property of CIMNE in
order to manage corporate communications. You may exercise the rights of
access, rectification, erasure and object by
letter sent to Ed. C1 Campus Norte UPC. Gran Capitán s/n Barcelona.
(The three (load3) lines of each experiment below report, in order:
preconditioner set-up times, PCG stage times, and number of PCG iterations.)
(A) ** Added NearNullSpace to matrix (i.e. the constant vector) **
(load3): [0.2512322953, 0.3657070249, 0.6209384622, 0.8898622398, 16.37409958]
(load3): [0.1474562958, 0.3245896269, 0.551462595 , 0.7768286369, 1.563904478]
(load3): [8, 9, 11, 12, 12]
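For concreteness, attaching the constant near-null space is done through
PETSc's MatNullSpace API; a minimal sketch, assuming the assembled system
matrix A (the helper name is illustrative):

#include <petscmat.h>

/* Attach the constant vector as the near-null space that GAMG
   uses when building its aggregates/prolongator. */
PetscErrorCode set_constant_near_nullspace(Mat A)
{
  MatNullSpace   nns;
  PetscErrorCode ierr;

  /* has_cnst = PETSC_TRUE: the space is spanned by the constant vector */
  ierr = MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nns);CHKERRQ(ierr);
  ierr = MatSetNearNullSpace(A, nns);CHKERRQ(ierr);
  ierr = MatNullSpaceDestroy(&nns);CHKERRQ(ierr);
  return 0;
}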
(B) ** (A) + -gamg_est_ksp_type cg**
(load3): [0.2532081502, 0.3669248847, 0.6215682998, 0.9122101571, 15.82921874]
(load3): [0.1476225629, 0.3242742592, 0.5494060389, 0.793106758, 1.541510889]
(load3): [8, 9, 11, 12, 12]
(C) ** (B) + -pc_gamg_square_graph 10**
(load3): [0.7063658834, 1.045530763, 1.403756126, 1.903321964, 16.91176975]
(load3): [0.1308690757, 0.3190896986, 0.5635806862, 0.790503782, 1.528392129]
(load3): [8, 10, 12, 14, 15]
(D) ** (C) + -pc_gamg_process_eq_limit 200**
(load3): [0.7066891911, 1.041900044, 1.438325046, 2.154289208, 15.54656001]
(load3): [0.1325668963, 0.3205731977, 0.5486685866, 0.8334027417, 1.485407834]
(load3): [8, 10, 12, 14, 15]
(E) ** (C) + -pc_gamg_process_eq_limit 500**
(load3): [0.7349723065, 1.084142983, 1.562717193, 2.198781526, 16.83547859]
(load3): [0.1336050248, 0.3177526584, 0.5764533961, 0.8126104074, 1.661861523]
(load3): [8, 10, 12, 14, 15]
(F) ** (C) + -pc_gamg_process_eq_limit 1000**
(load3): [0.739308523, 1.117045472, 1.54470065, 2.845281176, 16.66935678]
(load3): [0.1373377964, 0.3255409142, 0.5619245535, 0.8124665194, 1.660140919]
(load3): [8, 10, 12, 13, 15]