-pc_gamg_type classical
FYI, we only support smoothed aggregation "agg" (the default). (This thread started by saying you were using GAMG.) It is not clear how much difference this will make for you, but you don't want to use classical because we do not support it. It is meant as a reference implementation for developers.

First, how did you get the idea to use classical? If the documentation led you to believe this was a good thing to do, then we need to fix that!

Anyway, here is a generic input for GAMG:

-pc_type gamg
-pc_gamg_type agg
-pc_gamg_agg_nsmooths 1
-pc_gamg_coarse_eq_limit 1000
-pc_gamg_reuse_interpolation true
-pc_gamg_square_graph 1
-pc_gamg_threshold 0.05
-pc_gamg_threshold_scale .0

On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:

> OK, I had thought that space was a typo. btw, this option does not show up in -h.
> I changed the number of ranks to use all cores on each node to avoid a misleading ratio in -log_view. Since one node has 36 cores, I ran with 6^3=216 ranks and 12^3=1728 ranks. I also found that the call counts of MatSOR etc. in the two tests were different, so they are not strict weak scaling tests. I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the two have the same MatSOR count. Anyway, I attached the load balance output.
>
> I find PCApply_MG calls PCMGMCycle_Private, which is recursive and indirectly calls MatSOR_MPIAIJ. I believe the following code in MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ between processors through VecScatter at each MG level. If SOR and MatMultAdd are imbalanced, the cost is accumulated along MG levels and shows up as large VecScatter cost.
>
>     while (its--) {
>       VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>       VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>
>       /* update rhs: bb1 = bb - B*x */
>       VecScale(mat->lvec,-1.0);
>       (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);
>
>       /* local sweep */
>       (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
>     }
>
> --Junchao Zhang
>
> On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>> >
>> > Searched but could not find this option, -mat_view::load_balance
>>
>> There is a space between the view and the ':'; load_balance is a particular viewer format that causes the printing of load balance information about number of nonzeros in the matrix.
>>
>> Barry
>>
>> >
>> > --Junchao Zhang
>> >
>> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>> > So the only surprise in the results is the SOR. It is embarrassingly parallel and normally one would not see a jump.
>> >
>> > The load balance for SOR time, 1.5, is better at 1000 processes than the 2.1 for 125 processes, not worse, so this number doesn't easily explain it.
>> >
>> > Could you run the 125 and 1000 with -mat_view ::load_balance and see what you get out?
>> >
>> > Thanks
>> >
>> > Barry
>> >
>> > Notice that the MatSOR time jumps a lot, about 5 secs, when the -log_sync is on. My only guess is that the MatSOR is sharing memory bandwidth (or some other resource? cores?) with the VecScatter and for some reason this is worse for 1000 cores, but I don't know why.
>> >
>> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>> > >
>> > > Hi, PETSc developers,
>> > > I tested Michael Becker's code. The code calls the same KSPSolve 1000 times in the second stage and needs a cubic number of processors to run. I ran with 125 ranks and 1000 ranks, with or without the -log_sync option. I attach the log view output files and a scaling loss excel file.
>> > > I profiled the code with 125 processors. It looks like {MatSOR, MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took ~50% of the time. The other half of the time was spent waiting in MPI. MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
>> > > I tested it on a 36 cores/node machine. I found 32 ranks/node gave better performance (about 10%) than 36 ranks/node in the 125-rank test. I guess it is because processors in the former had more balanced memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and PAPI_TOT_CYC (total cycles) for the 125-rank case (see the attached files). It looks like ranks at the two ends have fewer DP_OPS and TOT_CYC.
>> > > Does anyone familiar with the algorithm have quick explanations?
>> > >
>> > > --Junchao Zhang
>> > >
>> > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>> > > Hello again,
>> > >
>> > > this took me longer than I anticipated, but here we go.
>> > > I did reruns of the cases where only half the processes per node were used (without -log_sync):
>> > >
>> > >                   125 procs,1st    125 procs,2nd    1000 procs,1st   1000 procs,2nd
>> > >                   Max       Ratio  Max       Ratio  Max       Ratio  Max       Ratio
>> > > KSPSolve          1.203E+02   1.0  1.210E+02   1.0  1.399E+02   1.1  1.365E+02   1.0
>> > > VecTDot           6.376E+00   3.7  6.551E+00   4.0  7.885E+00   2.9  7.175E+00   3.4
>> > > VecNorm           4.579E+00   7.1  5.803E+00  10.2  8.534E+00   6.9  6.026E+00   4.9
>> > > VecScale          1.070E-01   2.1  1.129E-01   2.2  1.301E-01   2.5  1.270E-01   2.4
>> > > VecCopy           1.123E-01   1.3  1.149E-01   1.3  1.301E-01   1.6  1.359E-01   1.6
>> > > VecSet            7.063E-01   1.7  6.968E-01   1.7  7.432E-01   1.8  7.425E-01   1.8
>> > > VecAXPY           1.166E+00   1.4  1.167E+00   1.4  1.221E+00   1.5  1.279E+00   1.6
>> > > VecAYPX           1.317E+00   1.6  1.290E+00   1.6  1.536E+00   1.9  1.499E+00   2.0
>> > > VecScatterBegin   6.142E+00   3.2  5.974E+00   2.8  6.448E+00   3.0  6.472E+00   2.9
>> > > VecScatterEnd     3.606E+01   4.2  3.551E+01   4.0  5.244E+01   2.7  4.995E+01   2.7
>> > > MatMult           3.561E+01   1.6  3.403E+01   1.5  3.435E+01   1.4  3.332E+01   1.4
>> > > MatMultAdd        1.124E+01   2.0  1.130E+01   2.1  2.093E+01   2.9  1.995E+01   2.7
>> > > MatMultTranspose  1.372E+01   2.5  1.388E+01   2.6  1.477E+01   2.2  1.381E+01   2.1
>> > > MatSolve          1.949E-02   0.0  1.653E-02   0.0  4.789E-02   0.0  4.466E-02   0.0
>> > > MatSOR            6.610E+01   1.3  6.673E+01   1.3  7.111E+01   1.3  7.105E+01   1.3
>> > > MatResidual       2.647E+01   1.7  2.667E+01   1.7  2.446E+01   1.4  2.467E+01   1.5
>> > > PCSetUpOnBlocks   5.266E-03   1.4  5.295E-03   1.4  5.427E-03   1.5  5.289E-03   1.4
>> > > PCApply           1.031E+02   1.0  1.035E+02   1.0  1.180E+02   1.0  1.164E+02   1.0
>> > >
>> > > I also slimmed down my code and basically wrote a simple weak scaling test (source files attached) so you can profile it yourself. I appreciate the offer Junchao, thank you.
>> > > You can adjust the system size per processor at runtime via "-nodes_per_proc 30" and the number of repeated calls to the function containing KSPSolve() via "-iterations 1000". The physical problem is simply calculating the electric potential from a homogeneous charge distribution, done multiple times to accumulate time in KSPSolve().
>> > > A job would be started using something like
>> > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \
>> > > -ksp_rtol 1E-6 \
>> > > -log_view -log_sync \
>> > > -pc_type gamg -pc_gamg_type classical \
>> > > -ksp_type cg \
>> > > -ksp_norm_type unpreconditioned \
>> > > -mg_levels_ksp_type richardson \
>> > > -mg_levels_ksp_norm_type none \
>> > > -mg_levels_pc_type sor \
>> > > -mg_levels_ksp_max_it 1 \
>> > > -mg_levels_pc_sor_its 1 \
>> > > -mg_levels_esteig_ksp_type cg \
>> > > -mg_levels_esteig_ksp_max_it 10 \
>> > > -gamg_est_ksp_type cg
>> > > , ideally started on a cube number of processes for a cubical process grid.
>> > > Using 125 processes and 10.000 iterations I get the output in "log_view_125_new.txt", which shows the same imbalance for me.
>> > > Michael
>> > >
>> > >
>> > > On 02.06.2018 at 13:40, Mark Adams wrote:
>> > >>
>> > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>> > >> Hi, Michael,
>> > >> You can add -log_sync besides -log_view, which adds barriers to certain events but measures barrier time separately from the events. I find this option makes it easier to interpret log_view output.
>> > >>
>> > >> That is great (good to know).
>> > >>
>> > >> This should give us a better idea if your large VecScatter costs are from slow communication or if it is catching some sort of load imbalance.
>> > >>
>> > >> --Junchao Zhang
>> > >>
>> > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>> > >> Barry: On its way. Could take a couple days again.
>> > >>
>> > >> Junchao: I unfortunately don't have access to a cluster with a faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree network, which I realize causes parallel slowdown if the nodes are not connected to the same switch. Each node has 24 processors (2x12/socket) and four NUMA domains (two for each socket).
>> > >> The ranks are usually not distributed perfectly evenly, i.e. for 125 processes, of the six required nodes, five would use 21 cores and one 20.
>> > >> Would using another CPU type make a difference communication-wise? I could switch to faster ones (on the same network), but I always assumed this would only improve performance of the stuff that is unrelated to communication.
>> > >>
>> > >> Michael
>> > >>
>> > >>> The log files have something like "Average time for zero size MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very slow network. A typical machine should give less than 1/10 of the latency you have. An easy way to try is just running the code on a machine with a faster network and see what happens.
>> > >>>
>> > >>> Also, how many cores & numa domains does a compute node have? I could not figure out how you distributed the 125 MPI ranks evenly.
>> > >>>
>> > >>> --Junchao Zhang
>> > >>>
>> > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>> > >>> Hello again,
>> > >>>
>> > >>> here are the updated log_view files for 125 and 1000 processors. I ran both problems twice, the first time with all processors per node allocated ("-1.txt"), the second with only half on twice the number of nodes ("-2.txt").
>> > >>>
>> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>> > >>>>>
>> > >>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
>> > >>>>>
>> > >>>> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(). Could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
>> > >>>>
>> > >>>
>> > >>> I mean this, right in the log_view output:
>> > >>>
>> > >>>> Memory usage is given in bytes:
>> > >>>>
>> > >>>> Object Type   Creations   Destructions   Memory       Descendants' Mem.
>> > >>>> Reports information only for process 0.
>> > >>>>
>> > >>>> --- Event Stage 0: Main Stage
>> > >>>>
>> > >>>> ...
>> > >>>>
>> > >>>> --- Event Stage 1: First Solve
>> > >>>>
>> > >>>> ...
>> > >>>>
>> > >>>> --- Event Stage 2: Remaining Solves
>> > >>>>
>> > >>>> Vector        23904       23904          1295501184   0.
>> > >>> I logged the exact number of KSP iterations over the 999 timesteps and it's exactly 23904/6 = 3984.
>> > >>> Michael
>> > >>>
>> > >>>
>> > >>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>> > >>>>
>> > >>>> Please send the log file for 1000 with cg as the solver.
>> > >>>>
>> > >>>> You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell with the two logs you sent us since they are for different solvers.)
>> > >>>>
>> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>> > >>>>>
>> > >>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
>> > >>>>>
>> > >>>> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(). Could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would be a measurable difference.
>> > >>>>
>> > >>>>> This seems kind of wasteful, is this supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).
>> > >>>>>
>> > >>>>> Thanks in advance.
>> > >>>>>
>> > >>>>> Michael
>> > >>>>>
>> > >>>>> <log_view_125procs.txt><log_view_1000procs.txt>
>> > >
>> > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
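
For reference, the suggestions in this thread can be combined into a single job line. What follows is a minimal sketch, not a tested configuration: it takes the mpirun command quoted above, swaps "-pc_gamg_type classical" for the generic smoothed-aggregation GAMG input given at the top of the thread, and adds the "-mat_view ::load_balance" viewer that was requested. Every flag is copied from the thread. Putting them together this way is an assumption, as are the shell variable names, and it is not known from the thread whether the eigenvalue-estimate options from the original command are still appropriate with "agg". The script only prints the assembled command so it can be inspected before launching.

```shell
# Smoothed-aggregation GAMG options, as listed in the reply at the top of the thread.
GAMG_OPTS="-pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
-pc_gamg_coarse_eq_limit 1000 -pc_gamg_reuse_interpolation true \
-pc_gamg_square_graph 1 -pc_gamg_threshold 0.05 -pc_gamg_threshold_scale .0"

# Problem, solver, smoother, and logging options copied from the quoted job line,
# plus the load-balance viewer (note the space before the "::").
RUN_OPTS="-nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 -ksp_rtol 1E-6 \
-ksp_type cg -ksp_norm_type unpreconditioned \
-mg_levels_ksp_type richardson -mg_levels_ksp_norm_type none \
-mg_levels_pc_type sor -mg_levels_ksp_max_it 1 -mg_levels_pc_sor_its 1 \
-mg_levels_esteig_ksp_type cg -mg_levels_esteig_ksp_max_it 10 \
-gamg_est_ksp_type cg \
-log_view -log_sync -mat_view ::load_balance"

# Assemble and print the command instead of running it.
CMD="mpirun -n 125 ~/petsc_ws/ws_test $RUN_OPTS $GAMG_OPTS"
printf '%s\n' "$CMD"
```

Running the same line with "-n 1000" gives the second data point of the weak scaling comparison discussed above.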