> On Jun 13, 2018, at 1:09 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> 
> Mark,
>   Yes, it is a 7-point stencil. I tried your options, -pc_gamg_square_graph 0 
> -pc_gamg_threshold 0.0 -pc_gamg_repartition, and found they increased the 
> time. I did not try hypre since I don't know how to set its options.
>   I also tried periodic boundary conditions and ran with -mat_view 
> ::load_balance. It gives fewer KSP iterations, but PETSc still reports load 
> imbalance at the coarse levels.

   Is the overall scaling better, worse, or the same for periodic boundary 
conditions?

>   
> 
> --Junchao Zhang
> 
> On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams <mfad...@lbl.gov> wrote:
> This all looks reasonable to me. The VecScatter times are a little high, but 
> these are fast little solves (0.2 seconds each).
> 
> The RAP times are very low, suggesting we could optimize parameters a bit and 
> reduce the iteration count. These are 7-point stencils as I recall. You could 
> try -pc_gamg_square_graph 0 (instead of 1), and you probably want 
> '-pc_gamg_threshold 0.0'. You could also test hypre.
> 
> And you should be able to improve the coarse-grid load imbalance with 
> -pc_gamg_repartition.
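> 
> For example, a sketch of one way to combine those tweaks on the command line 
> (the executable name and process count are just placeholders):
> 
>   mpiexec -n 216 ./ws_test -ksp_type cg -pc_type gamg -pc_gamg_type agg \
>     -pc_gamg_square_graph 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition \
>     -log_view
> 
> And one way to try hypre (assuming PETSc was configured with --download-hypre):
> 
>   mpiexec -n 216 ./ws_test -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg -log_view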
> 
> Mark
> 
> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
> Mark,
>   I tried "-pc_gamg_type agg ..." options you mentioned, and also -ksp_type 
> cg + PETSc's default PC bjacobi. In the latter case, to reduce execution time 
> I called KSPSolve 100 times instead of 1000, and also used -ksp_max_it 100. 
> In the 36x48=1728 ranks case, I also did a test with -log_sync. From there 
> you can see a lot of time is spent on VecNormBarrier, which implies load 
> imbalance. Note VecScatterBarrie time is misleading, since it barriers ALL 
> ranks, but in reality VecScatter sort of syncs in a small neighborhood.
>   Barry suggested trying periodic boundary condition so that the nonzeros are 
> perfectly balanced across processes. I will try that to see what happens.
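> 
> A minimal sketch of what the periodic setup could look like (this is not the 
> actual test code; the grid sizes are placeholders and error checking is 
> omitted). With DM_BOUNDARY_PERIODIC in all directions, every row of the 
> 7-point operator has the same number of nonzeros:
> 
> #include <petscdmda.h>
> 
> int main(int argc, char **argv)
> {
>   DM da;
>   PetscInitialize(&argc, &argv, NULL, NULL);
>   /* Periodic in all three directions, 7-point (star) stencil, stencil width 1 */
>   DMDACreate3d(PETSC_COMM_WORLD,
>                DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC, DM_BOUNDARY_PERIODIC,
>                DMDA_STENCIL_STAR,
>                180, 180, 180,                             /* global grid size (placeholder) */
>                PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,  /* let PETSc pick the process grid */
>                1, 1, NULL, NULL, NULL, &da);
>   DMSetFromOptions(da);
>   DMSetUp(da);
>   /* ... create the matrix from the DMDA and assemble the 7-point operator ... */
>   DMDestroy(&da);
>   PetscFinalize();
>   return 0;
> }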
> 
> --Junchao Zhang
> 
> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> 
> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
> I used an LCRC machine named Bebop and tested on its Intel Broadwell nodes. 
> Each node has 2 CPUs and 36 cores in total. I collected data using either 36 cores 
> per node or 18 cores per node. As you can see, 18 cores/node gave much 
> better performance, which is reasonable since routines like MatSOR, MatMult, and 
> MatMultAdd are all memory-bandwidth bound.
> 
> The code uses a DMDA 3D grid with a 7-point stencil, and defines nodes (vertices) at 
> the surface or one layer inside the surface as boundary nodes. Boundary nodes only 
> have a single diagonal entry in their matrix row, while interior nodes have 7 
> nonzeros in their row. Processors on the boundary of the processor grid therefore 
> have fewer nonzeros. This is one source of load imbalance. Will the load imbalance 
> get worse on the coarser grids of the MG hierarchy?
> 
> Yes. 
> 
> You can use a simple Jacobi solver to see the basic performance of your 
> operator and machine. Do you see as much time spent in VecScatters? VecAXPY? 
> etc.
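> 
> Something like this would do it (a sketch only; the executable, process count, 
> and problem-size options are placeholders taken from elsewhere in this thread):
> 
>   mpiexec -n 216 ./ws_test -nodes_per_proc 30 -iterations 100 \
>     -ksp_type cg -pc_type jacobi -ksp_max_it 100 -log_view -log_sync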
>  
> 
> I attach a trace-view figure that shows the activity of each rank along the time 
> axis during one KSPSolve. White means MPI wait, and you can see that white takes 
> up a large share of the time. 
> 
> I don't have a good explanation for why processors wait longer at large scale 
> (1728 cores), since the communication pattern is still a 7-point stencil on a 
> cubic processor grid.
> 
> --Junchao Zhang
> 
> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> 
>   Junchao,
> 
>       Thanks, the load balance of matrix entries is remarkably similar for 
> the two runs, so worse work-load imbalance in SOR for the larger case cannot 
> explain why the SOR takes more time. 
> 
>       Here is my guess (and I know of no way to confirm it). In the smaller case 
> the overlap of different processes on the same node running SOR at the same 
> time is lower than in the larger case, hence the larger case is slower because 
> there are more SOR processes fighting over the same memory bandwidth at the 
> same time than in the smaller case.   Ah, here is something you can try: let's 
> undersubscribe the memory bandwidth needs, i.e. run on say 16 processes per 
> node with 8 nodes and 16 processes per node with 64 nodes, and send the two 
> -log_view output files. I assume this is an LCRC machine and NOT a KNL system?
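> 
>    For example (a sketch, assuming a Slurm launcher; the executable and its 
> options are placeholders):
> 
>       srun -N 8  --ntasks-per-node=16 ./ws_test <options> -log_view > log_8nodes_16ppn.txt
>       srun -N 64 --ntasks-per-node=16 ./ws_test <options> -log_view > log_64nodes_16ppn.txt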
> 
>    Thanks
> 
> 
>    Barry
> 
> 
> > On Jun 9, 2018, at 8:29 AM, Mark Adams <mfad...@lbl.gov> wrote:
> > 
> > -pc_gamg_type classical
> > 
> > FYI, we only support smoothed aggregation "agg" (the default). (This thread 
> > started by saying you were using GAMG.)
> > 
> > It is not clear how much this will make a difference for you, but you don't 
> > want to use classical because we do not support it. It is meant as a 
> > reference implementation for developers.
> > 
> > First, how did you get the idea to use classical? If the documentation led 
> > you to believe this was a good thing to do, then we need to fix that!
> > 
> > Anyway, here is a generic input for GAMG:
> > 
> > -pc_type gamg 
> > -pc_gamg_type agg 
> > -pc_gamg_agg_nsmooths 1 
> > -pc_gamg_coarse_eq_limit 1000 
> > -pc_gamg_reuse_interpolation true 
> > -pc_gamg_square_graph 1 
> > -pc_gamg_threshold 0.05 
> > -pc_gamg_threshold_scale .0 
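> > 
> > (A sketch of one way to use these: put the options above in a file, say 
> > gamg.opts, and pass it with -options_file, which is equivalent to listing 
> > them on the command line.)
> > 
> >   mpiexec -n 216 ./ws_test -ksp_type cg -options_file gamg.opts -log_view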
> > 
> > 
> > 
> > 
> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
> > OK, I had thought that space was a typo. BTW, this option does not show up 
> > in -h.
> > I changed the number of ranks to use all cores on each node to avoid misleading 
> > ratios in -log_view. Since one node has 36 cores, I ran with 6^3=216 ranks 
> > and 12^3=1728 ranks. I also found the call counts of MatSOR etc. in the two 
> > tests were different, so they are not strict weak-scaling tests. I tried adding 
> > -ksp_max_it 6 -pc_mg_levels 6, but still could not make the two have 
> > the same MatSOR count. Anyway, I attached the load-balance output.
> > 
> > I find that PCApply_MG calls PCMGMCycle_Private, which is recursive and 
> > indirectly calls MatSOR_MPIAIJ. I believe the following code in 
> > MatSOR_MPIAIJ effectively synchronizes {MatSOR, MatMultAdd}_SeqAIJ between 
> > processes through VecScatter at each MG level. If SOR and MatMultAdd are 
> > imbalanced, the cost accumulates across the MG levels and shows up as a large 
> > VecScatter cost.
> > while (its--) {
> >   VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
> >   VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
> > 
> >   /* update rhs: bb1 = bb - B*x */
> >   VecScale(mat->lvec,-1.0);
> >   (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);
> > 
> >   /* local sweep */
> >   (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
> > }
> > 
> > 
> > 
> > --Junchao Zhang
> > 
> > On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > 
> > 
> > > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> > > 
> > > Searched but could not find this option, -mat_view::load_balance
> > 
> >    There is a space between "view" and the ":". load_balance is a 
> > particular viewer format that causes printing of load-balance 
> > information about the number of nonzeros in the matrix.
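> > 
> >    For example (a sketch; the executable and counts are placeholders):
> > 
> >       mpiexec -n 125 ./ws_test -nodes_per_proc 30 -mat_view ::load_balance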
> > 
> >    Barry
> > 
> > > 
> > > --Junchao Zhang
> > > 
> > > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <bsm...@mcs.anl.gov> 
> > > wrote:
> > >  So the only surprise in the results is the SOR. It is embarrassingly 
> > > parallel and normally one would not see a jump.
> > > 
> > >  The load balance for SOR time (1.5) is better at 1000 processes than at 
> > > 125 processes (2.1), not worse, so this number doesn't easily explain it.
> > > 
> > >  Could you run the 125 and 1000 with -mat_view ::load_balance and see 
> > > what you get out?
> > > 
> > >    Thanks
> > > 
> > >      Barry
> > > 
> > >  Notice that the MatSOR time jumps a lot, about 5 seconds, when -log_sync 
> > > is on. My only guess is that MatSOR is sharing memory bandwidth (or 
> > > some other resource? cores?) with the VecScatter, and for some reason this 
> > > is worse for 1000 cores, but I don't know why.
> > > 
> > > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
> > > > 
> > > > Hi, PETSc developers,
> > > >  I tested Michael Becker's code. The code calls the same KSPSolve 1000 
> > > > times in the second stage and needs a cubic number of processors to run. 
> > > > I ran with 125 ranks and 1000 ranks, with and without the -log_sync option. 
> > > > I attach the log_view output files and a scaling-loss Excel file.
> > > >  I profiled the code with 125 processors. It looks like {MatSOR, MatMult, 
> > > > MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took 
> > > > ~50% of the time; the other half was spent waiting in MPI.  
> > > > MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
> > > >  I tested it on a 36 cores/node machine. I found 32 ranks/node gave 
> > > > better performance (about 10%) than 36 ranks/node in the 125-rank 
> > > > test. I guess this is because the processors in the former case had more 
> > > > balanced memory bandwidth. I collected PAPI_DP_OPS (double-precision 
> > > > operations) and PAPI_TOT_CYC (total cycles) for the 125-rank case (see 
> > > > the attached files). It looks like ranks at the two ends have fewer DP_OPS 
> > > > and TOT_CYC. 
> > > >  Does anyone familiar with the algorithm have a quick explanation?
> > > > 
> > > > --Junchao Zhang
> > > > 
> > > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker 
> > > > <michael.bec...@physik.uni-giessen.de> wrote:
> > > > Hello again,
> > > > 
> > > > this took me longer than I anticipated, but here we go.
> > > > I did reruns of the cases where only half the processes per node were 
> > > > used (without -log_sync):
> > > > 
> > > >                     125 procs, 1st         125 procs, 2nd         1000 procs, 1st        1000 procs, 2nd
> > > >                     Max        Ratio       Max        Ratio       Max        Ratio       Max        Ratio
> > > > KSPSolve            1.203E+02  1.0         1.210E+02  1.0         1.399E+02  1.1         1.365E+02  1.0
> > > > VecTDot             6.376E+00  3.7         6.551E+00  4.0         7.885E+00  2.9         7.175E+00  3.4
> > > > VecNorm             4.579E+00  7.1         5.803E+00  10.2        8.534E+00  6.9         6.026E+00  4.9
> > > > VecScale            1.070E-01  2.1         1.129E-01  2.2         1.301E-01  2.5         1.270E-01  2.4
> > > > VecCopy             1.123E-01  1.3         1.149E-01  1.3         1.301E-01  1.6         1.359E-01  1.6
> > > > VecSet              7.063E-01  1.7         6.968E-01  1.7         7.432E-01  1.8         7.425E-01  1.8
> > > > VecAXPY             1.166E+00  1.4         1.167E+00  1.4         1.221E+00  1.5         1.279E+00  1.6
> > > > VecAYPX             1.317E+00  1.6         1.290E+00  1.6         1.536E+00  1.9         1.499E+00  2.0
> > > > VecScatterBegin     6.142E+00  3.2         5.974E+00  2.8         6.448E+00  3.0         6.472E+00  2.9
> > > > VecScatterEnd       3.606E+01  4.2         3.551E+01  4.0         5.244E+01  2.7         4.995E+01  2.7
> > > > MatMult             3.561E+01  1.6         3.403E+01  1.5         3.435E+01  1.4         3.332E+01  1.4
> > > > MatMultAdd          1.124E+01  2.0         1.130E+01  2.1         2.093E+01  2.9         1.995E+01  2.7
> > > > MatMultTranspose    1.372E+01  2.5         1.388E+01  2.6         1.477E+01  2.2         1.381E+01  2.1
> > > > MatSolve            1.949E-02  0.0         1.653E-02  0.0         4.789E-02  0.0         4.466E-02  0.0
> > > > MatSOR              6.610E+01  1.3         6.673E+01  1.3         7.111E+01  1.3         7.105E+01  1.3
> > > > MatResidual         2.647E+01  1.7         2.667E+01  1.7         2.446E+01  1.4         2.467E+01  1.5
> > > > PCSetUpOnBlocks     5.266E-03  1.4         5.295E-03  1.4         5.427E-03  1.5         5.289E-03  1.4
> > > > PCApply             1.031E+02  1.0         1.035E+02  1.0         1.180E+02  1.0         1.164E+02  1.0
> > > > 
> > > > I also slimmed down my code and basically wrote a simple weak-scaling 
> > > > test (source files attached) so you can profile it yourself. I 
> > > > appreciate the offer, Junchao, thank you.
> > > > You can adjust the system size per processor at runtime via 
> > > > "-nodes_per_proc 30" and the number of repeated calls to the function 
> > > > containing KSPSolve() via "-iterations 1000". The physical problem is 
> > > > simply calculating the electric potential from a homogeneous charge 
> > > > distribution, done multiple times to accumulate time in KSPSolve().
> > > > A job would be started using something like
> > > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 \
> > > >   -iterations 1000 \
> > > >   -ksp_rtol 1E-6 \
> > > >   -log_view -log_sync \
> > > >   -pc_type gamg -pc_gamg_type classical \
> > > >   -ksp_type cg \
> > > >   -ksp_norm_type unpreconditioned \
> > > >   -mg_levels_ksp_type richardson \
> > > >   -mg_levels_ksp_norm_type none \
> > > >   -mg_levels_pc_type sor \
> > > >   -mg_levels_ksp_max_it 1 \
> > > >   -mg_levels_pc_sor_its 1 \
> > > >   -mg_levels_esteig_ksp_type cg \
> > > >   -mg_levels_esteig_ksp_max_it 10 \
> > > >   -gamg_est_ksp_type cg
> > > > This should ideally be run on a cube number of processes, giving a cubical 
> > > > process grid.
> > > > Using 125 processes and 10,000 iterations I get the output in 
> > > > "log_view_125_new.txt", which shows the same imbalance for me.
> > > > Michael
> > > > 
> > > > 
> > > > Am 02.06.2018 um 13:40 schrieb Mark Adams:
> > > >> 
> > > >> 
> > > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczh...@mcs.anl.gov> 
> > > >> wrote:
> > > >> Hi, Michael,
> > > >>  You can add -log_sync besides -log_view; it adds barriers to 
> > > >> certain events but measures the barrier time separately from the events. I 
> > > >> find this option makes it easier to interpret log_view output.
> > > >> 
> > > >> That is great (good to know).
> > > >> 
> > > >> This should give us a better idea of whether your large VecScatter costs come 
> > > >> from slow communication or from catching some sort of load imbalance.
> > > >> 
> > > >> 
> > > >> --Junchao Zhang
> > > >> 
> > > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker 
> > > >> <michael.bec...@physik.uni-giessen.de> wrote:
> > > >> Barry: On its way. Could take a couple days again.
> > > >> 
> > > >> Junchao: I unfortunately don't have access to a cluster with a faster 
> > > >> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking 
> > > >> fat-tree network, which I realize causes parallel slowdown if the 
> > > >> nodes are not connected to the same switch. Each node has 24 
> > > >> processors (2x12/socket) and four NUMA domains (two for each socket).
> > > >> The ranks are usually not distributed perfectly evenly, i.e. for 125 
> > > >> processes, of the six required nodes, five would use 21 cores and one 
> > > >> 20.
> > > >> Would using another CPU type make a difference communication-wise? I 
> > > >> could switch to faster ones (on the same network), but I always 
> > > >> assumed this would only improve performance of the stuff that is 
> > > >> unrelated to communication.
> > > >> 
> > > >> Michael
> > > >> 
> > > >> 
> > > >> 
> > > >>> The log files have something like "Average time for zero size 
> > > >>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very 
> > > >>> slow network. A typical machine should give less than 1/10 of the 
> > > >>> latency you have. An easy thing to try is just running the code on a 
> > > >>> machine with a faster network and seeing what happens.
> > > >>> 
> > > >>> Also, how many cores and NUMA domains does a compute node have? I could 
> > > >>> not figure out how you distributed the 125 MPI ranks evenly.
> > > >>> 
> > > >>> --Junchao Zhang
> > > >>> 
> > > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker 
> > > >>> <michael.bec...@physik.uni-giessen.de> wrote:
> > > >>> Hello again,
> > > >>> 
> > > >>> here are the updated log_view files for 125 and 1000 processors. I 
> > > >>> ran both problems twice, the first time with all processors per node 
> > > >>> allocated ("-1.txt"), the second with only half on twice the number 
> > > >>> of nodes ("-2.txt").
> > > >>> 
> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker 
> > > >>>>> <michael.bec...@physik.uni-giessen.de>
> > > >>>>> wrote:
> > > >>>>> 
> > > >>>>> I noticed that for every individual KSP iteration, six vector 
> > > >>>>> objects are created and destroyed (with CG, more with e.g. GMRES). 
> > > >>>>> 
> > > >>>>   Hmm, it is certainly not intended that vectors be created and 
> > > >>>> destroyed within each KSPSolve(); could you please point us to the 
> > > >>>> code that makes you think they are being created and destroyed?   We 
> > > >>>> create all the work vectors in KSPSetUp() and destroy them in 
> > > >>>> KSPReset(), not during the solve. Not that this would be a measurable 
> > > >>>> difference.
> > > >>>> 
> > > >>> 
> > > >>> I mean this, right in the log_view output:
> > > >>> 
> > > >>>> Memory usage is given in bytes:
> > > >>>> 
> > > >>>> Object Type Creations Destructions Memory Descendants' Mem.
> > > >>>> Reports information only for process 0.
> > > >>>> 
> > > >>>> --- Event Stage 0: Main Stage
> > > >>>> 
> > > >>>> ...
> > > >>>> 
> > > >>>> --- Event Stage 1: First Solve
> > > >>>> 
> > > >>>> ...
> > > >>>> 
> > > >>>> --- Event Stage 2: Remaining Solves
> > > >>>> 
> > > >>>> Vector 23904 23904 1295501184 0.
> > > >>> I logged the exact number of KSP iterations over the 999 timesteps, 
> > > >>> and it is exactly 23904/6 = 3984.
> > > >>> Michael
> > > >>> 
> > > >>> 
> > > >>> Am 24.05.2018 um 19:50 schrieb Smith, Barry F.:
> > > >>>> 
> > > >>>>  Please send the log file for 1000 with cg as the solver.
> > > >>>> 
> > > >>>>   You should make a bar chart of each event for the two cases to see 
> > > >>>> which ones are taking more time and which are taking less (we cannot 
> > > >>>> tell from the two logs you sent us since they are for different 
> > > >>>> solvers).
> > > >>>> 
> > > >>>> 
> > > >>>> 
> > > >>>> 
> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker 
> > > >>>>> <michael.bec...@physik.uni-giessen.de>
> > > >>>>> wrote:
> > > >>>>> 
> > > >>>>> I noticed that for every individual KSP iteration, six vector 
> > > >>>>> objects are created and destroyed (with CG, more with e.g. GMRES). 
> > > >>>>> 
> > > >>>>   Hmm, it is certainly not intended that vectors be created and 
> > > >>>> destroyed within each KSPSolve(); could you please point us to the 
> > > >>>> code that makes you think they are being created and destroyed?   We 
> > > >>>> create all the work vectors in KSPSetUp() and destroy them in 
> > > >>>> KSPReset(), not during the solve. Not that this would be a measurable 
> > > >>>> difference.
> > > >>>> 
> > > >>>> 
> > > >>>> 
> > > >>>> 
> > > >>>>> This seems kind of wasteful; is it supposed to be like this? Is 
> > > >>>>> this even the reason for my problems? Apart from that, everything 
> > > >>>>> seems quite normal to me (but I'm not the expert here).
> > > >>>>> 
> > > >>>>> 
> > > >>>>> Thanks in advance.
> > > >>>>> 
> > > >>>>> Michael
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> <log_view_125procs.txt><log_view_1000procs.txt>
> > > >>>>> 
> > > >>> 
> > > >>> 
> > > >> 
> > > >> 
> > > >> 
> > > > 
> > > > 
> > > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
> > > 
> > > 
> > 
> > 
> > 
> 
> 
> 
> 
> 
> 
> <o-wstest-periodic-36x6.txt><o-wstest-gamg-agg-repartition-36x6.txt>
