Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-26 Thread Kong, Fande
First of all, "-pc_type hypre" does not use the "-mg_levels_*" options, so
the parameters you set there have no effect.

And hypre with the default parameters set by PETSc does not scale well, as
far as we know. We find the following parameters to be important:

-pc_hypre_boomeramg_strong_threshold

-pc_hypre_boomeramg_max_levels

-pc_hypre_boomeramg_coarsen_type

-pc_hypre_boomeramg_agg_nl

-pc_hypre_boomeramg_agg_num_paths

You can consult the hypre user manual for more details on these
parameters.
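
For a 3D 7-point-stencil problem, a starting point might look like the
following (an illustrative sketch only, not tuned for your problem; the
values here are just examples):

-pc_type hypre
-pc_hypre_type boomeramg
-pc_hypre_boomeramg_strong_threshold 0.5
-pc_hypre_boomeramg_max_levels 25
-pc_hypre_boomeramg_coarsen_type HMIS
-pc_hypre_boomeramg_agg_nl 1
-pc_hypre_boomeramg_agg_num_paths 2

Note that these replace the "-mg_levels_*" options rather than accompany them.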

Fande,

On Tue, Jun 26, 2018 at 2:25 PM, Junchao Zhang  wrote:

> Mark,
>   I redid the -pc_type hypre experiment without OpenMP.  Now the job
> finishes instead of running out of time. I have results with 216 processors
> (see below). The 1728-processor job is still in the queue, so I don't know
> how it scales. But for the 216-processor one, the execution time is 245
> seconds. With -pc_type gamg, the time is 107 seconds.  My options are
>
> -ksp_norm_type unpreconditioned
> -ksp_rtol 1E-6
> -ksp_type cg
> -log_view
> -mesh_size 1E-4
> -mg_levels_esteig_ksp_max_it 10
> -mg_levels_esteig_ksp_type cg
> -mg_levels_ksp_max_it 1
> -mg_levels_ksp_norm_type none
> -mg_levels_ksp_type richardson
> -mg_levels_pc_sor_its 1
> -mg_levels_pc_type sor
> -nodes_per_proc 30
> -pc_type hypre
>
>
> It is a 7-point stencil code. Do you know other hypre options that I can
> try to improve it?  Thanks.
>
> --- Event Stage 2: Remaining Solves
>
> KSPSolve1000 1.0 2.4574e+02 1.0 4.48e+09 1.0 7.6e+06 7.2e+03
> 2.0e+04 97100100100100 100100100100100  3928
> VecTDot12000 1.0 6.5646e+00 2.2 6.48e+08 1.0 0.0e+00 0.0e+00
> 1.2e+04  2 14  0  0 60   2 14  0  0 60 21321
> VecNorm 8000 1.0 9.7144e-01 1.2 4.32e+08 1.0 0.0e+00 0.0e+00
> 8.0e+03  0 10  0  0 40   0 10  0  0 40 96055
> VecCopy 1000 1.0 7.9706e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecSet  6000 1.0 1.7941e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> VecAXPY12000 1.0 7.5738e-01 1.2 6.48e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0 14  0  0  0   0 14  0  0  0 184806
> VecAYPX 6000 1.0 4.6802e-01 1.3 2.97e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  0  7  0  0  0   0  7  0  0  0 137071
> VecScatterBegin 7000 1.0 4.7924e-01 2.3 0.00e+00 0.0 7.6e+06 7.2e+03
> 0.0e+00  0  0100100  0   0  0100100  0 0
> VecScatterEnd   7000 1.0 7.9303e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatMult 7000 1.0 6.0762e+00 1.1 2.46e+09 1.0 7.6e+06 7.2e+03
> 0.0e+00  2 55100100  0   2 55100100  0 86894
> PCApply 6000 1.0 2.3429e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 92  0  0  0  0  95  0  0  0  0 0
>
>
> --Junchao Zhang
>
> On Thu, Jun 14, 2018 at 5:45 PM, Junchao Zhang 
> wrote:
>
>> I tested -pc_gamg_repartition with 216 processors again. First I tested
>> with these options
>>
>> -log_view \
>> -ksp_rtol 1E-6 \
>> -ksp_type cg \
>> -ksp_norm_type unpreconditioned \
>> -mg_levels_ksp_type richardson \
>> -mg_levels_ksp_norm_type none \
>> -mg_levels_pc_type sor \
>> -mg_levels_ksp_max_it 1 \
>> -mg_levels_pc_sor_its 1 \
>> -mg_levels_esteig_ksp_type cg \
>> -mg_levels_esteig_ksp_max_it 10 \
>> -pc_type gamg \
>> -pc_gamg_type agg \
>> -pc_gamg_threshold 0.05 \
>> -pc_gamg_type classical \
>> -gamg_est_ksp_type cg \
>> -pc_gamg_square_graph 10 \
>> -pc_gamg_threshold 0.0
>>
>>
>> Then I tested with an extra -pc_gamg_repartition. With repartition, the
>> time increased from 120s to 140s.  The code measures the first KSPSolve and
>> the remaining solves in separate stages, so the repartition time was not
>> counted in the stage of interest. Actually, -log_view says the GAMG
>> repartition time (in the first event stage) is about 1.5 sec., so it is not
>> a big deal. I also tested -pc_gamg_square_graph 4. It did not change the
>> time.
>> I tested hypre with options "-log_view -ksp_rtol 1E-6 -ksp_type cg
>> -ksp_norm_type unpreconditioned -pc_type hypre" and nothing else. The code
>> ran out of time. In old tests, a job (1000 KSPSolves with 7 KSP iterations
>> each) took 4 minutes. With hypre, one KSPSolve with 6 KSP iterations takes
>> 6 minutes.
>> I will test and profile the code on a single node, and apply some
>> vecscatter optimizations I recently did to see what happens.
>>
>>
>> --Junchao Zhang
>>
>> On Thu, Jun 14, 2018 at 11:03 AM, Mark Adams  wrote:
>>
>>> And with 7-point stencils and no large material discontinuities you
>>> probably want -pc_gamg_square_graph 10 -pc_gamg_threshold 0.0, and you
>>> could test the square graph parameter (e.g., 1, 2, 3, 4).
>>>
>>> And I would definitely test hypre.
>>>
>>> On Thu, Jun 14, 2018 at 8:54 AM Mark Adams  wrote:
>>>

> Just -pc_type hypre instead of -pc_type gamg.
>
>
 And you need to have configured PETSc with hypre.


>>>
>>
>


Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-26 Thread Junchao Zhang
Mark,
  I redid the -pc_type hypre experiment without OpenMP.  Now the job
finishes instead of running out of time. I have results with 216 processors
(see below). The 1728-processor job is still in the queue, so I don't know
how it scales. But for the 216-processor one, the execution time is 245
seconds. With -pc_type gamg, the time is 107 seconds.  My options are

-ksp_norm_type unpreconditioned
-ksp_rtol 1E-6
-ksp_type cg
-log_view
-mesh_size 1E-4
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-nodes_per_proc 30
-pc_type hypre


It is a 7-point stencil code. Do you know other hypre options that I can
try to improve it?  Thanks.

--- Event Stage 2: Remaining Solves

KSPSolve1000 1.0 2.4574e+02 1.0 4.48e+09 1.0 7.6e+06 7.2e+03
2.0e+04 97100100100100 100100100100100  3928
VecTDot12000 1.0 6.5646e+00 2.2 6.48e+08 1.0 0.0e+00 0.0e+00
1.2e+04  2 14  0  0 60   2 14  0  0 60 21321
VecNorm 8000 1.0 9.7144e-01 1.2 4.32e+08 1.0 0.0e+00 0.0e+00
8.0e+03  0 10  0  0 40   0 10  0  0 40 96055
VecCopy 1000 1.0 7.9706e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecSet  6000 1.0 1.7941e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY12000 1.0 7.5738e-01 1.2 6.48e+08 1.0 0.0e+00 0.0e+00
0.0e+00  0 14  0  0  0   0 14  0  0  0 184806
VecAYPX 6000 1.0 4.6802e-01 1.3 2.97e+08 1.0 0.0e+00 0.0e+00
0.0e+00  0  7  0  0  0   0  7  0  0  0 137071
VecScatterBegin 7000 1.0 4.7924e-01 2.3 0.00e+00 0.0 7.6e+06 7.2e+03
0.0e+00  0  0100100  0   0  0100100  0 0
VecScatterEnd   7000 1.0 7.9303e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult 7000 1.0 6.0762e+00 1.1 2.46e+09 1.0 7.6e+06 7.2e+03
0.0e+00  2 55100100  0   2 55100100  0 86894
PCApply 6000 1.0 2.3429e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 92  0  0  0  0  95  0  0  0  0 0


--Junchao Zhang

On Thu, Jun 14, 2018 at 5:45 PM, Junchao Zhang  wrote:

> I tested -pc_gamg_repartition with 216 processors again. First I tested
> with these options
>
> -log_view \
> -ksp_rtol 1E-6 \
> -ksp_type cg \
> -ksp_norm_type unpreconditioned \
> -mg_levels_ksp_type richardson \
> -mg_levels_ksp_norm_type none \
> -mg_levels_pc_type sor \
> -mg_levels_ksp_max_it 1 \
> -mg_levels_pc_sor_its 1 \
> -mg_levels_esteig_ksp_type cg \
> -mg_levels_esteig_ksp_max_it 10 \
> -pc_type gamg \
> -pc_gamg_type agg \
> -pc_gamg_threshold 0.05 \
> -pc_gamg_type classical \
> -gamg_est_ksp_type cg \
> -pc_gamg_square_graph 10 \
> -pc_gamg_threshold 0.0
>
>
> Then I tested with an extra -pc_gamg_repartition. With repartition, the
> time increased from 120s to 140s.  The code measures the first KSPSolve and
> the remaining solves in separate stages, so the repartition time was not
> counted in the stage of interest. Actually, -log_view says the GAMG
> repartition time (in the first event stage) is about 1.5 sec., so it is not
> a big deal. I also tested -pc_gamg_square_graph 4. It did not change the
> time.
> I tested hypre with options "-log_view -ksp_rtol 1E-6 -ksp_type cg
> -ksp_norm_type unpreconditioned -pc_type hypre" and nothing else. The code
> ran out of time. In old tests, a job (1000 KSPSolves with 7 KSP iterations
> each) took 4 minutes. With hypre, one KSPSolve with 6 KSP iterations takes
> 6 minutes.
> I will test and profile the code on a single node, and apply some
> vecscatter optimizations I recently did to see what happens.
>
>
> --Junchao Zhang
>
> On Thu, Jun 14, 2018 at 11:03 AM, Mark Adams  wrote:
>
>> And with 7-point stencils and no large material discontinuities you
>> probably want -pc_gamg_square_graph 10 -pc_gamg_threshold 0.0, and you
>> could test the square graph parameter (e.g., 1, 2, 3, 4).
>>
>> And I would definitely test hypre.
>>
>> On Thu, Jun 14, 2018 at 8:54 AM Mark Adams  wrote:
>>
>>>
 Just -pc_type hypre instead of -pc_type gamg.


>>> And you need to have configured PETSc with hypre.
>>>
>>>
>>
>


Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-14 Thread Junchao Zhang
I tested -pc_gamg_repartition with 216 processors again. First I tested
with these options

-log_view \
-ksp_rtol 1E-6 \
-ksp_type cg \
-ksp_norm_type unpreconditioned \
-mg_levels_ksp_type richardson \
-mg_levels_ksp_norm_type none \
-mg_levels_pc_type sor \
-mg_levels_ksp_max_it 1 \
-mg_levels_pc_sor_its 1 \
-mg_levels_esteig_ksp_type cg \
-mg_levels_esteig_ksp_max_it 10 \
-pc_type gamg \
-pc_gamg_type agg \
-pc_gamg_threshold 0.05 \
-pc_gamg_type classical \
-gamg_est_ksp_type cg \
-pc_gamg_square_graph 10 \
-pc_gamg_threshold 0.0


Then I tested with an extra -pc_gamg_repartition. With repartition, the
time increased from 120s to 140s.  The code measures the first KSPSolve and
the remaining solves in separate stages, so the repartition time was not
counted in the stage of interest. Actually, -log_view says the GAMG
repartition time (in the first event stage) is about 1.5 sec., so it is not
a big deal. I also tested -pc_gamg_square_graph 4. It did not change the time.
I tested hypre with options "-log_view -ksp_rtol 1E-6 -ksp_type cg
-ksp_norm_type unpreconditioned -pc_type hypre" and nothing else. The code
ran out of time. In old tests, a job (1000 KSPSolves with 7 KSP iterations
each) took 4 minutes. With hypre, one KSPSolve with 6 KSP iterations takes
6 minutes.
I will test and profile the code on a single node, and apply some
vecscatter optimizations I recently did to see what happens.


--Junchao Zhang

On Thu, Jun 14, 2018 at 11:03 AM, Mark Adams  wrote:

> And with 7-point stencils and no large material discontinuities you
> probably want -pc_gamg_square_graph 10 -pc_gamg_threshold 0.0, and you
> could test the square graph parameter (e.g., 1, 2, 3, 4).
>
> And I would definitely test hypre.
>
> On Thu, Jun 14, 2018 at 8:54 AM Mark Adams  wrote:
>
>>
>>> Just -pc_type hypre instead of -pc_type gamg.
>>>
>>>
>> And you need to have configured PETSc with hypre.
>>
>>
>


Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-14 Thread Mark Adams
And with 7-point stencils and no large material discontinuities you
probably want -pc_gamg_square_graph 10 -pc_gamg_threshold 0.0, and you could
test the square graph parameter (e.g., 1, 2, 3, 4).
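
For example, keeping everything else fixed and varying only the squaring depth
(an illustrative sketch, not tuned for your problem):

-pc_type gamg -pc_gamg_type agg -pc_gamg_threshold 0.0 -pc_gamg_square_graph 1
-pc_type gamg -pc_gamg_type agg -pc_gamg_threshold 0.0 -pc_gamg_square_graph 2
-pc_type gamg -pc_gamg_type agg -pc_gamg_threshold 0.0 -pc_gamg_square_graph 4
-pc_type gamg -pc_gamg_type agg -pc_gamg_threshold 0.0 -pc_gamg_square_graph 10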

And I would definitely test hypre.

On Thu, Jun 14, 2018 at 8:54 AM Mark Adams  wrote:

>
>> Just -pc_type hypre instead of -pc_type gamg.
>>
>>
> And you need to have configured PETSc with hypre.
>
>


Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-14 Thread Mark Adams
>
>
> Just -pc_type hypre instead of -pc_type gamg.
>
>
And you need to have configured PETSc with hypre.


Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-14 Thread Mark Adams
On Wed, Jun 13, 2018 at 11:09 AM Junchao Zhang  wrote:

> Mark,
>   Yes, it is 7-point stencil. I tried your options, -pc_gamg_square_graph
> 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition,
>

I trust you did these separately. -pc_gamg_repartition will have a larger
setup time, but this cost can be amortized in most real applications.  If you
were just testing the run time of one linear solve, then you probably want to
look at KSPSolve only.


> and found they increased the time. I did not try hypre since I don't know
> how to set its options.
>

Just -pc_type hypre instead of -pc_type gamg.


>   I also tried periodic boundary conditions and ran it with -mat_view
> ::load_balance. It gives fewer KSP iterations, but PETSc still reports
> load imbalance at coarse levels.
>
>
> --Junchao Zhang
>
> On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams  wrote:
>
>> This all looks reasonable to me. The VecScatters times are a little high
>> but these are fast little solves (.2 seconds each).
>>
>> The RAP times are very low, suggesting we could optimize parameters a bit
>> and reduce the iteration count. These are 7 point stencils as I recall. You
>> could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
>> '-pc_gamg_threshold 0.0'.  You could also test hypre.
>>
>> And you should be able to improve coarse grid load imbalance with
>> -pc_gamg_repartition.
>>
>> Mark
>>
>> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang 
>> wrote:
>>
>>> Mark,
>>>   I tried "-pc_gamg_type agg ..." options you mentioned, and also
>>> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
>>> execution time I called KSPSolve 100 times instead of 1000, and also
>>> used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
>>> -log_sync. From there you can see a lot of time is spent on VecNormBarrier,
>>> which implies load imbalance. Note VecScatterBarrie time is misleading,
>>> since it barriers ALL ranks, but in reality VecScatter sort of syncs in
>>> a small neighborhood.
>>>   Barry suggested trying periodic boundary condition so that the
>>> nonzeros are perfectly balanced across processes. I will try that to see
>>> what happens.
>>>
>>> --Junchao Zhang
>>>
>>> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:
>>>


 On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang 
 wrote:

> I used an LCRC machine named Bebop. I tested on its Intel Broadwell
> nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
> 36 cores in a node or 18 cores in a node.  As you can see, 18 cores/node
> gave much better performance, which is reasonable as routines like MatSOR,
> MatMult, MatMultAdd are all bandwidth bound.
>
> The code uses a DMDA 3D grid, 7-point stencil, and defines
> nodes(vertices) at the surface or second to the surface as boundary nodes.
> Boundary nodes only have a diagonal one in their row in the matrix.
> Interior nodes have 7 nonzeros in their row. Boundary processors in the
> processor grid has less nonzero. This is one source of load-imbalance. 
> Will
> load-imbalance get severer at coarser grids of an MG level?
>

 Yes.

 You can use a simple Jacobi solver to see the basic performance of your
 operator and machine. Do you see as much time spent in Vec Scatters?
 VecAXPY? etc.


>
> I attach a trace view figure that show activity of each ranks along
> the time axis in one KSPSove. White color means MPI wait. You can see
> white takes a large space.
>
> I don't have a good explanation why at large scale (1728 cores),
> processors wait longer time, as the communication pattern is still 7-point
> stencil in a cubic processor gird.
>
> --Junchao Zhang
>
> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
> wrote:
>
>>
>>   Junchao,
>>
>>   Thanks, the load balance of matrix entries is remarkably
>> similar for the two runs so it can't be a matter of worse work load
>> imbalance for SOR for the larger case explaining why the SOR takes more
>> time.
>>
>>   Here is my guess (and I know no way to confirm it). In the
>> smaller case the overlap of different processes on the same node running
>> SOR at the same time is lower than the larger case hence the larger case 
>> is
>> slower because there are more SOR processes fighting over the same memory
>> bandwidth at the same time than in the smaller case.   Ahh, here is
>> something you can try, lets undersubscribe the memory bandwidth needs, 
>> run
>> on say 16 processes per node with 8 nodes and 16 processes per node with 
>> 64
>> nodes and send the two -log_view output files. I assume this is an LCRC
>> machine and NOT a KNL system?
>>
>>Thanks
>>
>>
>>Barry
>>
>>
>> > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
>> >
>> > 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-14 Thread Junchao Zhang
On Wed, Jun 13, 2018 at 1:42 PM, Smith, Barry F.  wrote:

>
>
> > On Jun 13, 2018, at 1:09 PM, Zhang, Junchao  wrote:
> >
> > Mark,
> >   Yes, it is 7-point stencil. I tried your options,
> -pc_gamg_square_graph 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition, and
> found they increased the time. I did not try hypre since I don't know how
> to set its options.
> >   I also tried periodic boundary condition and ran it with -mat_view
> ::load_balance. It gives fewer KSP iterations and but PETSc still reports
> load
> > imbalance at coarse levels.
>
>Is the overall scaling better, worse, or the same for periodic boundary
> conditions?
>
 OK, the queued job finally finished. It was worse with periodic boundary
conditions. But the two runs have different KSP iteration and MatSOR call
counts, so the answer is not solid. See the attached figure for the difference.



>
> >
> >
> > --Junchao Zhang
> >
> > On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams  wrote:
> > This all looks reasonable to me. The VecScatters times are a little high
> but these are fast little solves (.2 seconds each).
> >
> > The RAP times are very low, suggesting we could optimize parameters a
> bit and reduce the iteration count. These are 7 point stencils as I recall.
> You could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
> '-pc_gamg_threshold 0.0'.  You could also test hypre.
> >
> > And you should be able to improve coarse grid load imbalance with
> -pc_gamg_repartition.
> >
> > Mark
> >
> > On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang 
> wrote:
> > Mark,
> >   I tried "-pc_gamg_type agg ..." options you mentioned, and also
> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
> execution time I called KSPSolve 100 times instead of 1000, and also used
> -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
> -log_sync. From there you can see a lot of time is spent on VecNormBarrier,
> which implies load imbalance. Note VecScatterBarrie time is misleading,
> since it barriers ALL ranks, but in reality VecScatter sort of syncs in a
> small neighborhood.
> >   Barry suggested trying periodic boundary condition so that the
> nonzeros are perfectly balanced across processes. I will try that to see
> what happens.
> >
> > --Junchao Zhang
> >
> > On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:
> >
> >
> > On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang 
> wrote:
> > I used an LCRC machine named Bebop. I tested on its Intel Broadwell
> nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
> 36 cores in a node or 18 cores in a node.  As you can see, 18 cores/node
> gave much better performance, which is reasonable as routines like MatSOR,
> MatMult, MatMultAdd are all bandwidth bound.
> >
> > The code uses a DMDA 3D grid, 7-point stencil, and defines
> nodes(vertices) at the surface or second to the surface as boundary nodes.
> Boundary nodes only have a diagonal one in their row in the matrix.
> Interior nodes have 7 nonzeros in their row. Boundary processors in the
> processor grid has less nonzero. This is one source of load-imbalance. Will
> load-imbalance get severer at coarser grids of an MG level?
> >
> > Yes.
> >
> > You can use a simple Jacobi solver to see the basic performance of your
> operator and machine. Do you see as much time spent in Vec Scatters?
> VecAXPY? etc.
> >
> >
> > I attach a trace view figure that show activity of each ranks along the
> time axis in one KSPSove. White color means MPI wait. You can see white
> takes a large space.
> >
> > I don't have a good explanation why at large scale (1728 cores),
> processors wait longer time, as the communication pattern is still 7-point
> stencil in a cubic processor gird.
> >
> > --Junchao Zhang
> >
> > On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
> wrote:
> >
> >   Junchao,
> >
> >   Thanks, the load balance of matrix entries is remarkably similar
> for the two runs so it can't be a matter of worse work load imbalance for
> SOR for the larger case explaining why the SOR takes more time.
> >
> >   Here is my guess (and I know no way to confirm it). In the smaller
> case the overlap of different processes on the same node running SOR at the
> same time is lower than the larger case hence the larger case is slower
> because there are more SOR processes fighting over the same memory
> bandwidth at the same time than in the smaller case.   Ahh, here is
> something you can try, lets undersubscribe the memory bandwidth needs, run
> on say 16 processes per node with 8 nodes and 16 processes per node with 64
> nodes and send the two -log_view output files. I assume this is an LCRC
> machine and NOT a KNL system?
> >
> >Thanks
> >
> >
> >Barry
> >
> >
> > > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
> > >
> > > -pc_gamg_type classical
> > >
> > > FYI, we only support smoothed aggregation "agg" (the default). (This
> thread started by saying you were using GAMG.)
> > >
> > > It is not 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-13 Thread Smith, Barry F.



> On Jun 13, 2018, at 1:09 PM, Zhang, Junchao  wrote:
> 
> Mark,
>   Yes, it is 7-point stencil. I tried your options, -pc_gamg_square_graph 0 
> -pc_gamg_threshold 0.0 -pc_gamg_repartition, and found they increased the 
> time. I did not try hypre since I don't know how to set its options.
>   I also tried periodic boundary condition and ran it with -mat_view 
> ::load_balance. It gives fewer KSP iterations and but PETSc still reports load
> imbalance at coarse levels.

   Is the overall scaling better, worse, or the same for periodic boundary 
conditions?

>   
> 
> --Junchao Zhang
> 
> On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams  wrote:
> This all looks reasonable to me. The VecScatters times are a little high but 
> these are fast little solves (.2 seconds each).
> 
> The RAP times are very low, suggesting we could optimize parameters a bit and 
> reduce the iteration count. These are 7 point stencils as I recall. You could 
> try -pc_gamg_square_graph 0 (instead of 1) and you probably want 
> '-pc_gamg_threshold 0.0'.  You could also test hypre.
> 
> And you should be able to improve coarse grid load imbalance with 
> -pc_gamg_repartition.
> 
> Mark
> 
> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang  wrote:
> Mark,
>   I tried "-pc_gamg_type agg ..." options you mentioned, and also -ksp_type 
> cg + PETSc's default PC bjacobi. In the latter case, to reduce execution time 
> I called KSPSolve 100 times instead of 1000, and also used -ksp_max_it 100. 
> In the 36x48=1728 ranks case, I also did a test with -log_sync. From there 
> you can see a lot of time is spent on VecNormBarrier, which implies load 
> imbalance. Note VecScatterBarrie time is misleading, since it barriers ALL 
> ranks, but in reality VecScatter sort of syncs in a small neighborhood.
>   Barry suggested trying periodic boundary condition so that the nonzeros are 
> perfectly balanced across processes. I will try that to see what happens.
> 
> --Junchao Zhang
> 
> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:
> 
> 
> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang  wrote:
> I used an LCRC machine named Bebop. I tested on its Intel Broadwell nodes. 
> Each nodes has 2 CPUs and 36 cores in total. I collected data using 36 cores 
> in a node or 18 cores in a node.  As you can see, 18 cores/node gave much 
> better performance, which is reasonable as routines like MatSOR, MatMult, 
> MatMultAdd are all bandwidth bound.
> 
> The code uses a DMDA 3D grid, 7-point stencil, and defines nodes(vertices) at 
> the surface or second to the surface as boundary nodes. Boundary nodes only 
> have a diagonal one in their row in the matrix. Interior nodes have 7 
> nonzeros in their row. Boundary processors in the processor grid has less 
> nonzero. This is one source of load-imbalance. Will load-imbalance get 
> severer at coarser grids of an MG level?
> 
> Yes. 
> 
> You can use a simple Jacobi solver to see the basic performance of your 
> operator and machine. Do you see as much time spent in Vec Scatters? VecAXPY? 
> etc.
>  
> 
> I attach a trace view figure that show activity of each ranks along the time 
> axis in one KSPSove. White color means MPI wait. You can see white takes a 
> large space. 
> 
> I don't have a good explanation why at large scale (1728 cores), processors 
> wait longer time, as the communication pattern is still 7-point stencil in a 
> cubic processor gird.
> 
> --Junchao Zhang
> 
> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F.  wrote:
> 
>   Junchao,
> 
>   Thanks, the load balance of matrix entries is remarkably similar for 
> the two runs so it can't be a matter of worse work load imbalance for SOR for 
> the larger case explaining why the SOR takes more time. 
> 
>   Here is my guess (and I know no way to confirm it). In the smaller case 
> the overlap of different processes on the same node running SOR at the same 
> time is lower than the larger case hence the larger case is slower because 
> there are more SOR processes fighting over the same memory bandwidth at the 
> same time than in the smaller case.   Ahh, here is something you can try, 
> lets undersubscribe the memory bandwidth needs, run on say 16 processes per 
> node with 8 nodes and 16 processes per node with 64 nodes and send the two 
> -log_view output files. I assume this is an LCRC machine and NOT a KNL system?
> 
>Thanks
> 
> 
>Barry
> 
> 
> > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
> > 
> > -pc_gamg_type classical
> > 
> > FYI, we only support smoothed aggregation "agg" (the default). (This thread 
> > started by saying you were using GAMG.)
> > 
> > It is not clear how much this will make a difference for you, but you don't 
> > want to use classical because we do not support it. It is meant as a 
> > reference implementation for developers.
> > 
> > First, how did you get the idea to use classical? If the documentation lead 
> > you to believe this was a good thing to do then we need to fix that!
> > 
> > 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-13 Thread Junchao Zhang
Mark,
  Yes, it is 7-point stencil. I tried your options,
-pc_gamg_square_graph 0 -pc_gamg_threshold
0.0 -pc_gamg_repartition, and found they increased the time. I did not try
hypre since I don't know how to set its options.
  I also tried periodic boundary conditions and ran it with -mat_view
::load_balance. It gives fewer KSP iterations, but PETSc still reports
load imbalance at coarse levels.


--Junchao Zhang

On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams  wrote:

> This all looks reasonable to me. The VecScatters times are a little high
> but these are fast little solves (.2 seconds each).
>
> The RAP times are very low, suggesting we could optimize parameters a bit
> and reduce the iteration count. These are 7 point stencils as I recall. You
> could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
> '-pc_gamg_threshold 0.0'.  You could also test hypre.
>
> And you should be able to improve coarse grid load imbalance with
> -pc_gamg_repartition.
>
> Mark
>
> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang 
> wrote:
>
>> Mark,
>>   I tried "-pc_gamg_type agg ..." options you mentioned, and also
>> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
>> execution time I called KSPSolve 100 times instead of 1000, and also
>> used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
>> -log_sync. From there you can see a lot of time is spent on VecNormBarrier,
>> which implies load imbalance. Note VecScatterBarrie time is misleading,
>> since it barriers ALL ranks, but in reality VecScatter sort of syncs in
>> a small neighborhood.
>>   Barry suggested trying periodic boundary condition so that the nonzeros
>> are perfectly balanced across processes. I will try that to see what
>> happens.
>>
>> --Junchao Zhang
>>
>> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:
>>
>>>
>>>
>>> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang 
>>> wrote:
>>>
 I used an LCRC machine named Bebop. I tested on its Intel Broadwell
 nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
 36 cores in a node or 18 cores in a node.  As you can see, 18 cores/node
 gave much better performance, which is reasonable as routines like MatSOR,
 MatMult, MatMultAdd are all bandwidth bound.

 The code uses a DMDA 3D grid, 7-point stencil, and defines
 nodes(vertices) at the surface or second to the surface as boundary nodes.
 Boundary nodes only have a diagonal one in their row in the matrix.
 Interior nodes have 7 nonzeros in their row. Boundary processors in the
 processor grid has less nonzero. This is one source of load-imbalance. Will
 load-imbalance get severer at coarser grids of an MG level?

>>>
>>> Yes.
>>>
>>> You can use a simple Jacobi solver to see the basic performance of your
>>> operator and machine. Do you see as much time spent in Vec Scatters?
>>> VecAXPY? etc.
>>>
>>>

 I attach a trace view figure that show activity of each ranks along the
 time axis in one KSPSove. White color means MPI wait. You can see
 white takes a large space.

 I don't have a good explanation why at large scale (1728 cores),
 processors wait longer time, as the communication pattern is still 7-point
 stencil in a cubic processor gird.

 --Junchao Zhang

 On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
 wrote:

>
>   Junchao,
>
>   Thanks, the load balance of matrix entries is remarkably similar
> for the two runs so it can't be a matter of worse work load imbalance for
> SOR for the larger case explaining why the SOR takes more time.
>
>   Here is my guess (and I know no way to confirm it). In the
> smaller case the overlap of different processes on the same node running
> SOR at the same time is lower than the larger case hence the larger case 
> is
> slower because there are more SOR processes fighting over the same memory
> bandwidth at the same time than in the smaller case.   Ahh, here is
> something you can try, lets undersubscribe the memory bandwidth needs, run
> on say 16 processes per node with 8 nodes and 16 processes per node with 
> 64
> nodes and send the two -log_view output files. I assume this is an LCRC
> machine and NOT a KNL system?
>
>Thanks
>
>
>Barry
>
>
> > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
> >
> > -pc_gamg_type classical
> >
> > FYI, we only support smoothed aggregation "agg" (the default). (This
> thread started by saying you were using GAMG.)
> >
> > It is not clear how much this will make a difference for you, but
> you don't want to use classical because we do not support it. It is meant
> as a reference implementation for developers.
> >
> > First, how did you get the idea to use classical? If the
> documentation lead you to believe this was a good 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-12 Thread Mark Adams
This all looks reasonable to me. The VecScatter times are a little high,
but these are fast little solves (0.2 seconds each).

The RAP times are very low, suggesting we could optimize parameters a bit
and reduce the iteration count. These are 7-point stencils as I recall. You
could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
'-pc_gamg_threshold 0.0'.  You could also test hypre.

And you should be able to improve coarse grid load imbalance with
-pc_gamg_repartition.
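
To make that concrete, the combination being suggested would look roughly like
this sketch (values are illustrative, not tuned):

-ksp_type cg -ksp_rtol 1e-6
-pc_type gamg -pc_gamg_type agg
-pc_gamg_square_graph 0
-pc_gamg_threshold 0.0
-pc_gamg_repartition

Adding -ksp_view will print the multigrid hierarchy, so you can also check how
many levels you get and how large the coarse grids are.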

Mark

On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang  wrote:

> Mark,
>   I tried "-pc_gamg_type agg ..." options you mentioned, and also
> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
> execution time I called KSPSolve 100 times instead of 1000, and also
> used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
> -log_sync. From there you can see a lot of time is spent on VecNormBarrier,
> which implies load imbalance. Note the VecScatterBarrier time is misleading,
> since it barriers ALL ranks, but in reality VecScatter sort of syncs in a
> small neighborhood.
>   Barry suggested trying periodic boundary condition so that the nonzeros
> are perfectly balanced across processes. I will try that to see what
> happens.
>
> --Junchao Zhang
>
> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:
>
>>
>>
>> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang 
>> wrote:
>>
>>> I used an LCRC machine named Bebop. I tested on its Intel Broadwell
>>> nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
>>> 36 cores in a node or 18 cores in a node.  As you can see, 18 cores/node
>>> gave much better performance, which is reasonable as routines like MatSOR,
>>> MatMult, MatMultAdd are all bandwidth bound.
>>>
>>> The code uses a DMDA 3D grid, 7-point stencil, and defines
>>> nodes(vertices) at the surface or second to the surface as boundary nodes.
>>> Boundary nodes only have a diagonal one in their row in the matrix.
>>> Interior nodes have 7 nonzeros in their row. Boundary processors in the
>>> processor grid has less nonzero. This is one source of load-imbalance. Will
>>> load-imbalance get severer at coarser grids of an MG level?
>>>
>>
>> Yes.
>>
>> You can use a simple Jacobi solver to see the basic performance of your
>> operator and machine. Do you see as much time spent in Vec Scatters?
>> VecAXPY? etc.
>>
>>
>>>
>>> I attach a trace view figure that show activity of each ranks along the
>>> time axis in one KSPSove. White color means MPI wait. You can see white
>>> takes a large space.
>>>
>>> I don't have a good explanation why at large scale (1728 cores),
>>> processors wait longer time, as the communication pattern is still 7-point
>>> stencil in a cubic processor gird.
>>>
>>> --Junchao Zhang
>>>
>>> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
>>> wrote:
>>>

   Junchao,

   Thanks, the load balance of matrix entries is remarkably similar
 for the two runs so it can't be a matter of worse work load imbalance for
 SOR for the larger case explaining why the SOR takes more time.

   Here is my guess (and I know no way to confirm it). In the
 smaller case the overlap of different processes on the same node running
 SOR at the same time is lower than the larger case hence the larger case is
 slower because there are more SOR processes fighting over the same memory
 bandwidth at the same time than in the smaller case.   Ahh, here is
 something you can try, lets undersubscribe the memory bandwidth needs, run
 on say 16 processes per node with 8 nodes and 16 processes per node with 64
 nodes and send the two -log_view output files. I assume this is an LCRC
 machine and NOT a KNL system?

Thanks


Barry


 > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
 >
 > -pc_gamg_type classical
 >
 > FYI, we only support smoothed aggregation "agg" (the default). (This
 thread started by saying you were using GAMG.)
 >
 > It is not clear how much this will make a difference for you, but you
 don't want to use classical because we do not support it. It is meant as a
 reference implementation for developers.
 >
 > First, how did you get the idea to use classical? If the
 documentation lead you to believe this was a good thing to do then we need
 to fix that!
 >
 > Anyway, here is a generic input for GAMG:
 >
 > -pc_type gamg
 > -pc_gamg_type agg
 > -pc_gamg_agg_nsmooths 1
 > -pc_gamg_coarse_eq_limit 1000
 > -pc_gamg_reuse_interpolation true
 > -pc_gamg_square_graph 1
 > -pc_gamg_threshold 0.05
 > -pc_gamg_threshold_scale .0
 >
 >
 >
 >
 > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang 
 wrote:
 > OK, I have thought that space was a typo. btw, this option does not
 show up in -h.
 > I changed number of ranks to use all cores on each node 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-11 Thread Junchao Zhang
Mark,
  I think your idea is good, and I submitted the jobs, but they have been in
the queue for a whole day.

--Junchao Zhang

On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams  wrote:

>
>
> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang 
> wrote:
>
>> I used an LCRC machine named Bebop. I tested on its Intel Broadwell
>> nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using
>> 36 cores in a node or 18 cores in a node.  As you can see, 18 cores/node
>> gave much better performance, which is reasonable as routines like MatSOR,
>> MatMult, MatMultAdd are all bandwidth bound.
>>
>> The code uses a DMDA 3D grid, 7-point stencil, and defines
>> nodes(vertices) at the surface or second to the surface as boundary nodes.
>> Boundary nodes only have a diagonal one in their row in the matrix.
>> Interior nodes have 7 nonzeros in their row. Boundary processors in the
>> processor grid has less nonzero. This is one source of load-imbalance. Will
>> load-imbalance get severer at coarser grids of an MG level?
>>
>
> Yes.
>
> You can use a simple Jacobi solver to see the basic performance of your
> operator and machine. Do you see as much time spent in Vec Scatters?
> VecAXPY? etc.
>
>
>>
>> I attach a trace view figure that show activity of each ranks along the
>> time axis in one KSPSove. White color means MPI wait. You can see white
>> takes a large space.
>>
>> I don't have a good explanation why at large scale (1728 cores),
>> processors wait longer time, as the communication pattern is still 7-point
>> stencil in a cubic processor gird.
>>
>> --Junchao Zhang
>>
>> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
>> wrote:
>>
>>>
>>>   Junchao,
>>>
>>>   Thanks, the load balance of matrix entries is remarkably similar
>>> for the two runs so it can't be a matter of worse work load imbalance for
>>> SOR for the larger case explaining why the SOR takes more time.
>>>
>>>   Here is my guess (and I know no way to confirm it). In the smaller
>>> case the overlap of different processes on the same node running SOR at the
>>> same time is lower than the larger case hence the larger case is slower
>>> because there are more SOR processes fighting over the same memory
>>> bandwidth at the same time than in the smaller case.   Ahh, here is
>>> something you can try, lets undersubscribe the memory bandwidth needs, run
>>> on say 16 processes per node with 8 nodes and 16 processes per node with 64
>>> nodes and send the two -log_view output files. I assume this is an LCRC
>>> machine and NOT a KNL system?
>>>
>>>Thanks
>>>
>>>
>>>Barry
>>>
>>>
>>> > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
>>> >
>>> > -pc_gamg_type classical
>>> >
>>> > FYI, we only support smoothed aggregation "agg" (the default). (This
>>> thread started by saying you were using GAMG.)
>>> >
>>> > It is not clear how much this will make a difference for you, but you
>>> don't want to use classical because we do not support it. It is meant as a
>>> reference implementation for developers.
>>> >
>>> > First, how did you get the idea to use classical? If the documentation
>>> lead you to believe this was a good thing to do then we need to fix that!
>>> >
>>> > Anyway, here is a generic input for GAMG:
>>> >
>>> > -pc_type gamg
>>> > -pc_gamg_type agg
>>> > -pc_gamg_agg_nsmooths 1
>>> > -pc_gamg_coarse_eq_limit 1000
>>> > -pc_gamg_reuse_interpolation true
>>> > -pc_gamg_square_graph 1
>>> > -pc_gamg_threshold 0.05
>>> > -pc_gamg_threshold_scale .0
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang 
>>> wrote:
>>> > OK, I have thought that space was a typo. btw, this option does not
>>> show up in -h.
>>> > I changed number of ranks to use all cores on each node to avoid
>>> misleading ratio in -log_view. Since one node has 36 cores, I ran with
>>> 6^3=216 ranks, and 12^3=1728 ranks. I also found call counts of MatSOR etc
>>> in the two tests were different. So they are not strict weak scaling tests.
>>> I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
>>> two have the same MatSOR count. Anyway, I attached the load balance output.
>>> >
>>> > I find PCApply_MG calls PCMGMCycle_Private, which is recursive and
>>> indirectly calls MatSOR_MPIAIJ. I believe the following code in
>>> MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ  between
>>> processors through VecScatter at each MG level. If SOR and MatMultAdd are
>>> imbalanced, the cost is accumulated along MG levels and shows up as large
>>> VecScatter cost.
>>> > 1460: while
>>> >  (its--) {
>>> >
>>> > 1461:   VecScatterBegin(mat->Mvctx,xx
>>> ,mat->lvec,INSERT_VALUES,SCATTER_FORWARD
>>> > );
>>> >
>>> > 1462:   VecScatterEnd(mat->Mvctx,xx,m
>>> at->lvec,INSERT_VALUES,SCATTER_FORWARD
>>> > );
>>> >
>>> >
>>> > 1464:   /* update rhs: bb1 = bb - B*x */
>>> > 1465:   VecScale
>>> > (mat->lvec,-1.0);
>>> >
>>> > 1466:   (*mat->B->ops->multadd)(mat->
>>> > B,mat->lvec,bb,bb1);
>>> >
>>> >
>>> > 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-11 Thread Mark Adams
On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang  wrote:

> I used an LCRC machine named Bebop. I tested on its Intel Broadwell nodes.
> Each node has 2 CPUs and 36 cores in total. I collected data using 36
> cores in a node or 18 cores in a node.  As you can see, 18 cores/node gave
> much better performance, which is reasonable as routines like MatSOR,
> MatMult, MatMultAdd are all bandwidth bound.
>
> The code uses a DMDA 3D grid, 7-point stencil, and defines nodes (vertices)
> at the surface or second to the surface as boundary nodes. Boundary nodes
> only have a diagonal one in their row in the matrix. Interior nodes have 7
> nonzeros in their row. Boundary processes in the processor grid have fewer
> nonzeros. This is one source of load imbalance. Will load imbalance get
> more severe at the coarser grids of an MG hierarchy?
>

Yes.

You can use a simple Jacobi solver to see the basic performance of your
operator and machine. Do you see as much time spent in Vec Scatters?
VecAXPY? etc.
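
For that baseline, something like the following should be enough (a sketch;
keep your usual problem options and just swap the solver):

-ksp_type cg -ksp_rtol 1e-6 -pc_type jacobi -log_view

Then compare the VecScatter, VecAXPY, and MatMult times against the GAMG runs.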


>
> I attach a trace-view figure that shows the activity of each rank along the
> time axis in one KSPSolve. White means MPI wait. You can see that white
> takes up a large share.
>
> I don't have a good explanation for why, at large scale (1728 cores),
> processors wait longer, as the communication pattern is still a 7-point
> stencil in a cubic processor grid.
>
> --Junchao Zhang
>
> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. 
> wrote:
>
>>
>>   Junchao,
>>
>>   Thanks, the load balance of matrix entries is remarkably similar
>> for the two runs so it can't be a matter of worse work load imbalance for
>> SOR for the larger case explaining why the SOR takes more time.
>>
>>   Here is my guess (and I know no way to confirm it). In the smaller
>> case the overlap of different processes on the same node running SOR at the
>> same time is lower than the larger case hence the larger case is slower
>> because there are more SOR processes fighting over the same memory
>> bandwidth at the same time than in the smaller case.   Ahh, here is
>> something you can try, lets undersubscribe the memory bandwidth needs, run
>> on say 16 processes per node with 8 nodes and 16 processes per node with 64
>> nodes and send the two -log_view output files. I assume this is an LCRC
>> machine and NOT a KNL system?
>>
>>Thanks
>>
>>
>>Barry
>>
>>
>> > On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
>> >
>> > -pc_gamg_type classical
>> >
>> > FYI, we only support smoothed aggregation "agg" (the default). (This
>> thread started by saying you were using GAMG.)
>> >
>> > It is not clear how much this will make a difference for you, but you
>> don't want to use classical because we do not support it. It is meant as a
>> reference implementation for developers.
>> >
>> > First, how did you get the idea to use classical? If the documentation
>> lead you to believe this was a good thing to do then we need to fix that!
>> >
>> > Anyway, here is a generic input for GAMG:
>> >
>> > -pc_type gamg
>> > -pc_gamg_type agg
>> > -pc_gamg_agg_nsmooths 1
>> > -pc_gamg_coarse_eq_limit 1000
>> > -pc_gamg_reuse_interpolation true
>> > -pc_gamg_square_graph 1
>> > -pc_gamg_threshold 0.05
>> > -pc_gamg_threshold_scale .0
>> >
>> >
>> >
>> >
>> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang 
>> wrote:
>> > OK, I have thought that space was a typo. btw, this option does not
>> show up in -h.
>> > I changed number of ranks to use all cores on each node to avoid
>> misleading ratio in -log_view. Since one node has 36 cores, I ran with
>> 6^3=216 ranks, and 12^3=1728 ranks. I also found call counts of MatSOR etc
>> in the two tests were different. So they are not strict weak scaling tests.
>> I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
>> two have the same MatSOR count. Anyway, I attached the load balance output.
>> >
>> > I find PCApply_MG calls PCMGMCycle_Private, which is recursive and
>> indirectly calls MatSOR_MPIAIJ. I believe the following code in
>> MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ  between
>> processors through VecScatter at each MG level. If SOR and MatMultAdd are
>> imbalanced, the cost is accumulated along MG levels and shows up as large
>> VecScatter cost.
>> > 1460: while
>> >  (its--) {
>> >
>> > 1461:   VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCA
>> TTER_FORWARD
>> > );
>> >
>> > 1462:   VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER
>> _FORWARD
>> > );
>> >
>> >
>> > 1464:   /* update rhs: bb1 = bb - B*x */
>> > 1465:   VecScale
>> > (mat->lvec,-1.0);
>> >
>> > 1466:   (*mat->B->ops->multadd)(mat->
>> > B,mat->lvec,bb,bb1);
>> >
>> >
>> > 1468:   /* local sweep */
>> > 1469:   (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,
>> > fshift,lits,1,xx);
>> >
>> > 1470: }
>> >
>> >
>> >
>> > --Junchao Zhang
>> >
>> > On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. 
>> wrote:
>> >
>> >
>> > > On Jun 7, 2018, at 12:27 PM, Zhang, 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-09 Thread Smith, Barry F.


  Junchao,

  Thanks, the load balance of matrix entries is remarkably similar for the
two runs, so it can't be that worse workload imbalance in SOR for the larger
case explains why the SOR takes more time.

  Here is my guess (and I know of no way to confirm it). In the smaller case
the overlap of different processes on the same node running SOR at the same
time is lower than in the larger case, hence the larger case is slower because
there are more SOR processes fighting over the same memory bandwidth at the
same time than in the smaller case.   Ahh, here is something you can try: let's
undersubscribe the memory bandwidth needs. Run on, say, 16 processes per node
with 8 nodes and 16 processes per node with 64 nodes, and send the two -log_view
output files. I assume this is an LCRC machine and NOT a KNL system?
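
For example, the exact launcher syntax depends on your MPI and scheduler, but
with an Open MPI-style mpiexec or with Slurm it could look roughly like this
(a sketch only; ./your_app is a placeholder):

mpiexec -n 128 --map-by ppr:16:node ./your_app -log_view
mpiexec -n 1024 --map-by ppr:16:node ./your_app -log_view

or, equivalently, srun --nodes=8 --ntasks-per-node=16 and
srun --nodes=64 --ntasks-per-node=16.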

   Thanks


   Barry


> On Jun 9, 2018, at 8:29 AM, Mark Adams  wrote:
> 
> -pc_gamg_type classical
> 
> FYI, we only support smoothed aggregation "agg" (the default). (This thread 
> started by saying you were using GAMG.)
> 
> It is not clear how much this will make a difference for you, but you don't 
> want to use classical because we do not support it. It is meant as a 
> reference implementation for developers.
> 
> First, how did you get the idea to use classical? If the documentation lead 
> you to believe this was a good thing to do then we need to fix that!
> 
> Anyway, here is a generic input for GAMG:
> 
> -pc_type gamg 
> -pc_gamg_type agg 
> -pc_gamg_agg_nsmooths 1 
> -pc_gamg_coarse_eq_limit 1000 
> -pc_gamg_reuse_interpolation true 
> -pc_gamg_square_graph 1 
> -pc_gamg_threshold 0.05 
> -pc_gamg_threshold_scale .0 
> 
> 
> 
> 
> On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang  wrote:
> OK, I have thought that space was a typo. btw, this option does not show up 
> in -h.
> I changed number of ranks to use all cores on each node to avoid misleading 
> ratio in -log_view. Since one node has 36 cores, I ran with 6^3=216 ranks, 
> and 12^3=1728 ranks. I also found call counts of MatSOR etc in the two tests 
> were different. So they are not strict weak scaling tests. I tried to add 
> -ksp_max_it 6 -pc_mg_levels 6, but still could not make the two have the same 
> MatSOR count. Anyway, I attached the load balance output.
> 
> I find PCApply_MG calls PCMGMCycle_Private, which is recursive and indirectly 
> calls MatSOR_MPIAIJ. I believe the following code in MatSOR_MPIAIJ 
> practically syncs {MatSOR, MatMultAdd}_SeqAIJ  between processors through 
> VecScatter at each MG level. If SOR and MatMultAdd are imbalanced, the cost 
> is accumulated along MG levels and shows up as large VecScatter cost.
> 1460: while
>  (its--) {
> 
> 1461:   
> VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD
> );
> 
> 1462:   
> VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD
> );
> 
> 
> 1464:   /* update rhs: bb1 = bb - B*x */
> 1465:   VecScale
> (mat->lvec,-1.0);
> 
> 1466:   (*mat->B->ops->multadd)(mat->
> B,mat->lvec,bb,bb1);
> 
> 
> 1468:   /* local sweep */
> 1469:   (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,
> fshift,lits,1,xx);
> 
> 1470: }
> 
> 
> 
> --Junchao Zhang
> 
> On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F.  wrote:
> 
> 
> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao  wrote:
> > 
> > Searched but could not find this option, -mat_view::load_balance
> 
>There is a space between the view and the :   load_balance is a particular 
> viewer format that causes the printing of load balance information about 
> number of nonzeros in the matrix.
> 
>Barry
> 
> > 
> > --Junchao Zhang
> > 
> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F.  wrote:
> >  So the only surprise in the results is the SOR. It is embarrassingly 
> > parallel and normally one would not see a jump.
> > 
> >  The load balance for SOR time 1.5 is better at 1000 processes than for 125 
> > processes of 2.1  not worse so this number doesn't easily explain it.
> > 
> >  Could you run the 125 and 1000 with -mat_view ::load_balance and see what 
> > you get out?
> > 
> >Thanks
> > 
> >  Barry
> > 
> >  Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync is 
> > on. My only guess is that the MatSOR is sharing memory bandwidth (or some 
> > other resource? cores?) with the VecScatter and for some reason this is 
> > worse for 1000 cores but I don't know why.
> > 
> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang  wrote:
> > > 
> > > Hi, PETSc developers,
> > >  I tested Michael Becker's code. The code calls the same KSPSolve 1000 
> > > times in the second stage and needs cubic number of processors to run. I 
> > > ran with 125 ranks and 1000 ranks, with or without -log_sync option. I 
> > > attach the log view output files and a scaling loss excel file.
> > >  I profiled the code with 125 processors. It looks {MatSOR, MatMult, 
> > > MatMultAdd, MatMultTranspose, 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-09 Thread Mark Adams
-pc_gamg_type classical


FYI, we only support smoothed aggregation "agg" (the default). (This thread
started by saying you were using GAMG.)

It is not clear how much this will make a difference for you, but you don't
want to use classical because we do not support it. It is meant as a
reference implementation for developers.

First, how did you get the idea to use classical? If the documentation led
you to believe this was a good thing to do, then we need to fix that!

Anyway, here is a generic input for GAMG:

-pc_type gamg
-pc_gamg_type agg
-pc_gamg_agg_nsmooths 1
-pc_gamg_coarse_eq_limit 1000
-pc_gamg_reuse_interpolation true
-pc_gamg_square_graph 1
-pc_gamg_threshold 0.05
-pc_gamg_threshold_scale .0
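
If it is more convenient, these can also go into a plain text file and be
loaded with -options_file (a sketch only; the file name gamg.opts and
./your_app are placeholders):

./your_app -ksp_type cg -ksp_rtol 1e-6 -options_file gamg.opts -log_view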




On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang  wrote:

> OK, I have thought that space was a typo. btw, this option does not show
> up in -h.
> I changed number of ranks to use all cores on each node to avoid
> misleading ratio in -log_view. Since one node has 36 cores, I ran with
> 6^3=216 ranks, and 12^3=1728 ranks. I also found call counts of MatSOR etc
> in the two tests were different. So they are not strict weak scaling tests.
> I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
> two have the same MatSOR count. Anyway, I attached the load balance output.
>
> I find PCApply_MG calls PCMGMCycle_Private, which is recursive and
> indirectly calls MatSOR_MPIAIJ. I believe the following code in
> MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ  between
> processors through VecScatter at each MG level. If SOR and MatMultAdd are
> imbalanced, the cost is accumulated along MG levels and shows up as large
> VecScatter cost.
>
> 1460: while (its--) {1461:   VecScatterBegin 
> (mat->Mvctx,xx,mat->lvec,INSERT_VALUES
>  
> ,SCATTER_FORWARD
>  
> );1462:
>VecScatterEnd 
> (mat->Mvctx,xx,mat->lvec,INSERT_VALUES
>  
> ,SCATTER_FORWARD
>  
> );
> 1464:   /* update rhs: bb1 = bb - B*x */1465:   VecScale 
> (mat->lvec,-1.0);1466:
>(*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);
> 1468:   /* local sweep */1469:   
> (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP 
> ,fshift,lits,1,xx);1470:
>  }
>
>
>
>
> --Junchao Zhang
>
> On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. 
> wrote:
>
>>
>>
>> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao 
>> wrote:
>> >
>> > Searched but could not find this option, -mat_view::load_balance
>>
>>There is a space between the view and the :   load_balance is a
>> particular viewer format that causes the printing of load balance
>> information about number of nonzeros in the matrix.
>>
>>Barry
>>
>> >
>> > --Junchao Zhang
>> >
>> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. 
>> wrote:
>> >  So the only surprise in the results is the SOR. It is embarrassingly
>> parallel and normally one would not see a jump.
>> >
>> >  The load balance for SOR time 1.5 is better at 1000 processes than for
>> 125 processes of 2.1  not worse so this number doesn't easily explain it.
>> >
>> >  Could you run the 125 and 1000 with -mat_view ::load_balance and see
>> what you get out?
>> >
>> >Thanks
>> >
>> >  Barry
>> >
>> >  Notice that the MatSOR time jumps a lot about 5 secs when the
>> -log_sync is on. My only guess is that the MatSOR is sharing memory
>> bandwidth (or some other resource? cores?) with the VecScatter and for some
>> reason this is worse for 1000 cores but I don't know why.
>> >
>> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang 
>> wrote:
>> > >
>> > > Hi, PETSc developers,
>> > >  I tested Michael Becker's code. The code calls the same KSPSolve
>> 1000 times in the second stage and needs cubic number of processors to run.
>> I ran with 125 ranks and 1000 ranks, with or without -log_sync option. I
>> attach the log view output files and a scaling loss excel file.
>> > >  I profiled the code with 125 processors. It looks {MatSOR, MatMult,
>> MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took
>> ~50% of the time,  The other half time was spent on waiting in MPI.
>> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
>> > >  I tested it on a 36 cores/node machine. I found 32 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

2018-06-07 Thread Junchao Zhang
OK, I had thought that space was a typo. By the way, this option does not show
up in -h.
I changed the number of ranks to use all cores on each node to avoid
misleading ratios in -log_view. Since one node has 36 cores, I ran with
6^3=216 ranks and 12^3=1728 ranks. I also found the call counts of MatSOR etc.
in the two tests were different, so they are not strict weak-scaling tests. I
tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the two
have the same MatSOR count. Anyway, I attached the load balance output.

I find that PCApply_MG calls PCMGMCycle_Private, which is recursive and
indirectly calls MatSOR_MPIAIJ. I believe the following code in
MatSOR_MPIAIJ effectively synchronizes {MatSOR, MatMultAdd}_SeqAIJ between
processes through a VecScatter at each MG level. If SOR and MatMultAdd are
imbalanced, the cost accumulates across MG levels and shows up as a large
VecScatter cost.

  while (its--) {
    VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
    VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);

    /* update rhs: bb1 = bb - B*x */
    VecScale(mat->lvec,-1.0);
    (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);

    /* local sweep */
    (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
  }
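
A minimal sketch (not PETSc code) of how this kind of imbalance could be quantified:
start all ranks together, time the local work (the stand-in do_local_work() below
plays the role of the local SOR sweep), and reduce min/max over the communicator.
The max/min ratio is the imbalance that the slowest rank forces everyone else to
wait out at the next scatter.

#include <mpi.h>
#include <stdio.h>

/* Stand-in for the per-rank local work (e.g. the local SOR sweep); made
   deliberately rank-dependent so the imbalance is visible. */
static void do_local_work(int rank)
{
  volatile double s = 0.0;
  for (long i = 0; i < 1000000L * (1 + rank % 4); i++) s += (double)i;
}

int main(int argc, char **argv)
{
  int    rank;
  double t0, t_local, t_min, t_max;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);   /* start everyone at the same time */
  t0 = MPI_Wtime();
  do_local_work(rank);           /* region suspected of being imbalanced */
  t_local = MPI_Wtime() - t0;

  MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("local work: min %.3e s  max %.3e s  max/min %.2f\n", t_min, t_max, t_max / t_min);

  MPI_Finalize();
  return 0;
}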




--Junchao Zhang

On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F.  wrote:

>
>
> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao  wrote:
> >
> > Searched but could not find this option, -mat_view::load_balance
>
>There is a space between the view and the :   load_balance is a
> particular viewer format that causes the printing of load balance
> information about number of nonzeros in the matrix.
>
>Barry
>
> >
> > --Junchao Zhang
> >
> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. 
> wrote:
> >  So the only surprise in the results is the SOR. It is embarrassingly
> parallel and normally one would not see a jump.
> >
> >  The load balance for SOR time 1.5 is better at 1000 processes than for
> 125 processes of 2.1  not worse so this number doesn't easily explain it.
> >
> >  Could you run the 125 and 1000 with -mat_view ::load_balance and see
> what you get out?
> >
> >Thanks
> >
> >  Barry
> >
> >  Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync
> is on. My only guess is that the MatSOR is sharing memory bandwidth (or
> some other resource? cores?) with the VecScatter and for some reason this
> is worse for 1000 cores but I don't know why.
> >
> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang  wrote:
> > >
> > > Hi, PETSc developers,
> > >  I tested Michael Becker's code. The code calls the same KSPSolve 1000
> times in the second stage and needs cubic number of processors to run. I
> ran with 125 ranks and 1000 ranks, with or without -log_sync option. I
> attach the log view output files and a scaling loss excel file.
> > >  I profiled the code with 125 processors. It looks {MatSOR, MatMult,
> MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took
> ~50% of the time,  The other half time was spent on waiting in MPI.
> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
> > >  I tested it on a 36 cores/node machine. I found 32 ranks/node gave
> better performance (about 10%) than 36 ranks/node in the 125 ranks
> testing.  I guess it is because processors in the former had more balanced
> memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and
> PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files).
> It looks ranks at the two ends have less DP_OPS and TOT_CYC.
> > >  Does anyone familiar with the algorithm have quick explanations?
> > >
> > > --Junchao Zhang
> > >
> > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
> michael.bec...@physik.uni-giessen.de> wrote:
> > > Hello again,
> > >
> > > this took me longer than I anticipated, but here we go.
> > > I did reruns of the cases where only half the processes per node were
> used (without -log_sync):
> > >
> > >                    125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
> > >                    Max        Ratio

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems

2018-06-07 Thread Smith, Barry F.



> On Jun 7, 2018, at 12:27 PM, Zhang, Junchao  wrote:
> 
> Searched but could not find this option, -mat_view::load_balance

   There is a space between the view and the "::"; load_balance is a particular
viewer format that causes the printing of load-balance information about the number
of nonzeros in the matrix.
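
For example, something like the following (reusing the run line from Michael's test;
note that -mat_view and ::load_balance are two separate arguments):

mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 -mat_view ::load_balance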

   Barry

> 
> --Junchao Zhang
> 
> On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F.  wrote:
>  So the only surprise in the results is the SOR. It is embarrassingly 
> parallel and normally one would not see a jump.
> 
>  The load balance for SOR time 1.5 is better at 1000 processes than for 125 
> processes of 2.1  not worse so this number doesn't easily explain it.
> 
>  Could you run the 125 and 1000 with -mat_view ::load_balance and see what 
> you get out?
> 
>Thanks
> 
>  Barry
> 
>  Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync is 
> on. My only guess is that the MatSOR is sharing memory bandwidth (or some 
> other resource? cores?) with the VecScatter and for some reason this is worse 
> for 1000 cores but I don't know why.
> 
> > On Jun 6, 2018, at 9:13 PM, Junchao Zhang  wrote:
> > 
> > Hi, PETSc developers,
> >  I tested Michael Becker's code. The code calls the same KSPSolve 1000 
> > times in the second stage and needs cubic number of processors to run. I 
> > ran with 125 ranks and 1000 ranks, with or without -log_sync option. I 
> > attach the log view output files and a scaling loss excel file.
> >  I profiled the code with 125 processors. It looks {MatSOR, MatMult, 
> > MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took 
> > ~50% of the time,  The other half time was spent on waiting in MPI.  
> > MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
> >  I tested it on a 36 cores/node machine. I found 32 ranks/node gave better 
> > performance (about 10%) than 36 ranks/node in the 125 ranks testing.  I 
> > guess it is because processors in the former had more balanced memory 
> > bandwidth. I collected PAPI_DP_OPS (double precision operations) and 
> > PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files). 
> > It looks ranks at the two ends have less DP_OPS and TOT_CYC. 
> >  Does anyone familiar with the algorithm have quick explanations?
> > 
> > --Junchao Zhang
> > 
> > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker 
> >  wrote:
> > Hello again,
> > 
> > this took me longer than I anticipated, but here we go.
> > I did reruns of the cases where only half the processes per node were used 
> > (without -log_sync):
> > 
> >                    125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
> >                    Max        Ratio   Max        Ratio   Max        Ratio   Max        Ratio
> > KSPSolve           1.203E+02   1.0    1.210E+02   1.0    1.399E+02   1.1    1.365E+02   1.0
> > VecTDot            6.376E+00   3.7    6.551E+00   4.0    7.885E+00   2.9    7.175E+00   3.4
> > VecNorm            4.579E+00   7.1    5.803E+00  10.2    8.534E+00   6.9    6.026E+00   4.9
> > VecScale           1.070E-01   2.1    1.129E-01   2.2    1.301E-01   2.5    1.270E-01   2.4
> > VecCopy            1.123E-01   1.3    1.149E-01   1.3    1.301E-01   1.6    1.359E-01   1.6
> > VecSet             7.063E-01   1.7    6.968E-01   1.7    7.432E-01   1.8    7.425E-01   1.8
> > VecAXPY            1.166E+00   1.4    1.167E+00   1.4    1.221E+00   1.5    1.279E+00   1.6
> > VecAYPX            1.317E+00   1.6    1.290E+00   1.6    1.536E+00   1.9    1.499E+00   2.0
> > VecScatterBegin    6.142E+00   3.2    5.974E+00   2.8    6.448E+00   3.0    6.472E+00   2.9
> > VecScatterEnd      3.606E+01   4.2    3.551E+01   4.0    5.244E+01   2.7    4.995E+01   2.7
> > MatMult            3.561E+01   1.6    3.403E+01   1.5    3.435E+01   1.4    3.332E+01   1.4
> > MatMultAdd         1.124E+01   2.0    1.130E+01   2.1    2.093E+01   2.9    1.995E+01   2.7
> > MatMultTranspose   1.372E+01   2.5    1.388E+01   2.6    1.477E+01   2.2    1.381E+01   2.1
> > MatSolve           1.949E-02   0.0    1.653E-02   0.0    4.789E-02   0.0    4.466E-02   0.0
> > MatSOR             6.610E+01   1.3    6.673E+01   1.3    7.111E+01   1.3    7.105E+01   1.3
> > MatResidual        2.647E+01   1.7    2.667E+01   1.7    2.446E+01   1.4    2.467E+01   1.5
> > PCSetUpOnBlocks    5.266E-03   1.4    5.295E-03   1.4    5.427E-03   1.5    5.289E-03   1.4
> > PCApply            1.031E+02   1.0    1.035E+02   1.0    1.180E+02   1.0    1.164E+02   1.0
> > 
> > I also slimmed down my code and basically wrote a simple weak scaling 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems

2018-06-07 Thread Junchao Zhang
Searched but could not find this option, -mat_view::load_balance

--Junchao Zhang

On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F.  wrote:

>  So the only surprise in the results is the SOR. It is embarrassingly
> parallel and normally one would not see a jump.
>
>  The load balance for SOR time 1.5 is better at 1000 processes than for
> 125 processes of 2.1  not worse so this number doesn't easily explain it.
>
>  Could you run the 125 and 1000 with -mat_view ::load_balance and see what
> you get out?
>
>Thanks
>
>  Barry
>
>  Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync
> is on. My only guess is that the MatSOR is sharing memory bandwidth (or
> some other resource? cores?) with the VecScatter and for some reason this
> is worse for 1000 cores but I don't know why.
>
> > On Jun 6, 2018, at 9:13 PM, Junchao Zhang  wrote:
> >
> > Hi, PETSc developers,
> >  I tested Michael Becker's code. The code calls the same KSPSolve 1000
> times in the second stage and needs cubic number of processors to run. I
> ran with 125 ranks and 1000 ranks, with or without -log_sync option. I
> attach the log view output files and a scaling loss excel file.
> >  I profiled the code with 125 processors. It looks {MatSOR, MatMult,
> MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took
> ~50% of the time,  The other half time was spent on waiting in MPI.
> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
> >  I tested it on a 36 cores/node machine. I found 32 ranks/node gave
> better performance (about 10%) than 36 ranks/node in the 125 ranks
> testing.  I guess it is because processors in the former had more balanced
> memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and
> PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files).
> It looks ranks at the two ends have less DP_OPS and TOT_CYC.
> >  Does anyone familiar with the algorithm have quick explanations?
> >
> > --Junchao Zhang
> >
> > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
> michael.bec...@physik.uni-giessen.de> wrote:
> > Hello again,
> >
> > this took me longer than I anticipated, but here we go.
> > I did reruns of the cases where only half the processes per node were
> used (without -log_sync):
> >
> >                    125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
> >                    Max        Ratio   Max        Ratio   Max        Ratio   Max        Ratio
> > KSPSolve           1.203E+02   1.0    1.210E+02   1.0    1.399E+02   1.1    1.365E+02   1.0
> > VecTDot            6.376E+00   3.7    6.551E+00   4.0    7.885E+00   2.9    7.175E+00   3.4
> > VecNorm            4.579E+00   7.1    5.803E+00  10.2    8.534E+00   6.9    6.026E+00   4.9
> > VecScale           1.070E-01   2.1    1.129E-01   2.2    1.301E-01   2.5    1.270E-01   2.4
> > VecCopy            1.123E-01   1.3    1.149E-01   1.3    1.301E-01   1.6    1.359E-01   1.6
> > VecSet             7.063E-01   1.7    6.968E-01   1.7    7.432E-01   1.8    7.425E-01   1.8
> > VecAXPY            1.166E+00   1.4    1.167E+00   1.4    1.221E+00   1.5    1.279E+00   1.6
> > VecAYPX            1.317E+00   1.6    1.290E+00   1.6    1.536E+00   1.9    1.499E+00   2.0
> > VecScatterBegin    6.142E+00   3.2    5.974E+00   2.8    6.448E+00   3.0    6.472E+00   2.9
> > VecScatterEnd      3.606E+01   4.2    3.551E+01   4.0    5.244E+01   2.7    4.995E+01   2.7
> > MatMult            3.561E+01   1.6    3.403E+01   1.5    3.435E+01   1.4    3.332E+01   1.4
> > MatMultAdd         1.124E+01   2.0    1.130E+01   2.1    2.093E+01   2.9    1.995E+01   2.7
> > MatMultTranspose   1.372E+01   2.5    1.388E+01   2.6    1.477E+01   2.2    1.381E+01   2.1
> > MatSolve           1.949E-02   0.0    1.653E-02   0.0    4.789E-02   0.0    4.466E-02   0.0
> > MatSOR             6.610E+01   1.3    6.673E+01   1.3    7.111E+01   1.3    7.105E+01   1.3
> > MatResidual        2.647E+01   1.7    2.667E+01   1.7    2.446E+01   1.4    2.467E+01   1.5
> > PCSetUpOnBlocks    5.266E-03   1.4    5.295E-03   1.4    5.427E-03   1.5    5.289E-03   1.4
> > PCApply            1.031E+02   1.0    1.035E+02   1.0    1.180E+02   1.0    1.164E+02   1.0
> >
> > I also slimmed down my code and basically wrote a simple weak scaling
> test (source files attached) so you can profile it yourself. I appreciate
> the offer Junchao, thank you.
> > You can adjust the system size per processor at runtime via
> "-nodes_per_proc 30" and the number of repeated calls to the function
> containing KSPsolve() via "-iterations 1000". The physical problem is
> simply calculating the electric potential from a homogeneous charge
> distribution, done multiple times to accumulate time in KSPsolve().
> > A job would be 

Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems

2018-06-07 Thread Smith, Barry F.
 So the only surprise in the results is the SOR. It is embarrassingly parallel 
and normally one would not see a jump.

 The load-balance ratio for SOR time is better (1.5) at 1000 processes than at 125
processes (2.1), not worse, so this number doesn't easily explain it.

 Could you run the 125 and 1000 with -mat_view ::load_balance and see what you 
get out?

   Thanks

 Barry

 Notice that the MatSOR time jumps a lot, about 5 seconds, when -log_sync is on.
My only guess is that MatSOR is sharing memory bandwidth (or some other resource?
cores?) with the VecScatter, and for some reason this is worse at 1000 cores, but
I don't know why.
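
If it helps to see what that number is, here is a rough sketch of the idea (an
illustration only, not the code behind -mat_view ::load_balance), assuming A is the
assembled parallel matrix and every rank owns at least one nonzero:

#include <petscmat.h>

/* Report the max/min ratio of local nonzeros across ranks, the kind of
   load-balance information the load_balance viewer format prints. */
static PetscErrorCode ReportNonzeroBalance(Mat A)
{
  MatInfo     info;
  double      nz_local, nz_min, nz_max;
  MPI_Comm    comm;
  PetscMPIInt rank;

  PetscObjectGetComm((PetscObject)A, &comm);
  MPI_Comm_rank(comm, &rank);
  MatGetInfo(A, MAT_LOCAL, &info);   /* info.nz_used = nonzeros owned by this rank */
  nz_local = (double)info.nz_used;
  MPI_Reduce(&nz_local, &nz_min, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
  MPI_Reduce(&nz_local, &nz_max, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  if (rank == 0) PetscPrintf(PETSC_COMM_SELF, "nonzero load balance (max/min): %g\n", nz_max / nz_min);
  return 0;
}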

> On Jun 6, 2018, at 9:13 PM, Junchao Zhang  wrote:
> 
> Hi, PETSc developers,
>  I tested Michael Becker's code. The code calls the same KSPSolve 1000 times 
> in the second stage and needs cubic number of processors to run. I ran with 
> 125 ranks and 1000 ranks, with or without -log_sync option. I attach the log 
> view output files and a scaling loss excel file.
>  I profiled the code with 125 processors. It looks {MatSOR, MatMult, 
> MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took ~50% 
> of the time,  The other half time was spent on waiting in MPI.  MatSOR_SeqAIJ 
> took 30%, mostly in PetscSparseDenseMinusDot().
>  I tested it on a 36 cores/node machine. I found 32 ranks/node gave better 
> performance (about 10%) than 36 ranks/node in the 125 ranks testing.  I guess 
> it is because processors in the former had more balanced memory bandwidth. I 
> collected PAPI_DP_OPS (double precision operations) and PAPI_TOT_CYC (total 
> cycles) of the 125 ranks case (see the attached files). It looks ranks at the 
> two ends have less DP_OPS and TOT_CYC. 
>  Does anyone familiar with the algorithm have quick explanations?
> 
> --Junchao Zhang
> 
> On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker 
>  wrote:
> Hello again,
> 
> this took me longer than I anticipated, but here we go.
> I did reruns of the cases where only half the processes per node were used 
> (without -log_sync):
> 
>                    125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
>                    Max        Ratio   Max        Ratio   Max        Ratio   Max        Ratio
> KSPSolve           1.203E+02   1.0    1.210E+02   1.0    1.399E+02   1.1    1.365E+02   1.0
> VecTDot            6.376E+00   3.7    6.551E+00   4.0    7.885E+00   2.9    7.175E+00   3.4
> VecNorm            4.579E+00   7.1    5.803E+00  10.2    8.534E+00   6.9    6.026E+00   4.9
> VecScale           1.070E-01   2.1    1.129E-01   2.2    1.301E-01   2.5    1.270E-01   2.4
> VecCopy            1.123E-01   1.3    1.149E-01   1.3    1.301E-01   1.6    1.359E-01   1.6
> VecSet             7.063E-01   1.7    6.968E-01   1.7    7.432E-01   1.8    7.425E-01   1.8
> VecAXPY            1.166E+00   1.4    1.167E+00   1.4    1.221E+00   1.5    1.279E+00   1.6
> VecAYPX            1.317E+00   1.6    1.290E+00   1.6    1.536E+00   1.9    1.499E+00   2.0
> VecScatterBegin    6.142E+00   3.2    5.974E+00   2.8    6.448E+00   3.0    6.472E+00   2.9
> VecScatterEnd      3.606E+01   4.2    3.551E+01   4.0    5.244E+01   2.7    4.995E+01   2.7
> MatMult            3.561E+01   1.6    3.403E+01   1.5    3.435E+01   1.4    3.332E+01   1.4
> MatMultAdd         1.124E+01   2.0    1.130E+01   2.1    2.093E+01   2.9    1.995E+01   2.7
> MatMultTranspose   1.372E+01   2.5    1.388E+01   2.6    1.477E+01   2.2    1.381E+01   2.1
> MatSolve           1.949E-02   0.0    1.653E-02   0.0    4.789E-02   0.0    4.466E-02   0.0
> MatSOR             6.610E+01   1.3    6.673E+01   1.3    7.111E+01   1.3    7.105E+01   1.3
> MatResidual        2.647E+01   1.7    2.667E+01   1.7    2.446E+01   1.4    2.467E+01   1.5
> PCSetUpOnBlocks    5.266E-03   1.4    5.295E-03   1.4    5.427E-03   1.5    5.289E-03   1.4
> PCApply            1.031E+02   1.0    1.035E+02   1.0    1.180E+02   1.0    1.164E+02   1.0
> 
> I also slimmed down my code and basically wrote a simple weak scaling test 
> (source files attached) so you can profile it yourself. I appreciate the 
> offer Junchao, thank you.
> You can adjust the system size per processor at runtime via "-nodes_per_proc 
> 30" and the number of repeated calls to the function containing KSPsolve() 
> via "-iterations 1000". The physical problem is simply calculating the 
> electric potential from a homogeneous charge distribution, done multiple 
> times to accumulate time in KSPsolve().
> A job would be started using something like
> mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 
> -iterations 
