Hello Mark and Matthew!
I attached the log files for the serial and parallel cases and the corresponding
information about the GAMG preconditioner (extracted with grep).
I should note that the global stiffness matrix is assembled in the code with the
MatSetValues subroutine (not MatSetValuesBlocked):
! nnds – number of nodes
! dmn = 3 (DOFs per node)
call MatCreate(Petsc_Comm_World,Mat_K,ierr)
call MatSetFromOptions(Mat_K,ierr)
call MatSetSizes(Mat_K,Petsc_Decide,Petsc_Decide,n,n,ierr_m)   ! n – global number of DOFs
…
call MatMPIAIJSetPreallocation(Mat_K,0,dbw,0,obw,ierr)   ! dbw/obw – diagonal/off-diagonal nonzeros per row
…
call MatSetOption(Mat_K,Mat_New_Nonzero_Allocation_Err,Petsc_False,ierr)
…
do i=1,nels
   call FormLocalK(i,k,indx,"Kp")   ! form local (element) stiffness matrix
   indx=indxmap(indx,2)             ! map element DOFs to global indices
   call MatSetValues(Mat_K,ef_eldof,indx,ef_eldof,indx,k,Add_Values,ierr)
end do
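For clarity, here is a rough sketch of what the blocked variant could look like in our case. It is only an illustration, not tested code: the names ef_elnds (number of nodes per element, so ef_eldof = dmn*ef_elnds) and indxnode (global node indices of the element) do not exist in our program, and it assumes the DOFs are numbered node-interleaved (x1,y1,z1,x2,...) and 0-based after indxmap.

call MatSetBlockSize(Mat_K,dmn,ierr)   ! dmn x dmn blocks, one block row per node
do i=1,nels
   call FormLocalK(i,k,indx,"Kp")          ! element stiffness matrix, as before
   indx=indxmap(indx,2)                    ! global DOF indices, as before
   do j=1,ef_elnds
      indxnode(j)=indx(dmn*(j-1)+1)/dmn    ! node (block) index, assuming interleaved 0-based DOFs
   end do
   call MatSetValuesBlocked(Mat_K,ef_elnds,indxnode,ef_elnds,indxnode,k,Add_Values,ierr)
end do

If I understand the documentation correctly, even without switching to MatSetValuesBlocked, calling only MatSetBlockSize(Mat_K,dmn,ierr) on the AIJ matrix (the local sizes are divisible by dmn) would already let GAMG see dmn DOFs per node.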
The nullspace vector, however, was created with the block size set via the VecSetBlockSize subroutine:
call VecCreate(Petsc_Comm_World,Vec_NullSpace,ierr)
call VecSetBlockSize(Vec_NullSpace,dmn,ierr)
call VecSetSizes(Vec_NullSpace,nnds*dmn,Petsc_Decide,ierr)
call VecSetUp(Vec_NullSpace,ierr)
call VecGetArrayF90(Vec_NullSpace,null_space,ierr)
…
call VecRestoreArrayF90(Vec_NullSpace,null_space,ierr)
call MatNullSpaceCreateRigidBody(Vec_NullSpace,matnull,ierr)
call MatSetNearNullSpace(Mat_K,matnull,ierr)
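The elided part above fills null_space with the nodal coordinates. Schematically, with coord(i,1:dmn) as a placeholder name for how the mesh stores the coordinates of local node i, it amounts to:

do i=1,nnds
   do j=1,dmn
      null_space(dmn*(i-1)+j)=coord(i,j)   ! coord(i,j) is a placeholder; block i of the vector holds the coordinates of node i
   end do
end do

MatNullSpaceCreateRigidBody then uses these coordinates together with the block size dmn to build the six rigid-body modes (three translations and three rotations in 3D).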
I suppose this can be one of the reasons for the slow GAMG convergence.
That is why I also attached log files for a parallel run with the "pure" GAMG preconditioner.
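As a quick check of that guess, the block size the matrix actually carries can be queried right before the solve. A minimal sketch (bs is a local variable introduced here, not one from our code):

PetscInt :: bs
call MatGetBlockSize(Mat_K,bs,ierr)   ! expected to return 1, since only MatSetValues was used
print *, 'Mat_K block size = ', bs    ! printed by every MPI rank

If it prints 1, the matrix itself does not carry the 3-DOFs-per-node information, and GAMG gets it only indirectly through the attached near-nullspace.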
Kind regards,
Viktor Nazdrachev
R&D senior researcher
Geosteering Technologies LLC
On Fri, Sep 3, 2021 at 3:11 PM Matthew Knepley <[email protected]> wrote:
> On Fri, Sep 3, 2021 at 8:02 AM Mark Adams <[email protected]> wrote:
>
>>
>>
>> On Fri, Sep 3, 2021 at 1:57 AM Viktor Nazdrachev <[email protected]>
>> wrote:
>>
>>> Hello, Lawrence!
>>> Thank you for your response!
>>>
>>> I attached the log files (txt files with the convergence behavior; the RAM
>>> usage logs are in separate txt files) and the resulting table with the
>>> convergence investigation data (xls). Data for the main non-regular grid with
>>> 500K cells and heterogeneous properties are in the 500K folder, whereas data
>>> for the simple uniform 125K-cell grid with constant properties are in the 125K folder.
>>>
>>>
>>> >> On 1 Sep 2021, at 09:42, Наздрачёв Виктор <numbersixvs at gmail.com> wrote:
>>>
>>> >>
>>>
>>> >> I have a 3D elasticity problem with heterogeneous properties.
>>>
>>> >
>>>
>>> >What does your coefficient variation look like? How large is the contrast?
>>>
>>>
>>>
>>> Young's modulus varies from 1 to 10 GPa, the Poisson ratio from 0.3 to
>>> 0.44, and the density from 1700 to 2600 kg/m^3.
>>>
>>
>> That is not too bad. Poorly shaped elements are the next thing to worry
>> about. Try to keep the aspect ratio below 10 if possible.
>>
>>
>>>
>>>
>>>
>>>
>>> >> There is unstructured grid with aspect ratio varied from 4 to 25. Zero
>>> >> Dirichlet BCs are imposed on bottom face of mesh. Also, Neumann
>>> >> (traction) BCs are imposed on side faces. Gravity load is also accounted
>>> >> for. The grid I use consists of 500k cells (which is approximately 1.6M
>>> >> of DOFs).
>>>
>>> >>
>>>
>>> >> The best performance and memory usage for single MPI process was
>>> >> obtained with HPDDM(BFBCG) solver and bjacobian + ICC (1) in subdomains
>>> >> as preconditioner, it took 1 m 45 s and RAM 5.0 GB. Parallel computation
>>> >> with 4 MPI processes took 2 m 46 s when using 5.6 GB of RAM. This because
>>> >> of number of iterations required to achieve the same tolerance is
>>> >> significantly increased.
>>>
>>> >
>>>
>>> >How many iterations do you have in serial (and then in parallel)?
>>>
>>>
>>>
>>> The serial run required 112 iterations to reach convergence
>>> (log_hpddm(bfbcg)_bjacobian_icc_1_mpi.txt); the parallel run with 4 MPI
>>> processes required 680 iterations.
>>>
>>>
>>>
>>> I attached log files for all simulations (txt files with the convergence
>>> behavior; the RAM usage logs are in separate txt files) and the resulting
>>> table with the convergence/memory usage data (xls). Data for the main
>>> non-regular grid with 500K cells and heterogeneous properties are in the
>>> 500K folder, whereas data for the simple uniform 125K-cell grid with
>>> constant properties are in the 125K folder.
>>>
>>>
>>>
>>>
>>>
>>> >> I`ve also tried PCGAMG (agg) preconditioner with ICC (1)
>>> >> sub-precondtioner. For single MPI process, the calculation took 10 min
>>> >> and 3.4 GB of RAM. To improve the convergence rate, the nullspace was
>>> >> attached using MatNullSpaceCreateRigidBody and MatSetNearNullSpace
>>> >> subroutines. This has reduced calculation time to 3 m 58 s when using
>>> >> 4.3 GB of RAM. Also, there is peak memory usage with 14.1 GB, which
>>> >> appears just before the start of the iterations. Parallel computation
>>> >> with 4 MPI processes took 2 m 53 s when using 8.4 GB of RAM. In that case
>>> >> the peak memory usage is about 22 GB.
>>>
>>> >
>>>
>>> >Does the number of iterates increase in parallel? Again, how many
>>> >iterations do you have?
>>>
>>>
>>>
>>> For the case with 4 MPI processes and the attached nullspace, 177 iterations
>>> are required to reach convergence (see the detailed log in
>>> log_hpddm(bfbcg)_gamg_nearnullspace_4_mpi.txt). For comparison, 90
>>> iterations are required for the sequential run
>>> (log_hpddm(bfbcg)_gamg_nearnullspace_1_mpi.txt).
>>>
>>>
>> Again, do not use ICC. I am surprised to see such a large jump in
>> iteration count, but get ICC off the table.
>>
>> You will see variability in the iteration count with processor count with
>> GAMG. As much as 10% +-. Maybe more (random) variability, but usually less.
>>
>> You can decrease the memory a little, and the setup time a lot, by
>> aggressively coarsening, at the expense of higher iteration counts. It's a
>> balancing act.
>>
>> You can run with the defaults, add '-info', grep on GAMG and send the ~30
>> lines of output if you want advice on parameters.
>>
>
> Can you send the output of
>
> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>
> Thanks,
>
> Matt
>
>
>> Thanks,
>> Mark
>>
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> >> Are there ways to avoid decreasing of the convergence rate for bjacobi
>>> >> precondtioner in parallel mode? Does it make sense to use hierarchical or
>>> >> nested krylov methods with a local gmres solver (sub_pc_type gmres) and
>>> >> some sub-precondtioner (for example, sub_pc_type bjacobi)?
>>>
>>> >
>>>
>>> >bjacobi is only a one-level method, so you would not expect
>>> >process-independent convergence rate for this kind of problem. If the
>>> >coefficient variation is not too extreme, then I would expect GAMG (or
>>> >some other smoothed aggregation package, perhaps -pc_type ml (you need
>>> >--download-ml)) would work well with some tuning.
>>>
>>>
>>>
>>> Thanks for the idea, but unfortunately ML cannot be compiled with 64-bit
>>> integers (and it is essential for us to perform computations on meshes with
>>> more than 10M cells).
>>>
>>>
>>>
>>>
>>>
>>> >If you have extremely high contrast coefficients you might need something
>>> >with stronger coarse grids. If you can assemble so-called Neumann matrices
>>> >(https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS) then you
>>> >could try the geneo scheme offered by PCHPDDM.
>>>
>>>
>>>
>>>
>>>
>>> I found strange convergence behavior for the HPDDM preconditioner. For 1 MPI
>>> process the BFBCG solver did not converge
>>> (log_hpddm(bfbcg)_pchpddm_1_mpi.txt), while for 4 MPI processes the
>>> computation was successful (1018 iterations to reach convergence,
>>> log_hpddm(bfbcg)_pchpddm_4_mpi.txt).
>>>
>>> But it should be mentioned that the stiffness matrix was created in AIJ
>>> format (the default matrix format in our program).
>>>
>>> Converting the matrix to MATIS format via the MatConvert subroutine resulted
>>> in a loss of convergence for both the serial and the parallel run.
>>>
>>>
>>> >> Is this peak memory usage expected for gamg preconditioner? is there
>>> >> any way to reduce it?
>>>
>>> >
>>>
>>> >I think that peak memory usage comes from building the coarse grids. Can
>>> >you run with `-info` and grep for GAMG, this will provide some output that
>>> >more expert GAMG users can interpret.
>>>
>>>
>>>
>>> Thanks, I'll try to use a strong threshold only for coarse grids.
>>>
>>>
>>>
>>> Kind regards,
>>>
>>>
>>>
>>> Viktor Nazdrachev
>>>
>>>
>>>
>>> R&D senior researcher
>>>
>>>
>>>
>>> Geosteering Technologies LLC
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Sep 1, 2021 at 12:02 PM Lawrence Mitchell <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> > On 1 Sep 2021, at 09:42, Наздрачёв Виктор <[email protected]>
>>>> wrote:
>>>> >
>>>> > I have a 3D elasticity problem with heterogeneous properties.
>>>>
>>>> What does your coefficient variation look like? How large is the
>>>> contrast?
>>>>
>>>> > There is unstructured grid with aspect ratio varied from 4 to 25.
>>>> Zero Dirichlet BCs are imposed on bottom face of mesh. Also, Neumann
>>>> (traction) BCs are imposed on side faces. Gravity load is also accounted
>>>> for. The grid I use consists of 500k cells (which is approximately 1.6M of
>>>> DOFs).
>>>> >
>>>> > The best performance and memory usage for single MPI process was
>>>> obtained with HPDDM(BFBCG) solver and bjacobian + ICC (1) in subdomains as
>>>> preconditioner, it took 1 m 45 s and RAM 5.0 GB. Parallel computation with
>>>> 4 MPI processes took 2 m 46 s when using 5.6 GB of RAM. This because of
>>>> number of iterations required to achieve the same tolerance is
>>>> significantly increased.
>>>>
>>>> How many iterations do you have in serial (and then in parallel)?
>>>>
>>>> > I`ve also tried PCGAMG (agg) preconditioner with ICC (1)
>>>> sub-precondtioner. For single MPI process, the calculation took 10 min and
>>>> 3.4 GB of RAM. To improve the convergence rate, the nullspace was attached
>>>> using MatNullSpaceCreateRigidBody and MatSetNearNullSpace subroutines.
>>>> This has reduced calculation time to 3 m 58 s when using 4.3 GB of RAM.
>>>> Also, there is peak memory usage with 14.1 GB, which appears just before
>>>> the start of the iterations. Parallel computation with 4 MPI processes took
>>>> 2 m 53 s when using 8.4 GB of RAM. In that case the peak memory usage is
>>>> about 22 GB.
>>>>
>>>> Does the number of iterates increase in parallel? Again, how many
>>>> iterations do you have?
>>>>
>>>> > Are there ways to avoid decreasing of the convergence rate for
>>>> bjacobi precondtioner in parallel mode? Does it make sense to use
>>>> hierarchical or nested krylov methods with a local gmres solver
>>>> (sub_pc_type gmres) and some sub-precondtioner (for example, sub_pc_type
>>>> bjacobi)?
>>>>
>>>> bjacobi is only a one-level method, so you would not expect
>>>> process-independent convergence rate for this kind of problem. If the
>>>> coefficient variation is not too extreme, then I would expect GAMG (or some
>>>> other smoothed aggregation package, perhaps -pc_type ml (you need
>>>> --download-ml)) would work well with some tuning.
>>>>
>>>> If you have extremely high contrast coefficients you might need
>>>> something with stronger coarse grids. If you can assemble so-called Neumann
>>>> matrices (
>>>> https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS) then
>>>> you could try the geneo scheme offered by PCHPDDM.
>>>>
>>>> > Is this peak memory usage expected for gamg preconditioner? is there
>>>> any way to reduce it?
>>>>
>>>> I think that peak memory usage comes from building the coarse grids.
>>>> Can you run with `-info` and grep for GAMG, this will provide some output
>>>> that more expert GAMG users can interpret.
>>>>
>>>> Lawrence
>>>>
>>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
true_residual_logs_and_greped.rar
Description: Binary data
