That does not seem to be an ASCII file.

On Fri, Sep 3, 2021 at 10:48 AM Viktor Nazdrachev <[email protected]> wrote:
> Hello Mark and Matthew!
>
> I attached log files for the serial and parallel cases and the corresponding information about the GAMG preconditioner (obtained with grep).
>
> I have to note that the global stiffness matrix is assembled in the code with the MatSetValues subroutine (not MatSetValuesBlocked):
>
> !nnds – number of nodes
> !dmn=3
> call MatCreate(Petsc_Comm_World,Mat_K,ierr)
> call MatSetFromOptions(Mat_K,ierr)
> call MatSetSizes(Mat_K,Petsc_Decide,Petsc_Decide,n,n,ierr_m)
> …
> call MatMPIAIJSetPreallocation(Mat_K,0,dbw,0,obw,ierr)
> …
> call MatSetOption(Mat_K,Mat_New_Nonzero_Allocation_Err,Petsc_False,ierr)
> …
> do i=1,nels
>    call FormLocalK(i,k,indx,"Kp")  ! find local stiffness matrix
>    indx=indxmap(indx,2)            ! find global indices for DOFs
>    call MatSetValues(Mat_K,ef_eldof,indx,ef_eldof,indx,k,Add_Values,ierr)
> end do
>
> But the null-space vector was created using the VecSetBlockSize subroutine:
>
> call VecCreate(Petsc_Comm_World,Vec_NullSpace,ierr)
> call VecSetBlockSize(Vec_NullSpace,dmn,ierr)
> call VecSetSizes(Vec_NullSpace,nnds*dmn,Petsc_Decide,ierr)
> call VecSetUp(Vec_NullSpace,ierr)
> call VecGetArrayF90(Vec_NullSpace,null_space,ierr)
> …
> call VecRestoreArrayF90(Vec_NullSpace,null_space,ierr)
> call MatNullSpaceCreateRigidBody(Vec_NullSpace,matnull,ierr)
> call MatSetNearNullSpace(Mat_K,matnull,ierr)
>
> I suppose this mismatch may be one of the reasons for the slow GAMG convergence (see the block-size sketch further below).
>
> So I attached log files for a parallel run with a "pure" GAMG preconditioner.
>
> Kind regards,
>
> Viktor Nazdrachev
>
> R&D senior researcher
>
> Geosteering Technologies LLC
>
> On Fri, 3 Sep 2021 at 15:11, Matthew Knepley <[email protected]> wrote:
>
>> On Fri, Sep 3, 2021 at 8:02 AM Mark Adams <[email protected]> wrote:
>>
>>> On Fri, Sep 3, 2021 at 1:57 AM Viktor Nazdrachev <[email protected]> wrote:
>>>
>>>> Hello, Lawrence!
>>>> Thank you for your response!
>>>>
>>>> I attached log files (txt files with the convergence behavior and the RAM usage log in separate txt files) and the resulting table with the convergence investigation data (xls). Data for the main non-regular grid with 500K cells and heterogeneous properties are in the 500K folder, whereas data for the simple uniform 125K-cell grid with constant properties are in the 125K folder.
>>>>
>>>> >> On 1 Sep 2021, at 09:42, Наздрачёв Виктор <[email protected]> wrote:
>>>> >>
>>>> >> I have a 3D elasticity problem with heterogeneous properties.
>>>> >
>>>> > What does your coefficient variation look like? How large is the contrast?
>>>>
>>>> Young's modulus varies from 1 to 10 GPa, Poisson's ratio varies from 0.3 to 0.44, and density from 1700 to 2600 kg/m^3.
>>>
>>> That is not too bad. Poorly shaped elements are the next thing to worry about. Try to keep the aspect ratio below 10 if possible.
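A minimal sketch of the block-size change hinted at above, illustrative only: it assumes dmn = 3 DOFs per node and reuses the Mat_K, dbw, obw and ierr names from the quoted snippet.

    ! Hedged sketch: set the matrix block size before preallocation so that
    ! GAMG and the rigid-body near-null space agree on 3 DOFs per node.
    call MatSetBlockSize(Mat_K,dmn,ierr)
    call MatMPIAIJSetPreallocation(Mat_K,0,dbw,0,obw,ierr)
    ! Assembly can keep using MatSetValues with global DOF indices;
    ! MatSetValuesBlocked would instead take node (block) indices.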
>>>> >> The grid is unstructured, with element aspect ratios varying from 4 to 25. Zero Dirichlet BCs are imposed on the bottom face of the mesh, and Neumann (traction) BCs are imposed on the side faces. Gravity load is also accounted for. The grid I use consists of 500k cells (approximately 1.6M DOFs).
>>>> >>
>>>> >> The best performance and memory usage for a single MPI process was obtained with the HPDDM (BFBCG) solver and bjacobi + ICC(1) on the subdomains as preconditioner; it took 1 m 45 s and 5.0 GB of RAM. Parallel computation with 4 MPI processes took 2 m 46 s while using 5.6 GB of RAM, because the number of iterations required to reach the same tolerance increases significantly.
>>>> >
>>>> > How many iterations do you have in serial (and then in parallel)?
>>>>
>>>> The serial run required 112 iterations to reach convergence (log_hpddm(bfbcg)_bjacobian_icc_1_mpi.txt); the parallel run with 4 MPI processes required 680 iterations.
>>>>
>>>> I attached log files for all simulations (txt files with the convergence behavior and the RAM usage log in separate txt files) and the resulting table with the convergence/memory usage data (xls). Data for the main non-regular grid with 500K cells and heterogeneous properties are in the 500K folder, whereas data for the simple uniform 125K-cell grid with constant properties are in the 125K folder.
>>>>
>>>> >> I've also tried the PCGAMG (agg) preconditioner with an ICC(1) sub-preconditioner. For a single MPI process, the calculation took 10 min and 3.4 GB of RAM. To improve the convergence rate, the null space was attached using the MatNullSpaceCreateRigidBody and MatSetNearNullSpace subroutines. This reduced the calculation time to 3 m 58 s while using 4.3 GB of RAM. There is also a peak memory usage of 14.1 GB, which appears just before the start of the iterations. Parallel computation with 4 MPI processes took 2 m 53 s while using 8.4 GB of RAM. In that case the peak memory usage is about 22 GB.
>>>> >
>>>> > Does the number of iterates increase in parallel? Again, how many iterations do you have?
>>>>
>>>> For the case with 4 MPI processes and the attached null space, 177 iterations are required to reach convergence (see the detailed log in log_hpddm(bfbcg)_gamg_nearnullspace_4_mpi.txt). For comparison, 90 iterations are required for the sequential run (log_hpddm(bfbcg)_gamg_nearnullspace_1_mpi.txt).
>>>
>>> Again, do not use ICC. I am surprised to see such a large jump in iteration count, but get ICC off the table.
>>>
>>> You will see variability in the iteration count with processor count with GAMG. As much as +/-10%, maybe more (random) variability, but usually less.
>>>
>>> You can decrease the memory a little, and the setup time a lot, by coarsening aggressively, at the expense of higher iteration counts. It's a balancing act.
>>>
>>> You can run with the defaults, add '-info', grep on GAMG, and send the ~30 lines of output if you want advice on parameters.
>>
>> Can you send the output of
>>
>> -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>>
>> Thanks,
>>
>>     Matt
>>
>>> Thanks,
>>> Mark
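For completeness, a hedged example of a run that combines Mark's '-info' suggestion with the monitoring options Matt asks for; the executable name, process count and log file name are placeholders, while the PETSc options are the ones quoted above:

    mpiexec -n 4 ./my_solver -pc_type gamg \
        -ksp_view -ksp_monitor_true_residual -ksp_converged_reason \
        -info 2>&1 | tee gamg_run.log
    grep GAMG gamg_run.log    # roughly the ~30 GAMG lines Mark refers to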
>>>> >> Are there ways to avoid the decrease of the convergence rate for the bjacobi preconditioner in parallel mode? Does it make sense to use hierarchical or nested Krylov methods with a local GMRES solver (sub_pc_type gmres) and some sub-preconditioner (for example, sub_pc_type bjacobi)?
>>>> >
>>>> > bjacobi is only a one-level method, so you would not expect a process-independent convergence rate for this kind of problem. If the coefficient variation is not too extreme, then I would expect GAMG (or some other smoothed aggregation package, perhaps -pc_type ml (you need --download-ml)) would work well with some tuning.
>>>>
>>>> Thanks for the idea, but, unfortunately, ML cannot be compiled with 64-bit integers (which is essential for us, since computations on meshes with more than 10M cells are required).
>>>>
>>>> > If you have extremely high contrast coefficients you might need something with stronger coarse grids. If you can assemble so-called Neumann matrices (https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS) then you could try the geneo scheme offered by PCHPDDM.
>>>>
>>>> I found strange convergence behavior for the HPDDM preconditioner. For 1 MPI process the BFBCG solver did not converge (log_hpddm(bfbcg)_pchpddm_1_mpi.txt), while for 4 MPI processes the computation was successful (1018 iterations to reach convergence, log_hpddm(bfbcg)_pchpddm_4_mpi.txt).
>>>>
>>>> But it should be mentioned that the stiffness matrix was created in AIJ format (our default matrix format in the program). Converting the matrix to MATIS format via the MatConvert subroutine resulted in loss of convergence for both the serial and the parallel run. (A sketch of assembling directly in MATIS format is included at the end of this message.)
>>>>
>>>> >> Is this peak memory usage expected for the gamg preconditioner? Is there any way to reduce it?
>>>> >
>>>> > I think that peak memory usage comes from building the coarse grids. Can you run with `-info` and grep for GAMG, this will provide some output that more expert GAMG users can interpret.
>>>>
>>>> Thanks, I'll try to use a strong threshold only for the coarse grids.
>>>>
>>>> Kind regards,
>>>>
>>>> Viktor Nazdrachev
>>>>
>>>> R&D senior researcher
>>>>
>>>> Geosteering Technologies LLC
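As an aside, a hedged illustration of what "a strong threshold only for the coarse grids" might look like on the command line: PCGAMGSetThreshold accepts one value per level (finest level first), and levels beyond the list can be scaled with -pc_gamg_threshold_scale. The numbers below are placeholders, and the option names should be checked against the installed PETSc version:

    -pc_type gamg -pc_gamg_threshold 0.0,0.0,0.04 -pc_gamg_threshold_scale 2.0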
>>>> On Wed, 1 Sep 2021 at 12:02, Lawrence Mitchell <[email protected]> wrote:
>>>>>
>>>>> > On 1 Sep 2021, at 09:42, Наздрачёв Виктор <[email protected]> wrote:
>>>>> >
>>>>> > I have a 3D elasticity problem with heterogeneous properties.
>>>>>
>>>>> What does your coefficient variation look like? How large is the contrast?
>>>>>
>>>>> > The grid is unstructured, with element aspect ratios varying from 4 to 25. Zero Dirichlet BCs are imposed on the bottom face of the mesh, and Neumann (traction) BCs are imposed on the side faces. Gravity load is also accounted for. The grid I use consists of 500k cells (approximately 1.6M DOFs).
>>>>> >
>>>>> > The best performance and memory usage for a single MPI process was obtained with the HPDDM (BFBCG) solver and bjacobi + ICC(1) on the subdomains as preconditioner; it took 1 m 45 s and 5.0 GB of RAM. Parallel computation with 4 MPI processes took 2 m 46 s while using 5.6 GB of RAM, because the number of iterations required to reach the same tolerance increases significantly.
>>>>>
>>>>> How many iterations do you have in serial (and then in parallel)?
>>>>>
>>>>> > I've also tried the PCGAMG (agg) preconditioner with an ICC(1) sub-preconditioner. For a single MPI process, the calculation took 10 min and 3.4 GB of RAM. To improve the convergence rate, the null space was attached using the MatNullSpaceCreateRigidBody and MatSetNearNullSpace subroutines. This reduced the calculation time to 3 m 58 s while using 4.3 GB of RAM. There is also a peak memory usage of 14.1 GB, which appears just before the start of the iterations. Parallel computation with 4 MPI processes took 2 m 53 s while using 8.4 GB of RAM. In that case the peak memory usage is about 22 GB.
>>>>>
>>>>> Does the number of iterates increase in parallel? Again, how many iterations do you have?
>>>>>
>>>>> > Are there ways to avoid the decrease of the convergence rate for the bjacobi preconditioner in parallel mode? Does it make sense to use hierarchical or nested Krylov methods with a local GMRES solver (sub_pc_type gmres) and some sub-preconditioner (for example, sub_pc_type bjacobi)?
>>>>>
>>>>> bjacobi is only a one-level method, so you would not expect a process-independent convergence rate for this kind of problem. If the coefficient variation is not too extreme, then I would expect GAMG (or some other smoothed aggregation package, perhaps -pc_type ml (you need --download-ml)) would work well with some tuning.
>>>>>
>>>>> If you have extremely high contrast coefficients you might need something with stronger coarse grids. If you can assemble so-called Neumann matrices (https://petsc.org/release/docs/manualpages/Mat/MATIS.html#MATIS) then you could try the geneo scheme offered by PCHPDDM.
>>>>>
>>>>> > Is this peak memory usage expected for the gamg preconditioner? Is there any way to reduce it?
>>>>>
>>>>> I think that peak memory usage comes from building the coarse grids. Can you run with `-info` and grep for GAMG, this will provide some output that more expert GAMG users can interpret.
>>>>>
>>>>> Lawrence
>>
>> --
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
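Returning to the MATIS point above: converting an already assembled AIJ matrix with MatConvert generally does not recover the unassembled local (Neumann) matrices that the GenEO coarse space in PCHPDDM relies on, which may explain the loss of convergence after the conversion. Below is a rough, unverified sketch of assembling directly in MATIS format; nloc, gidx, dnnz, onnz and l2g are hypothetical names, the element loop reuses names from the snippet at the top of the thread, and the exact Fortran signatures should be checked against the PETSc man pages.

    ! Hedged sketch: assemble as MATIS so that every process keeps its unassembled
    ! (Neumann) local matrix. nloc is the number of local nodes and gidx their
    ! global node numbers; the map must also cover interface nodes shared with
    ! neighbouring subdomains.
    call ISLocalToGlobalMappingCreate(Petsc_Comm_World,dmn,nloc,gidx,Petsc_Copy_Values,l2g,ierr)
    call MatCreate(Petsc_Comm_World,Mat_K,ierr)
    call MatSetSizes(Mat_K,Petsc_Decide,Petsc_Decide,n,n,ierr)
    call MatSetType(Mat_K,MATIS,ierr)
    call MatSetLocalToGlobalMapping(Mat_K,l2g,l2g,ierr)
    call MatISSetPreallocation(Mat_K,0,dnnz,0,onnz,ierr)
    do i=1,nels
       call FormLocalK(i,k,indx,"Kp")   ! local element stiffness matrix
       ! indx must stay in the local (subdomain) numbering here, i.e. no indxmap()
       call MatSetValuesLocal(Mat_K,ef_eldof,indx,ef_eldof,indx,k,Add_Values,ierr)
    end do
    call MatAssemblyBegin(Mat_K,Mat_Final_Assembly,ierr)
    call MatAssemblyEnd(Mat_K,Mat_Final_Assembly,ierr)

With a genuine MATIS operator, -pc_type hpddm should then be able to build the GenEO coarse space from the local Neumann matrices, as Lawrence suggests.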
