On 7/14/22 03:34, Paras Kumar wrote:
I am working on solving a nonlinear, coupled problem involving a vector
displacement field and a scalar phase-field variable. The code is MPI
parallelized using parallel::distributed::Triangulation (p::d::T) and
TrilinosWrappers for linear algebra.
Usually I use CG+AMG to solve the linear systems for each of the two fields
within a staggered scheme. For certain scenarios, however, the iterative
linear solver fails and we switch to the Amesos_Superludist direct solver.
The code is run on 2 nodes (144 MPI processes in total), and as shown by the
performance monitor, the flop count of one of the nodes drops to (almost)
zero once the switch from the iterative to the direct solver occurs; only one
node then seems to be doing the computations. Please see the attached flops
and memory bandwidth plots; the blue and red lines represent the two nodes.
Similar observations were made for a larger problem run on 8 nodes.
These plots seem to hint that the SuperLU_DIST solver does not scale across
multiple nodes. One possible reason I could think of is that I missed some
option while installing deal.II with Trilinos and SuperLU_DIST using spack.
I also attach the spack spec which I installed on the cluster; the gcc
compiler and the corresponding openmpi@4.1.2 are provided by the cluster.
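For reference, the fallback from the iterative to the direct solver is
implemented roughly as in the following sketch (this is not the actual code;
the function name, iteration limit, and tolerance are placeholders):

#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/trilinos_precondition.h>
#include <deal.II/lac/trilinos_solver.h>
#include <deal.II/lac/trilinos_sparse_matrix.h>
#include <deal.II/lac/trilinos_vector.h>

using namespace dealii;

void solve_linear_system(const TrilinosWrappers::SparseMatrix &system_matrix,
                         TrilinosWrappers::MPI::Vector        &solution,
                         const TrilinosWrappers::MPI::Vector  &rhs)
{
  SolverControl solver_control(1000, 1e-10 * rhs.l2_norm());
  try
    {
      // First attempt: CG preconditioned with AMG.
      TrilinosWrappers::SolverCG        cg(solver_control);
      TrilinosWrappers::PreconditionAMG amg;
      amg.initialize(system_matrix);
      cg.solve(system_matrix, solution, rhs, amg);
    }
  catch (const SolverControl::NoConvergence &)
    {
      // Fallback: distributed direct solve through Amesos/SuperLU_DIST.
      SolverControl                                  direct_control(1, 0.);
      TrilinosWrappers::SolverDirect::AdditionalData data(
        /*output_solver_details=*/false,
        /*solver_type=*/"Amesos_Superludist");
      TrilinosWrappers::SolverDirect direct_solver(direct_control, data);
      direct_solver.solve(system_matrix, solution, rhs);
    }
}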
Paras:
I'm not sure any of us have experience with Amesos/SuperLU, so I don't know
that anyone will have an immediate idea of what the problem may be.
But here are a couple of questions:
* What happens if you run the program with just two MPI processes on one
machine? In that case, you can watch what the two processes are doing by
having 'top' run in a separate window.
* How do you distribute the matrix and right hand side? Are they both fully
distributed? (See the sketch after this list for a quick way to check.)
* Is the solution you get correct?
* If the answer to the last question is yes, then either Amesos or SuperLU is
apparently copying the data of the linear system from all other processes to
just one process that then solves the linear system. It might be useful to
attach a debugger, again running with just two MPI processes, step into the
Amesos routines to see whether you get to a place where that is happening,
and then read the code in that place to see what flags need to be set to make
sure the solve really does happen in a distributed way.
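For the second question, a quick check along the lines of the following
sketch (function and variable names are placeholders) would show how many
matrix rows and right hand side elements each rank owns. If the system is
fully distributed, every rank should report a roughly equal, non-zero share;
if one rank owns (almost) everything, that could already explain the picture
you see:

#include <deal.II/base/mpi.h>
#include <deal.II/lac/trilinos_sparse_matrix.h>
#include <deal.II/lac/trilinos_vector.h>

#include <iostream>

using namespace dealii;

void report_distribution(const TrilinosWrappers::SparseMatrix &system_matrix,
                         const TrilinosWrappers::MPI::Vector  &rhs,
                         const MPI_Comm                        mpi_communicator)
{
  const unsigned int rank =
    Utilities::MPI::this_mpi_process(mpi_communicator);

  // Locally owned row range of the matrix and number of locally owned
  // right hand side entries on this rank.
  const auto row_range = system_matrix.local_range();

  std::cout << "rank " << rank << ": matrix rows [" << row_range.first << ','
            << row_range.second << "), rhs elements owned = "
            << rhs.locally_owned_elements().n_elements() << std::endl;
}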
That's about all I can offer.
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth email: bange...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/