Hi Jessica,

I don't know if this is related to your problem, but I recently saw a
similar problem on a new machine here at LSU that also uses SLURM. Some
jobs would run but others would crash with a segmentation fault. It turned
out that whether a job crashed depended on its memory usage: large jobs
would crash, but small jobs would run. Once I told SLURM to reserve enough
memory, even the large-memory jobs ran.

It is possible that this explains the behaviour you have observed, where
the job runs with 1 MPI process but not with multiple MPI processes. A run
with multiple MPI processes uses more total memory because each process
has to allocate ghost cells.
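
As a rough illustration of the ghost-cell overhead, the sketch below counts the grid points stored when a cubic grid is split evenly among MPI processes. The grid size (120), the ghost width (3), and the helper name total_points are hypothetical and are not taken from the TOV parameter file. For simplicity it pads every block on all six faces, treating outer boundary points like inter-process ghost points.

```shell
#!/bin/bash
# Illustrative only: total grid points stored when an n^3 grid is split into
# p x p x p blocks (one block per MPI process), each padded by g points per face.
# n, p and g are hypothetical values, not read from any parameter file.
total_points() {
    local n=$1 p=$2 g=$3
    local side=$(( n / p + 2 * g ))   # points per block edge, including padding
    echo $(( p * p * p * side * side * side ))
}

# Compare 1, 8 and 64 processes on the same 120^3 grid with 3 ghost points.
for p in 1 2 4; do
    echo "processes=$(( p * p * p )) points=$(total_points 120 "$p" 3)"
done
```

Under these assumptions, 8 processes store roughly 15% more points than 1 process, and the overhead grows as the blocks get smaller.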

Can you try adding something like

#SBATCH --mem=64G

to your SLURM submit script and see if that fixes your problem? I don't
know how much memory your run uses or how much memory is available per
node, so you'll have to put in a reasonable value.
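
One possible way to arrive at a reasonable value is sketched below. The helper mem_request_gb is hypothetical (it is not part of SLURM), and the per-rank memory, ranks per node, and 20% default headroom are assumptions to be replaced with numbers from your own runs.

```shell
#!/bin/bash
# Hypothetical helper: estimate a per-node --mem request in whole gigabytes.
#   $1 = estimated memory per MPI rank (GB, integer)
#   $2 = MPI ranks per node
#   $3 = headroom in percent (optional, defaults to 20)
mem_request_gb() {
    local per_rank_gb=$1 ranks_per_node=$2 headroom_pct=${3:-20}
    # Integer arithmetic, rounded up to the next whole GB.
    echo $(( (per_rank_gb * ranks_per_node * (100 + headroom_pct) + 99) / 100 ))
}

# Example with assumed numbers: 12 GB per rank, 4 ranks per node.
echo "#SBATCH --mem=$(mem_request_gb 12 4)G"

# After a job has finished, SLURM accounting (if enabled on the cluster) can
# report what was actually used; replace <jobid> with the real job ID:
#   sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State
```

The result has to stay below the memory physically available on a node, which the helper does not check.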

Cheers,

  Peter


On Thu, 11 Aug 2022, Warren, Jessica Sawyer wrote:

Hi Roland,

The admins reinstalled openmpi and it now runs the hello script correctly. 
However, the Toolkit would still produce seg faults after srun.  Switching
to mvapich seems to have largely done the trick though, as the TOV job is
now able to start executing.  As long as there is only 1 MPI process (with
however many threads), the TOV job runs to completion correctly.  However,
anytime there are multiple MPI processes, it crashes at the first time
iteration:

INFO (TOVSolver): Done interpolation.
---------------------------------------------------------------------------

Iteration      Time |              ADMBASE::alp |            HYDROBASE::rho
                    |      minimum      maximum |      minimum      maximum
---------------------------------------------------------------------------

        0     0.000 |    0.6698612    0.9966374 | 1.000000e-10    0.0012800
Rank 1 with PID 3964893 received signal 11
Writing backtrace to static_tov/backtrace.1.txt
srun: error: c40: task 1: Segmentation fault (core dumped)

The backtrace is attached, as well as the last portion of the output, and it
looks like the issue is tied to Carpet.  Are there some settings in the
parameter file that need adjusting or setting to fix this?  Or perhaps
specific settings for the number of ranks and threads?

Thank you,
Jessica


Dr. Jessica S. Warren
Physics Lecturer
Indiana University Northwest
warre...@iun.edu

____________________________________________________________________________
From: Roland Haas
Sent: Thursday, August 11, 2022 8:32 AM
To: Warren, Jessica Sawyer
Cc: users@einsteintoolkit.org
Subject: Re: [Users] [External] Re: Running with SLURM

Hello Jessica,

If you get the same error from hello-world and from Cactus then it
would seem that there is still something off with the MPI stack.

The -lmpi_cxx option instructs the linker to link in the C++ bindings for
MPI. The hello world example is C code, so this is not required and
-lmpi alone is sufficient.

I see two options that would let you get running somewhat quickly:

1. report your issues with OpenMPI and hello-world (including link to
the source code on the web, and the exact command line to compile) to
the admins and ask them for help

1.5 instead of using gcc to compile for OpenMPI, use the official MPI
compiler wrapper mpicc, which would just be:

mpicc -o hello hello.c

that is, you do not have to pass any library or include options. If this
fails, I would definitely talk to the admins.

2. compile hello-world using mvapich. For this the easiest way is to
make sure to load the mvapich module and then use the same compiler
wrapper invocation to compile:

mpicc -o hello hello.c

If 2 works then you can also compile the Einstein Toolkit with mvapich.
You have to make sure to load the correct module before compiling the
toolkit and then ExternalLibraries/MPI should figure out (from the
mpicc wrapper) how to compile the toolkit.
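
To check which compiler and libraries a loaded mpicc wrapper actually uses, a small sketch like the following may help. The function name show_wrapper_command is made up for illustration. It relies on MPICH-derived wrappers (including MVAPICH) accepting -show and Open MPI wrappers accepting --showme.

```shell
#!/bin/bash
# Print the underlying compile/link command that an MPI compiler wrapper
# would run, without compiling anything.
show_wrapper_command() {
    local wrapper=${1:-mpicc}
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper not found; load an MPI module first" >&2
        return 1
    fi
    # Try the MPICH/MVAPICH spelling first, then fall back to Open MPI's.
    "$wrapper" -show 2>/dev/null || "$wrapper" --showme
}

# Usage: report on mpicc only if one is currently on the PATH.
if command -v mpicc >/dev/null 2>&1; then
    show_wrapper_command mpicc
fi
```

After loading the site's MVAPICH module (the module name varies by cluster), the printed command should point at the MVAPICH include and library directories rather than the OpenMPI ones.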

Yours,
Roland


> Hi Roland,
>
> Thank you so much.  The compute nodes are able to be used for
> compilation, and the directories match what is listed in
> make.MPI.defn.  When doing the 'hello' example you linked to, it was
> unable to compile due to a linker error (/usr/bin/ld: cannot find
> -lmpi_cxx).  I re-ran it in verbose mode and found the directory it
> was searching did exist and did have lmpi but not lmpi_cxx.  The
> admins said they had had some issues installing openmpi (couldn't
> recall exactly what), and recommended mvapich (since that does have
> lmpicxx installed and is their preferred implementation).  However,
> they reinstalled openmpi in an effort to get that to work and it did
> allow the 'hello' script to compile, but when executed it produced:
>
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:           h1
>   Local device:         mlx5_0
>   Local port:           1
>   CPCs attempted:       rdmacm, udcm
> --------------------------------------------------------------------------
> Hello world from processor h1.quartz.uits.iu.edu, rank 0 out of 1
> processors
>
> Similarly, doing the TOV job via sbatch, after the srun command it
> gave the same OpenFabrics message (for each MPI rank) and then the
> same segmentation faults as before.  I've contacted the admins about
> this and am waiting to hear back.  Do you have any recommendations -
> perhaps it would be easier to try switching over to mvapich?  If so,
> could you point me to some resources on how to reconfigure?
>
> Thank you,
> Jessica
>
> Dr. Jessica S. Warren
> Physics Lecturer
> Indiana University Northwest
> warre...@iun.edu
> ________________________________
> From: Roland Haas <rh...@illinois.edu>
> Sent: Tuesday, August 9, 2022 9:48 AM
> To: Warren, Jessica Sawyer <warre...@iun.edu>
> Cc: users@einsteintoolkit.org <users@einsteintoolkit.org>
> Subject: [External] Re: [Users] Running with SLURM
>
> Hello Jessica,
>
> You may also find something useful in the setting up a new machine
> seminar presentation:
>
> https://www.einsteintoolkit.org/seminars/2022_02_24/index.html
>
> Yours,
> Roland
>
> --
> My email is as private as my paper mail. I therefore support
> encrypting and signing email messages. Get my PGP key from
> http://pgp.mit.edu .


--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .

_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
