Gilles, Joachim,
The command line to launch my application is:
mpiexec --mca orte_base_help_aggregate 0 \
--enable-recovery \
--mca mpi_param_check 1 \
--v \
--wdir ${work_dir} \
--hostfile ${MY_NODEFILE} \
--np ${num_proc} \
--map-by ppr:1:node \
executable \
… application specific args
Thanks,
Kurt
From: users <[email protected]> On Behalf Of Gilles Gouaillardet
via users
Sent: Friday, June 16, 2023 11:05 PM
To: Open MPI Users <[email protected]>
Cc: Gilles Gouaillardet <[email protected]>
Subject: [EXTERNAL] [BULK] Re: [OMPI users] OpenMPI crashes with TCP connection
error
CAUTION: This email originated from outside of NASA. Please take care when
clicking links or opening attachments. Use the "Report Message" button to
report suspicious messages to the NASA SOC.
Kurt,
I think Joachim was also asking for the command line used to launch your
application.
Since you are using Slurm and MPI_Comm_spawn(), it is important to understand
whether you are using mpirun or srun.
FWIW, --mpi=pmix is an srun option. You can run srun --mpi=list to find the
available options.
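As a minimal sketch of that check: the helper name list_srun_mpi_types is hypothetical, and what gets listed depends on how Slurm was built on your cluster. It only runs srun --mpi=list when srun is on the PATH. The ./executable in the trailing comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: list the MPI plugin types that srun reports on this cluster.
# list_srun_mpi_types is a hypothetical helper name, not part of Slurm.
list_srun_mpi_types() {
    if command -v srun >/dev/null 2>&1; then
        # --mpi=list prints the available MPI plugin types and exits.
        srun --mpi=list 2>&1
    else
        echo "srun not found in PATH"
    fi
}

list_srun_mpi_types

# If pmix shows up in that list, the option belongs on the srun line inside
# the batch script (not on the sbatch command line), for example:
#   srun --mpi=pmix ./executable
```

This only tells you what srun can offer. Whether it applies at all depends on whether the job steps are started with srun or with mpirun/mpiexec.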
Cheers,
Gilles
On Sat, Jun 17, 2023 at 2:53 AM Mccall, Kurt E. (MSFC-EV41) via users
<[email protected]> wrote:
Joachim,
Sorry to make you resort to divination. My sbatch command is as follows:
sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384 --job-name $job_name
--exclusive --no-kill --verbose $release_dir/script.bash &
--mpi=pmix isn’t an option recognized by sbatch. Is there an alternative?
The Slurm doc you mentioned has the following paragraph. Is it still true with
Open MPI 4.1.5?
“NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn()
from within a Slurm allocation. If you need to use the MPI_Comm_spawn()
function you will need to use another MPI implementation combined with PMI-2
since PMIx doesn't support it either.”
I use MPI_Comm_spawn extensively in my application.
Thanks,
Kurt
From: Jenke, Joachim <[email protected]>
Sent: Thursday, June 15, 2023 5:33 PM
To: Open MPI Users <[email protected]>
Cc: Mccall, Kurt E. (MSFC-EV41) <[email protected]>
Subject: [EXTERNAL] Re: OpenMPI crashes with TCP connection error
Hi Kurt,
Without knowing your exact MPI launch command, my crystal ball thinks you might
want to try the --mpi=pmix flag for srun, as documented for Slurm + Open MPI:
https://slurm.schedmd.com/mpi_guide.html#open_mpi
-Joachim
________________________________
From: users <[email protected]> on behalf of Mccall, Kurt E. (MSFC-EV41)
via users <[email protected]>
Sent: Thursday, June 15, 2023 11:56:28 PM
To: [email protected] <[email protected]>
Cc: Mccall, Kurt E. (MSFC-EV41) <[email protected]>
Subject: [OMPI users] OpenMPI crashes with TCP connection error
My job immediately crashes with the error message below. I don’t know where
to begin looking for the cause of the error, or what information to provide
to help you understand it. Maybe you could clue me in 😊.
I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc
8.5.0.
I built OpenMPI with the following “configure” command:
./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug
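To see what a build configured this way actually picked up, one option is to filter the ompi_info component listing for Slurm and PMIx entries. This is a sketch only: show_ompi_launch_support is a hypothetical helper name, and the binary path is an assumption derived from the --prefix above.

```shell
#!/usr/bin/env bash
# Sketch only: show which Slurm- and PMIx-related entries ompi_info reports.
# show_ompi_launch_support is a hypothetical helper name.
show_ompi_launch_support() {
    local ompi_info_bin="${1:-ompi_info}"
    if command -v "$ompi_info_bin" >/dev/null 2>&1; then
        # Filter the listing for Slurm and PMIx entries (case-insensitive).
        "$ompi_info_bin" | grep -i -E 'slurm|pmix' \
            || echo "no slurm/pmix components listed"
    else
        echo "ompi_info not found: $ompi_info_bin"
    fi
}

# Assumed location, based on --prefix=/opt/openmpi/4.1.5_gnu.
show_ompi_launch_support /opt/openmpi/4.1.5_gnu/bin/ompi_info
```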
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: n001
PID: 985481