Joachim,

Sorry to make you resort to divination.   My sbatch command is as follows:

sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384 --job-name $job_name \
    --exclusive --no-kill --verbose $release_dir/script.bash &

--mpi=pmix isn’t an option recognized by sbatch. Is there an alternative?
The Slurm documentation you mentioned has the following paragraph. Is it
still true with OpenMPI 4.1.5?

“NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() 
from within a Slurm allocation. If you need to use the MPI_Comm_spawn() 
function you will need to use another MPI implementation combined with PMI-2 
since PMIx doesn't support it either.”

I use MPI_Comm_spawn extensively in my application.

Thanks,
Kurt


From: Jenke, Joachim <je...@itc.rwth-aachen.de>
Sent: Thursday, June 15, 2023 5:33 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: OpenMPI crashes with TCP connection error

Hi Kurt,

Without knowing your exact MPI launch command, my crystal ball thinks you might
want to try the --mpi=pmix flag for srun, as documented for Slurm + Open MPI:
https://slurm.schedmd.com/mpi_guide.html#open_mpi
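
As a rough sketch only: --mpi is an srun option, so it would go on the srun
line inside the batch script rather than on the sbatch command line. The
contents of the actual batch script are not shown in this thread, so the
executable name ./my_mpi_app below is a hypothetical placeholder, and whether
"pmix" is available at all depends on how Slurm was built on the cluster
(srun --mpi=list reports the supported types).

```shell
#!/bin/sh
# Sketch of a batch-script body. Assumptions: the application name is a
# placeholder, and the local Slurm build offers the pmix plugin.
app=./my_mpi_app          # hypothetical executable, not from the thread
mpi_type=pmix             # plugin type suggested in the message above
launch_cmd="srun --mpi=${mpi_type} ${app}"

if command -v srun >/dev/null 2>&1; then
    # List the MPI plugin types this Slurm installation supports.
    srun --mpi=list
    # Launch the application through srun with the chosen plugin.
    srun "--mpi=${mpi_type}" "${app}"
else
    # Outside a Slurm environment, only show what would be run.
    echo "srun not found; would run: ${launch_cmd}"
fi
```

If pmix does not appear in the srun --mpi=list output, this launch line would
not be expected to work as written.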

-Joachim
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org>
Sent: Thursday, June 15, 2023 11:56:28 PM
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [OMPI users] OpenMPI crashes with TCP connection error


My job immediately crashes with the error message below. I don’t know where
to begin looking for the cause of the error, or what information to provide
to help you understand it. Maybe you could clue me in 😊.



I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc 
8.5.0.

I built OpenMPI with the following “configure” command:

./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: n001
  PID:        985481



