Hi all,

We’re running a small Slurm dev cluster on Ubuntu and have been seeing
MPI/PMIx failures since upgrading Slurm from 23.02.5 to 23.11.3.

The first job step to use MPI within a job fails roughly 80% of the time, but
subsequent attempts to use MPI within the same job work fine. For the failing
job step we see this error after hitting the MPI timeout:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:741: 0x55a03f8d7a90: collective timeout seq=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: hpc-d-msh-01a02 [1]: pmixp_coll.c:286: Dumping collective state
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:759: 0x55a03f8d7a90: COLL_FENCE_RING state seq=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:762: my peerid: 1:hpc-d-msh-01a02
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:769: neighbor id: next 0:hpc-d-msh-01a01, prev 0:hpc-d-msh-01a01
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b08, #0, in-use=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b40, #1, in-use=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b78, #2, in-use=1
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:787:          seq=0 contribs: loc=1/prev=1/fwd=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:791:          neighbor contribs [2]:
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:824:                          done contrib: hpc-d-msh-01a01
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:826:                          wait contrib: -
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:828:          status=PMIXP_COLL_RING_FINILIZE
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:831:          buf (offset/size): 0/33362
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 31
[hpc-d-msh-01a01:47652] *** An error occurred in MPI_Send
[hpc-d-msh-01a01:47652] *** reported by process [683360612,0]
[hpc-d-msh-01a01:47652] *** on communicator MPI_COMM_WORLD
[hpc-d-msh-01a01:47652] *** MPI_ERR_OTHER: known error not in list
[hpc-d-msh-01a01:47652] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[hpc-d-msh-01a01:47652] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 1890.0 ON hpc-d-msh-01a01 CANCELLED AT 2024-02-14T16:14:52 ***


The Open MPI and PMIx versions have not changed, and downgrading Slurm to
23.02.5 seems to resolve the issue. We’d appreciate any pointers anyone might
have.
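
In case it helps anyone compare dumps across nodes or runs, here is a minimal
Python sketch that pulls the node, peer id, contribution counters and status
out of a ring-collective dump like the one above. The regular expressions are
assumptions based only on the lines quoted in this message (other Slurm
versions may log differently), and summarize_ring_dump is a hypothetical helper
name, not part of Slurm.

```python
import re

# Patterns are assumptions derived from the slurmstepd lines quoted above.
HOST_RE = re.compile(r"pmixp_coll_ring_log: (\S+) \[(\d+)\]:")
CONTRIB_RE = re.compile(r"seq=(\d+) contribs: loc=(\d+)/prev=(\d+)/fwd=(\d+)")
STATUS_RE = re.compile(r"status=(\w+)")


def summarize_ring_dump(text):
    """Return a dict for the first ring-collective dump in text, or None."""
    # Collapse all whitespace so mail-client line wrapping does not matter.
    flat = " ".join(text.split())
    host = HOST_RE.search(flat)
    contrib = CONTRIB_RE.search(flat)
    status = STATUS_RE.search(flat)
    if not (host and contrib and status):
        return None
    seq, loc, prev, fwd = (int(g) for g in contrib.groups())
    return {
        "node": host.group(1),
        "peer_id": int(host.group(2)),
        "seq": seq,
        "loc": loc,
        "prev": prev,
        "fwd": fwd,
        "status": status.group(1),
    }


# Two entries copied from the dump above, still wrapped as the mail client sent them.
SAMPLE = """\
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]:
pmixp_coll_ring.c:787:          seq=0 contribs: loc=1/prev=1/fwd=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]:
pmixp_coll_ring.c:828:          status=PMIXP_COLL_RING_FINILIZE
"""

if __name__ == "__main__":
    print(summarize_ring_dump(SAMPLE))
```

The counter names (loc, prev, fwd) are kept exactly as they appear in the log;
the sketch does not try to interpret them.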

Thanks

Oli

