Hi,
We are trying to enable PMIx support for OpenMPI in the OpenHPC project, but
we are running into issues.
Jobs submitted via Slurm and/or OpenPBS appear to hang in network calls when
PMIx is enabled. Without PMIx (i.e. no --with-pmix=...) the jobs execute
successfully. Everything also works fine when using MPICH
("module swap openmpi4 mpich/3.4.3-ofi").
$ ompi_info | grep -i pmix
Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static' '--enable-builtin-atomics' '--with-sge'
'--enable-mpi-cxx' '--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-pmix=/opt/ohpc/admin/pmix' '--with-libevent=external'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0' '--without-verbs'
'--with-tm=/opt/pbs/'
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)
In an environment with three bare-metal machines (one manager and two
compute nodes) managed by OpenPBS, "strace mpirun hostname" ends with:
...
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 21
setsockopt(21, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=5}, 8) = 0
connect(21, {sa_family=AF_INET, sin_port=htons(15003),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(21, "PKTV1\0\0\0\0, +2+22+26181.openhpc-o"..., 11307) = 11307
write(21, "PKTV1\0\0\0\0, +2+22+26181.openhpc-o"..., 11307) = 11307
ppoll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN},
{fd=20, events=POLLIN}], 4, NULL, NULL, 0
More complete output at https://pastebin.com/hmVeCifF
Another example is our simplified CI on GitHub Actions, where we use Slurm:
Log file: ./tests/rms-harness/tests/family-gnu12-openmpi4/test_harness.log
not ok 1 [RMS/harness] Verify zero exit code from MPI job runs OK
(slurm/gnu12/openmpi4)
(from function `run_mpi_binary' in file ./common/functions, line 399,
in test file test_harness, line 23)
`run_mpi_binary ./mpi_exit 0 $NODES $TASKS' failed
job script = /tmp/job.ohpc.18553
Batch job 6 submitted
Job 6 failed...
Reason=NonZeroExitCode
[prun] Master compute host = bd2a644aa87c
[prun] Resource manager = slurm
[prun] Launch cmd = srun --mpi=pmix ./mpi_exit 0 (family=openmpi4)
srun: launch/slurm: launch_p_step_launch: StepId=6.0 aborted before
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: c0: task 0: Killed
Any hints on how to debug this?
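So far we have only run the jobs as shown above; as a next step we were
thinking of raising MCA verbosity and double-checking the Slurm PMI setup,
roughly along these lines (flag names taken from the OpenMPI and Slurm
documentation; the exact verbosity levels are just a guess):

```shell
# On the OpenPBS side: rerun the hanging launch with verbose launch and
# PMIx framework output from OpenMPI's runtime.
mpirun --mca plm_base_verbose 10 --mca pmix_base_verbose 10 hostname

# On the Slurm side: confirm which MPI/PMI plugins srun actually offers
# and what the cluster default is.
srun --mpi=list
scontrol show config | grep -i mpi

# Verify the external PMIx installation OpenMPI was built against.
/opt/ohpc/admin/pmix/bin/pmix_info | grep -i version
```

If any of these already point at a known mismatch (e.g. Slurm's pmix plugin
built against a different PMIx than OpenMPI's ext3x component), that would
be useful to know too.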
OpenMPI 4.1.5 (
https://github.com/openhpc/ohpc/blob/3.x/components/mpi-families/openmpi/SPECS/openmpi.spec
)
PMIx 4.2.4 (
https://github.com/openhpc/ohpc/blob/3.x/components/rms/pmix/SPECS/pmix.spec
)
Regards,
Martin