Hi,

I’m testing the Open MPI 5.0.0 pre-release with the Slurm/PMIx setup currently on the NERSC Perlmutter system. First off, if I use the PRRte launch system, I don’t see the issue I’m raising here.
But many NERSC users prefer to use the srun “native” launch method with applications compiled against Open MPI, hence this email.

The SLURM version on Perlmutter is currently 2023.02.2. The PMIx version that the admins built slurm against is pmix-4.2.3. I’ve attached the output of pmix_info.

I’ve tested with Open MPI 5.0.0rc11 (or HEAD of 5.0.x), both with the PMIx embedded in Open MPI and with the external PMIx 4.2.3 install. I get the same results below whether my app is linked against the system PMIx or the embedded one.

My test application “works”, but if I use srun, I get these types of messages:

    srun -n 2 -N 2 --mpi=pmix ./ring_c
    [cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
    [cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
    [cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
    [cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
    [cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
    [cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
    [cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
    [cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
    [cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
    [cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417

After a lot of stracing and adding debug statements to the PMIx I have control over (the one embedded in the Open MPI tarball), I realized that these messages are not coming from the app, but from some transient process between the srun/slurmd processes and the application processes. The pids in these error messages are the parents of the MPI processes.

I’ve tried various things like turning off the PMIX GDS shmem, but that doesn’t help.
Also, I’ve toggled the various SLURM_PMIX env. variables, but to no effect.

This problem does not appear to be related to a recent slurm/pmix patch (https://bugs.schedmd.com/show_bug.cgi?id=16306#a0), and anyway it looks like that patch should be in 2023.02.2.

Another bit of info:

    scontrol show config | grep -i pmix
    PMIxCliTmpDirBase       = (null)
    PMIxCollFence           = (null)
    PMIxDebug               = 0
    PMIxDirectConn          = yes
    PMIxDirectConnEarly     = no
    PMIxDirectConnUCX       = no
    PMIxDirectSameArch      = no
    PMIxEnv                 = (null)
    PMIxFenceBarrier        = no
    PMIxNetDevicesUCX       = (null)
    PMIxTimeout             = 300
    PMIxTlsUCX              = (null)

Now, I myself don’t care too much about these messages. But for users they might be disconcerting, and they may also cause automated regression testing frameworks to report lots of errors.

Should I ask NERSC to file a ticket with SchedMD, or does someone know how to turn these messages off if in fact they are not important, or better yet, know why a slurm process may be emitting these errors and how to fix it?

Thanks for any ideas,

Howard

--
Howard Pritchard
Research Scientist
HPC-ENV
Los Alamos National Laboratory
howa...@lanl.gov

Package: PMIx root@runner-zf5ntxsi-project-87-concurrent-0 Distribution
PMIX: 4.2.3
PMIX repo revision: gitc5661387
PMIX release date: Feb 07, 2023
PMIX Standard: 4.2
PMIX Standard ABI: Stable (0.0), Provisional (0.0)
Prefix: /usr
Configured architecture: pmix.arch
Configure host: runner-zf5ntxsi-project-87-concurrent-0
Configured by: root
Configured on: Thu Mar 9 01:59:39 UTC 2023
Configure host: runner-zf5ntxsi-project-87-concurrent-0
Configure command line: '--host=x86_64-suse-linux-gnu'
    '--build=x86_64-suse-linux-gnu' '--program-prefix='
    '--disable-dependency-tracking' '--prefix=/usr'
    '--exec-prefix=/usr' '--bindir=/usr/bin'
    '--sbindir=/usr/sbin' '--sysconfdir=/etc'
    '--datadir=/usr/share' '--includedir=/usr/include'
    '--libdir=/usr/lib64' '--libexecdir=/usr/lib'
    '--localstatedir=/var' '--sharedstatedir=/var/lib'
    '--mandir=/usr/share/man' '--infodir=/usr/share/info'
    '--disable-dependency-tracking'
Built by:
Built on: Thu Mar 9 02:02:08 UTC 2023
Built host: runner-zf5ntxsi-project-87-concurrent-0
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: "7" "." "5" "." "0"
Internal debug support: no
dl support: yes
Symbol vis. support: yes
Manpages built: yes
MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v4.2.3)
MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pfexec: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.2.3)
MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.2.3)
MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA plog: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA preg: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA prm: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psec: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psec: none (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psec: munge (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v4.2.3)
MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v4.2.3)
MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v4.2.3)
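
On the regression-testing concern raised in the message: until the root cause is known, one possible stopgap is to separate these PMIx diagnostics from the rest of the captured job output so a test harness can count them without failing on them. The following is only a minimal sketch. It assumes the messages keep the exact shape quoted in the srun output, and it assumes they are benign, which has not been established. The names `PMIX_ERROR_RE` and `split_pmix_errors` are hypothetical and are not part of Slurm, PMIx, or Open MPI.

```python
import re
from collections import Counter

# Matches lines of the form quoted in the report, e.g.
# [cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
# (hypothetical helper pattern; the format is inferred from the quoted output only)
PMIX_ERROR_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] PMIX ERROR: (?P<error>[A-Z-]+) "
    r"in file (?P<file>\S+) at line (?P<line>\d+)\s*$"
)


def split_pmix_errors(output):
    """Split captured job output into ordinary lines and PMIx error counts.

    Returns (kept_lines, counts), where counts maps
    (error name, source file, line number) to the number of occurrences.
    """
    kept = []
    counts = Counter()
    for line in output.splitlines():
        match = PMIX_ERROR_RE.match(line)
        if match:
            key = (match["error"], match["file"], int(match["line"]))
            counts[key] += 1
        else:
            kept.append(line)
    return kept, counts


if __name__ == "__main__":
    # The two PMIX lines are copied from the report; the middle line is a
    # made-up stand-in for ordinary application output.
    sample = (
        "[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file "
        "base/bfrop_base_unpack.c at line 750\n"
        "ordinary application output\n"
        "[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file "
        "dstore_base.c at line 2624\n"
    )
    kept, counts = split_pmix_errors(sample)
    print("\n".join(kept))
    for (error, src, lineno), n in sorted(counts.items()):
        print(f"filtered {n} x {error} at {src}:{lineno}")
```

Keeping the counts rather than silently dropping the lines means a harness can still flag a run if a new error name or source location appears. This does nothing to explain why the process between srun/slurmd and the application emits the messages.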