Hi,

I’m testing a pre-release of Open MPI 5.0.0 with the Slurm/PMIx setup 
currently on the NERSC Perlmutter system.
First off, if I use the PRRTE launch system, I don’t see the issue I’m raising 
here.

But many NERSC users prefer to use the srun “native” launch method with 
applications compiled against Open MPI, hence this email.

The SLURM version on Perlmutter is currently 2023.02.2

The PMIx version that the admins used to build slurm against is pmix-4.2.3.  
I’ve attached the output of  pmix_info.

I’ve tested with Open MPI 5.0.0rc11 (or HEAD of 5.0.x), both with the PMIx 
embedded in Open MPI and with the external PMIx 4.2.3 install.
I get the same results below whether my app is linked against the system 
PMIx or the embedded one.

My test application “works” but if I use srun, I get these types of messages:

srun -n 2 -N 2 --mpi=pmix ./ring_c

[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417

After a lot of stracing and adding debug statements to the PMIx I have control 
over (the one embedded in the Open MPI tarball), I realized that these 
messages are not coming from the app, but from some transient process between 
the srun/slurmd processes and the application processes.
The PIDs in these error messages are the parents of the MPI processes.
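
To make the "which process is emitting this" check repeatable, here is a minimal Python sketch that groups the error lines by host and PID. It assumes each message arrives as a single unwrapped line in the format shown above. The names `parse_pmix_errors` and `count_by_process` are made up for this example and are not part of PMIx, Slurm, or Open MPI.

```python
import re
from collections import Counter

# Pattern for one unwrapped PMIx error line, based only on the messages
# quoted in this email (assumption: other PMIx versions may format differently).
PMIX_ERROR_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] PMIX ERROR: (?P<error>\S+) "
    r"in file (?P<file>\S+) at line (?P<line>\d+)$"
)


def parse_pmix_errors(text):
    """Return one dict per PMIx error line found in text."""
    records = []
    for raw in text.splitlines():
        match = PMIX_ERROR_RE.match(raw.strip())
        if match:
            record = match.groupdict()
            record["pid"] = int(record["pid"])
            record["line"] = int(record["line"])
            records.append(record)
    return records


def count_by_process(records):
    """Count error lines per (host, pid) pair."""
    return Counter((r["host"], r["pid"]) for r in records)


# Three of the lines from the srun output above, joined onto single lines.
SAMPLE = """\
[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
"""

if __name__ == "__main__":
    counts = count_by_process(parse_pmix_errors(SAMPLE))
    for (host, pid), n in sorted(counts.items()):
        print(f"{host} pid={pid}: {n} error line(s)")
```

The PIDs reported this way can then be compared with the parent PID of each MPI rank (for example, what `getppid()` returns inside the application) to confirm that the emitter is the parent and not the rank itself.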

I’ve tried various things like turning off the PMIx GDS shmem, but that doesn’t 
help.  I’ve also toggled the various SLURM_PMIX environment variables, to no 
effect.
This problem does not appear to be related to a recent Slurm/PMIx patch 
(https://bugs.schedmd.com/show_bug.cgi?id=16306#a0), and anyway it looks like 
that patch should be in 2023.02.2.

Another bit of info:

scontrol show config | grep -i pmix
PMIxCliTmpDirBase       = (null)
PMIxCollFence           = (null)
PMIxDebug               = 0
PMIxDirectConn          = yes
PMIxDirectConnEarly     = no
PMIxDirectConnUCX       = no
PMIxDirectSameArch      = no
PMIxEnv                 = (null)
PMIxFenceBarrier        = no
PMIxNetDevicesUCX       = (null)
PMIxTimeout             = 300
PMIxTlsUCX              = (null)

Now I myself don’t care too much about these messages.
But users might find them disconcerting, and they may cause automated 
regression testing frameworks to report lots of errors.
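
As a stopgap on the testing side, a framework could separate these specific lines from the rest of stderr before deciding whether a run failed. The sketch below is only an illustration and assumes the messages really are harmless, which is exactly what I have not established. The allow-list holds just the error/file pairs quoted in this email, and `split_stderr` is a hypothetical helper name.

```python
import re

# Error/file pairs seen in this report. Treating them as ignorable is an
# assumption, not something confirmed by the PMIx or Slurm developers.
KNOWN_NOISE = {
    ("OUT-OF-RESOURCE", "base/bfrop_base_unpack.c"),
    ("UNPACK-INADEQUATE-SPACE", "base/gds_base_fns.c"),
    ("UNPACK-INADEQUATE-SPACE", "dstore_base.c"),
    ("UNPACK-INADEQUATE-SPACE", "server/pmix_server.c"),
}

LINE_RE = re.compile(
    r"^\[[^:\]]+:\d+\] PMIX ERROR: (\S+) in file (\S+) at line \d+$"
)


def split_stderr(text):
    """Split stderr text into (kept, dropped) lists of lines.

    A line is dropped only if it is a PMIx error line whose (error, file)
    pair is on the allow-list; everything else is kept untouched.
    """
    kept, dropped = [], []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and (match.group(1), match.group(2)) in KNOWN_NOISE:
            dropped.append(raw)
        else:
            kept.append(raw)
    return kept, dropped


# The second and third lines are invented for illustration: an unrelated
# warning, and a PMIx error that is not on the allow-list and so is kept.
SAMPLE_STDERR = """\
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
hypothetical: unrelated warning from the test harness
[cn315:1037721] PMIX ERROR: NOT-FOUND in file some/other_file.c at line 1
"""

if __name__ == "__main__":
    kept, dropped = split_stderr(SAMPLE_STDERR)
    print("\n".join(kept))
    print(f"({len(dropped)} known PMIx line(s) set aside)")
```

Matching on the error/file pair and not on any line containing "PMIX ERROR" means a new or different PMIx failure would still show up in the test results.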

Should I ask NERSC to file a ticket with SchedMD? Or does someone know how to 
turn these messages off if they are in fact not important or, better yet, 
why a Slurm process may be emitting these errors and how to fix it?

Thanks for any ideas,

Howard


—


Howard Pritchard
Research Scientist
HPC-ENV

Los Alamos National Laboratory
howa...@lanl.gov




                 Package: PMIx root@runner-zf5ntxsi-project-87-concurrent-0
                          Distribution
                    PMIX: 4.2.3
      PMIX repo revision: gitc5661387
       PMIX release date: Feb 07, 2023
           PMIX Standard: 4.2
       PMIX Standard ABI: Stable (0.0), Provisional (0.0)
                  Prefix: /usr
 Configured architecture: pmix.arch
          Configure host: runner-zf5ntxsi-project-87-concurrent-0
           Configured by: root
           Configured on: Thu Mar  9 01:59:39 UTC 2023
          Configure host: runner-zf5ntxsi-project-87-concurrent-0
  Configure command line: '--host=x86_64-suse-linux-gnu'
                          '--build=x86_64-suse-linux-gnu' '--program-prefix='
                          '--disable-dependency-tracking' '--prefix=/usr'
                          '--exec-prefix=/usr' '--bindir=/usr/bin'
                          '--sbindir=/usr/sbin' '--sysconfdir=/etc'
                          '--datadir=/usr/share' '--includedir=/usr/include'
                          '--libdir=/usr/lib64' '--libexecdir=/usr/lib'
                          '--localstatedir=/var' '--sharedstatedir=/var/lib'
                          '--mandir=/usr/share/man'
                          '--infodir=/usr/share/info'
                          '--disable-dependency-tracking'
                Built by: 
                Built on: Thu Mar  9 02:02:08 UTC 2023
              Built host: runner-zf5ntxsi-project-87-concurrent-0
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: "7" "." "5" "." "0"
  Internal debug support: no
              dl support: yes
     Symbol vis. support: yes
          Manpages built: yes
              MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
           MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v4.2.3)
              MCA pfexec: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.3)
                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.3)
        MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v4.2.3)
        MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA prm: default (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: none (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                MCA psec: munge (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component
                          v4.2.3)
              MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v4.2.3)
             MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v4.2.3)
               MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v4.2.3)
                 MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v4.2.3)
                 MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v4.2.3)
