Hi Bennet

I suspect the problem here lies in the Slurm PMIx plugin. Slurm 17.11 supports 
PMIx v2.0 as well as (I believe) PMIx v1.2. I'm not sure whether Slurm is 
finding one of those on your system and building the plugin, but it looks like 
OMPI is picking up signs of PMIx being active, trying to use it, and hitting an 
incompatibility.
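
If you want to confirm what is actually being exposed, one quick check (just a
diagnostic suggestion) is to look for PMI-related environment variables inside
a job step:

    srun --ntasks=1 env | grep -i pmi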

You can test this by adding --mpi=pmi2 to your srun command line and seeing 
whether that solves the problem. You may also need to set OMPI_MCA_pmix=s2 in 
your environment, since Slurm has a tendency to publish PMIx envars even when 
they aren't being used.
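
For example, something along these lines in the batch script (a sketch; adjust
to your setup):

    export OMPI_MCA_pmix=s2
    srun --mpi=pmi2 ./hello-mpi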



> On Nov 29, 2017, at 5:44 AM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> Howard,
> 
> Thanks very much for the help identifying what information I should provide.
> 
> This is some information about our SLURM version
> 
> $ srun --mpi list
> srun: MPI types are...
> srun: pmi2
> srun: pmix_v1
> srun: openmpi
> srun: pmix
> srun: none
> 
> $ srun --version
> slurm 17.11.0-0rc3
> 
> This is the output from my build script, which should show all the
> configure options I used.
> 
> Checking compilers and things
> OMPI is ompi
> COMP_NAME is gcc_4_8_5
> SRC_ROOT is /sw/src/arcts
> PREFIX_ROOT is /sw/arcts/centos7/apps
> PREFIX is /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2
> CONFIGURE_FLAGS are --disable-dlopen --enable-shared
> COMPILERS are CC=gcc CXX=g++ FC=gfortran F77=gfortran
> No modules loaded
> gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
> Copyright (C) 2015 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> ./configure
>     --prefix=/sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2
>     --mandir=/sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/share/man
>     --with-slurm
>     --with-pmi
>     --with-lustre
>     --with-verbs
>     --disable-dlopen --enable-shared
>     CC=gcc CXX=g++ FC=gfortran F77=gfortran
> 
> I remove the build directory and re-expand from the source tarball for
> each build, so there should not be lingering configuration files from
> prior trials.
> 
> Here is the output of
> 
>   ompi_info | grep pmix
> 
>                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v2.1.2)
>                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v2.1.2)
>                MCA pmix: pmix112 (MCA v2.1.0, API v2.0.0, Component v2.1.2)
>           MCA pmix base: ---------------------------------------------------
>           MCA pmix base: parameter "pmix" (current value: "", data source: default, level: 2 user/detail, type: string)
>                          Default selection set of components for the pmix framework (<none> means use all components that can be found)
>           MCA pmix base: ---------------------------------------------------
>           MCA pmix base: parameter "pmix_base_verbose" (current value: "error", data source: default, level: 8 dev/detail, type: int)
>                          Verbosity level for the pmix framework (default: 0)
>           MCA pmix base: parameter "pmix_base_async_modex" (current value: "false", data source: default, level: 9 dev/all, type: bool)
>           MCA pmix base: parameter "pmix_base_collect_data" (current value: "true", data source: default, level: 9 dev/all, type: bool)
>             MCA pmix s2: ---------------------------------------------------
>             MCA pmix s2: parameter "pmix_s2_priority" (current value: "20", data source: default, level: 9 dev/all, type: int)
>                          Priority of the pmix s2 component (default: 20)
>             MCA pmix s1: ---------------------------------------------------
>             MCA pmix s1: parameter "pmix_s1_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
>                          Priority of the pmix s1 component (default: 10)
> 
> I also attach the hello-mpi.c file I am using as a test.  I compiled it using
> 
> $ mpicc -o hello-mpi hello-mpi.c
> 
> and this is the information about the actual compile command
> 
> $ mpicc --showme -o hello-mpi hello-mpi.c
> gcc -o hello-mpi hello-mpi.c
> -I/sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/include -pthread
> -L/usr/lib64 -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath
> -Wl,/sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib
> -Wl,--enable-new-dtags
> -L/sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib -lmpi
> 
> 
> I use some variation on the following submit script
> 
> test.slurm
> -------------------------------------
> $ cat test.slurm
> #!/bin/bash
> #SBATCH -J JOBNAME
> #SBATCH --mail-user=ben...@umich.edu
> #SBATCH --mail-type=NONE
> 
> #SBATCH -N 2
> #SBATCH --ntasks-per-node=1
> #SBATCH --mem-per-cpu=1g
> #SBATCH --cpus-per-task=1
> #SBATCH -A hpcstaff
> #SBATCH -p standard
> 
> #Your code here
> 
> cd /home/bennet/hello
> srun ./hello-mpi
> -------------------------------------
> 
> The results are attached as slurm-114.out, where it looks to me like
> it is trying to invoke pmi2 instead of pmix.
> 
> If I use `srun --mpi pmix ./hello-mpi` in the file submitted to SLURM,
> I get a core dump.
> 
> [bn1.stage.arc-ts.umich.edu:34722] PMIX ERROR: BAD-PARAM in file src/dstore/pmix_esh.c at line 996
> [bn2.stage.arc-ts.umich.edu:04597] PMIX ERROR: BAD-PARAM in file src/dstore/pmix_esh.c at line 996
> [bn1:34722] *** Process received signal ***
> [bn1:34722] Signal: Segmentation fault (11)
> [bn1:34722] Signal code: Invalid permissions (2)
> [bn1:34722] Failing at address: 0xcf73a0
> [bn1:34722] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b2420b1d370]
> [bn1:34722] [ 1] [0xcf73a0]
> [bn1:34722] *** End of error message ***
> [bn2:04597] *** Process received signal ***
> [bn2:04597] Signal: Segmentation fault (11)
> [bn2:04597] Signal code:  (128)
> [bn2:04597] Failing at address: (nil)
> [bn2:04597] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2ab526447370]
> [bn2:04597] [ 1] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libopen-pal.so.20(+0x12291b)[0x2ab52706291b]
> [bn2:04597] [ 2] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libopen-pal.so.20(PMIx_Init+0x82)[0x2ab527052e32]
> [bn2:04597] [ 3] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libopen-pal.so.20(pmix1_client_init+0x62)[0x2ab52703b052]
> [bn2:04597] [ 4] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libopen-rte.so.20(+0x6786b)[0x2ab526c9686b]
> [bn2:04597] [ 5] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libopen-rte.so.20(orte_init+0x225)[0x2ab526c54165]
> [bn2:04597] [ 6] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libmpi.so.20(ompi_mpi_init+0x305)[0x2ab5260ab4d5]
> [bn2:04597] [ 7] /sw/arcts/centos7/apps/gcc_4_8_5/openmpi/2.1.2/lib/libmpi.so.20(MPI_Init+0x83)[0x2ab5260c95b3]
> [bn2:04597] [ 8] /home/bennet/hello/./hello-mpi[0x4009d5]
> [bn2:04597] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab526675b35]
> [bn2:04597] [10] /home/bennet/hello/./hello-mpi[0x4008d9]
> [bn2:04597] *** End of error message ***
> srun: error: bn1: task 0: Segmentation fault (core dumped)
> srun: error: bn2: task 1: Segmentation fault (core dumped)
> 
> 
> If I use `srun --mpi openmpi` in the submit script, the job hangs, and
> when I cancel it, I get
> 
> [bn2.stage.arc-ts.umich.edu:04855] PMI_Init [pmix_s1.c:162:s1_init]: PMI is not initialized
> [bn1.stage.arc-ts.umich.edu:35000] PMI_Init [pmix_s1.c:162:s1_init]: PMI is not initialized
> [warn] opal_libevent2022_event_active: event has no event_base set.
> [warn] opal_libevent2022_event_active: event has no event_base set.
> slurmstepd: error: *** STEP 116.0 ON bn1 CANCELLED AT 2017-11-29T08:42:54 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** JOB 116 ON bn1 CANCELLED AT 2017-11-29T08:42:54 ***
> 
> Any thoughts you might have on this would be very much appreciated.
> 
> Thanks,  -- bennet
> 
> 
> 
> On Fri, Nov 17, 2017 at 11:45 PM, Howard Pritchard <hpprit...@gmail.com> 
> wrote:
>> Hello Bennet,
>> 
>> What you are trying to do using srun as the job launcher should work.  Could
>> you post the contents
>> of /etc/slurm/slurm.conf for your system?
>> 
>> Could you also post the output of the following command:
>> 
>> ompi_info --all | grep pmix
>> 
>> to the mail list.
>> 
>> The config.log from your build would also be useful.
>> 
>> Howard
>> 
>> 
>> 2017-11-16 9:30 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
>>> 
>>> What Charles said was true but not quite complete. We still support the
>>> older PMI libraries but you likely have to point us to wherever slurm put
>>> them.
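>>> 
>>> For example, when configuring OMPI, something like the following (the path
>>> is only a placeholder for wherever your Slurm installation keeps its PMI
>>> headers and libraries):
>>> 
>>>    ./configure --with-pmi=/opt/slurm ...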
>>> 
>>> However, we definitely recommend using PMIx, as you will get a faster launch.
>>> 
>>> Sent from my iPad
>>> 
>>>> On Nov 16, 2017, at 9:11 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>> 
>>>> Charlie,
>>>> 
>>>> Thanks a ton!  Yes, we are missing two of the three steps.
>>>> 
>>>> Will report back after we get pmix installed and after we rebuild
>>>> Slurm.  We do have a new enough version of it, at least, so we might
>>>> have missed the target, but we did at least hit the barn.  ;-)
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor <chas...@ufl.edu>
>>>>> wrote:
>>>>> Hi Bennet,
>>>>> 
>>>>> Three things...
>>>>> 
>>>>> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
>>>>> 
>>>>> 2. You will need slurm 16.05 or greater built with --with-pmix
>>>>> 
>>>>> 2a. You will need pmix 1.1.5, which you can get from GitHub
>>>>> (https://github.com/pmix/tarballs).
>>>>> 
>>>>> 3. then, to launch your mpi tasks on the allocated resources,
>>>>> 
>>>>>  srun --mpi=pmix ./hello-mpi
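>>>>> 
>>>>> For reference, the build sequence might look roughly like this (a sketch
>>>>> only; the install prefix is a placeholder and versions may differ):
>>>>> 
>>>>>  # in the pmix source tree:
>>>>>  ./configure --prefix=/opt/pmix/1.1.5 && make && make install
>>>>>  # in the slurm source tree, pointing at that pmix install:
>>>>>  ./configure --with-pmix=/opt/pmix/1.1.5 && make && make install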
>>>>> 
>>>>> I’m replying to the list because,
>>>>> 
>>>>> a) this information is harder to find than you might think.
>>>>> b) someone/anyone can correct me if I'm giving a bum steer.
>>>>> 
>>>>> Hope this helps,
>>>>> 
>>>>> Charlie Taylor
>>>>> University of Florida
>>>>> 
>>>>> On Nov 16, 2017, at 10:34 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>>> 
>>>>> I think that OpenMPI is supposed to support SLURM integration such that
>>>>> 
>>>>>  srun ./hello-mpi
>>>>> 
>>>>> should work?  I built OMPI 2.1.2 with
>>>>> 
>>>>> export CONFIGURE_FLAGS='--disable-dlopen --enable-shared'
>>>>> export COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'
>>>>> 
>>>>> CMD="./configure \
>>>>>  --prefix=${PREFIX} \
>>>>>  --mandir=${PREFIX}/share/man \
>>>>>  --with-slurm \
>>>>>  --with-pmi \
>>>>>  --with-lustre \
>>>>>  --with-verbs \
>>>>>  $CONFIGURE_FLAGS \
>>>>>  $COMPILERS"
>>>>> 
>>>>> I have a simple hello-mpi.c (source included below), which compiles
>>>>> and runs with mpirun, both on the login node and in a job.  However,
>>>>> when I try to use srun in place of mpirun, I get instead a hung job,
>>>>> which upon cancellation produces this output.
>>>>> 
>>>>> [bn2.stage.arc-ts.umich.edu:116377] PMI_Init [pmix_s1.c:162:s1_init]: PMI is not initialized
>>>>> [bn1.stage.arc-ts.umich.edu:36866] PMI_Init [pmix_s1.c:162:s1_init]: PMI is not initialized
>>>>> [warn] opal_libevent2022_event_active: event has no event_base set.
>>>>> [warn] opal_libevent2022_event_active: event has no event_base set.
>>>>> slurmstepd: error: *** STEP 86.0 ON bn1 CANCELLED AT 2017-11-16T10:03:24 ***
>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>> slurmstepd: error: *** JOB 86 ON bn1 CANCELLED AT 2017-11-16T10:03:24 ***
>>>>> 
>>>>> The SLURM web page suggests that OMPI 2.x and later support PMIx and says
>>>>> to use `srun --mpi=pmix`; however, that no longer seems to be an option,
>>>>> and using the `openmpi` type isn't working (neither is pmi2).
>>>>> 
>>>>> [bennet@beta-build hello]$ srun --mpi=list
>>>>> srun: MPI types are...
>>>>> srun: mpi/pmi2
>>>>> srun: mpi/lam
>>>>> srun: mpi/openmpi
>>>>> srun: mpi/mpich1_shmem
>>>>> srun: mpi/none
>>>>> srun: mpi/mvapich
>>>>> srun: mpi/mpich1_p4
>>>>> srun: mpi/mpichgm
>>>>> srun: mpi/mpichmx
>>>>> 
>>>>> To get the Intel PMI to work with srun, I have to set
>>>>> 
>>>>>  I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
>>>>> 
>>>>> Is there a comparable environment variable that must be set to enable
>>>>> `srun` to work?
>>>>> 
>>>>> Am I missing a build option or misspecifying one?
>>>>> 
>>>>> -- bennet
>>>>> 
>>>>> 
>>>>> Source of hello-mpi.c
>>>>> ==========================================
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include "mpi.h"
>>>>> 
>>>>> int main(int argc, char **argv){
>>>>> 
>>>>> int rank;          /* rank of process */
>>>>> int numprocs;      /* size of COMM_WORLD */
>>>>> int namelen;
>>>>> int tag=10;        /* expected tag */
>>>>> int message;       /* Recv'd message */
>>>>> char processor_name[MPI_MAX_PROCESSOR_NAME];
>>>>> MPI_Status status; /* status of recv */
>>>>> 
>>>>> /* call Init, size, and rank */
>>>>> MPI_Init(&argc, &argv);
>>>>> MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>> MPI_Get_processor_name(processor_name, &namelen);
>>>>> 
>>>>> printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
>>>>> 
>>>>> if(rank != 0){
>>>>>  MPI_Recv(&message,    /*buffer for message */
>>>>>                  1,    /*MAX count to recv */
>>>>>            MPI_INT,    /*type to recv */
>>>>>                  0,    /*recv from 0 only */
>>>>>                tag,    /*tag of message */
>>>>>     MPI_COMM_WORLD,    /*communicator to use */
>>>>>            &status);   /*status object */
>>>>>  printf("Hello from process %d!\n",rank);
>>>>> }
>>>>> else{
>>>>>  /* rank 0 ONLY executes this */
>>>>>  printf("MPI_COMM_WORLD is %d processes big!\n", numprocs);
>>>>>  int x;
>>>>>  for(x=1; x<numprocs; x++){
>>>>>     MPI_Send(&x,          /*send x to process x */
>>>>>               1,          /*number to send */
>>>>>         MPI_INT,          /*type to send */
>>>>>               x,          /*rank to send to */
>>>>>             tag,          /*tag for message */
>>>>>   MPI_COMM_WORLD);        /*communicator to use */
>>>>>  }
>>>>> } /* end else */
>>>>> 
>>>>> 
>>>>> /* always call at end */
>>>>> MPI_Finalize();
>>>>> 
>>>>> return 0;
>>>>> }
> <config.log><slurm.conf><hello-mpi.c><slurm-114.out>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
