Lawrence,

My apologies for the incorrect link.
The patch is at https://github.com/open-mpi/ompi/pull/6672.patch
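
For reference, here is a rough sketch of one way to fetch and apply it to
your 4.0.1 source tree (the source directory below and the -p1 strip level
are assumptions, so check with --dry-run first and adjust for your layout):

    # assumed location of the Open MPI 4.0.1 source tree used for the build
    cd ~/src/openmpi-4.0.1
    curl -LO https://github.com/open-mpi/ompi/pull/6672.patch
    # verify the patch applies cleanly, then apply it for real
    patch -p1 --dry-run < 6672.patch && patch -p1 < 6672.patch
    # rebuild and reinstall into the existing prefix
    make -j8 all && make install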

Cheers,

Gilles

On Sat, Jul 6, 2019 at 12:21 PM Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
>
> Lawrence,
>
> This is a known issue (the --cpu_bind option was removed from SLURM in
> favor of the --cpu-bind option) and the fix will be included in an
> upcoming Open MPI 4.0.x release.
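>
> For illustration only (a sketch; the exact SLURM release that dropped the
> old spelling may differ on your system), newer srun accepts only the
> dashed form:
>
>     srun --cpu-bind=none -N 1 -n 1 hostname    # accepted
>     srun --cpu_bind=none -N 1 -n 1 hostname    # srun: unrecognized option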
>
> Meanwhile, you can manually download and apply the patch at
> https://github.com/open-mpi/ompi/pull/6445.patch or use a nightly
> build of the v4.0.x branch.
>
> Cheers,
>
> Gilles
>
> On Sat, Jul 6, 2019 at 5:35 AM Sorrillo, Lawrence via devel
> <devel@lists.open-mpi.org> wrote:
> >
> > Hello,
> >
> > My attempt to run and troubleshoot an OMPI job under a SLURM allocation
> > does not work as I would expect.
> >
> > The output below has led me to believe that, under the hood, this setup
> > (SLURM with OMPI) is not passing the correct srun options when I call
> > mpirun directly.
> >
> > Specifically, the "--cpu_bind=none" option is breaking srun, but it also
> > looks like the nodelist is incorrect.
> >
> > The job script:
> >
> > #!/bin/bash
> >
> > #SBATCH --job-name=mpi-hostname
> > #SBATCH --partition=dev
> > #SBATCH --account=Account1
> > #SBATCH --time=01:00:00
> > #SBATCH --nodes=2
> > #SBATCH --ntasks-per-node=1
> > #SBATCH --begin=now+10
> > #SBATCH --output="%x-%u-%N-%j.txt"      # jobName-userId-hostName-jobId.txt
> >
> > # ---------------------------------------------------------------------- #
> > #module load DefApps
> > #module purge >/dev/null 2>&1
> > ##module load staging/slurm >/dev/null 2>&1
> > #module load  gcc/4.8.5 openmpi >/dev/null 2>&1
> > #module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
> > #
> > # ---------------------------------------------------------------------- #
> > #
> > MPI_RUN=$(which orterun)
> > if [[ -z "${MPI_RUN:+x}" ]]; then
> >   echo "ERROR: Cannot find 'mpirun' executable..exiting"
> >   exit 1
> > fi
> >
> > echo
> > #CMD="orterun  -npernode 1 -np 2  /bin/hostname"
> > #CMD="srun /bin/hostname"
> > #CMD="srun -N2 -n2 --mpi=pmi2  /bin/hostname"
> > #MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
> > MCMD="/sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
> > echo "INFO: Executing the command: $MCMD"
> > $MCMD
> > sync
> >
> > Here is the output:
> >
> > user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt
> >
> > INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname
> > [node513:32514] mca: base: components_register: registering framework plm components
> > [node513:32514] mca: base: components_register: found loaded component isolated
> > [node513:32514] mca: base: components_register: component isolated has no register or open function
> > [node513:32514] mca: base: components_register: found loaded component rsh
> > [node513:32514] mca: base: components_register: component rsh register function successful
> > [node513:32514] mca: base: components_register: found loaded component slurm
> > [node513:32514] mca: base: components_register: component slurm register function successful
> > [node513:32514] mca: base: components_register: found loaded component tm
> > [node513:32514] mca: base: components_register: component tm register function successful
> > [node513:32514] mca: base: components_open: opening plm components
> > [node513:32514] mca: base: components_open: found loaded component isolated
> > [node513:32514] mca: base: components_open: component isolated open function successful
> > [node513:32514] mca: base: components_open: found loaded component rsh
> > [node513:32514] mca: base: components_open: component rsh open function successful
> > [node513:32514] mca: base: components_open: found loaded component slurm
> > [node513:32514] mca: base: components_open: component slurm open function successful
> > [node513:32514] mca: base: components_open: found loaded component tm
> > [node513:32514] mca: base: components_open: component tm open function successful
> > [node513:32514] mca:base:select: Auto-selecting plm components
> > [node513:32514] mca:base:select:(  plm) Querying component [isolated]
> > [node513:32514] mca:base:select:(  plm) Query of component [isolated] set priority to 0
> > [node513:32514] mca:base:select:(  plm) Querying component [rsh]
> > [node513:32514] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> > [node513:32514] mca:base:select:(  plm) Querying component [slurm]
> > [node513:32514] mca:base:select:(  plm) Query of component [slurm] set priority to 75
> > [node513:32514] mca:base:select:(  plm) Querying component [tm]
> > [node513:32514] mca:base:select:(  plm) Selected component [slurm]
> > [node513:32514] mca: base: close: component isolated closed
> > [node513:32514] mca: base: close: unloading component isolated
> > [node513:32514] mca: base: close: component rsh closed
> > [node513:32514] mca: base: close: unloading component rsh
> > [node513:32514] mca: base: close: component tm closed
> > [node513:32514] mca: base: close: unloading component tm
> > [node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
> >         srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
> > 10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
> > srun: unrecognized option '--cpu_bind=none'
> > srun: unrecognized option '--cpu_bind=none'
> > Try "srun --help" for more information
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> > [node513:32514] mca: base: close: component slurm closed
> > [node513:32514] mca: base: close: unloading component slurm
> > user1@node-login7g:~/git/slurm-jobs$
> >
> > Here is some ompi_info output:
> >
> >                  Package: Open MPI user1@node-login7g Distribution
> >                 Open MPI: 4.0.1
> >   Open MPI repo revision: v4.0.1
> >    Open MPI release date: Mar 26, 2019
> >                 Open RTE: 4.0.1
> >   Open RTE repo revision: v4.0.1
> >    Open RTE release date: Mar 26, 2019
> >                     OPAL: 4.0.1
> >       OPAL repo revision: v4.0.1
> >        OPAL release date: Mar 26, 2019
> >                  MPI API: 3.1.0
> >             Ident string: 4.0.1
> >                   Prefix: /sw/dev/openmpi401
> >  Configured architecture: x86_64-unknown-linux-gnu
> >           Configure host: node-login7g
> >            Configured by: user1
> >            Configured on: Wed Jul  3 12:13:45 EDT 2019
> >           Configure host: node-login7g
> >   Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' '--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' '--without-load-leveler' '--disable-memchecker' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
> >                 Built by: user1
> >                 Built on: Wed Jul  3 12:24:11 EDT 2019
> >               Built host: node-login7g
> >               C bindings: yes
> >             C++ bindings: yes
> >              Fort mpif.h: yes (all)
> >             Fort use mpi: yes (limited: overloading)
> >        Fort use mpi size: deprecated-ompi-info-value
> >         Fort use mpi_f08: no
> >  Fort mpi_f08 compliance: The mpi_f08 module was not built
> >   Fort mpi_f08 subarrays: no
> >            Java bindings: no
> >   Wrapper compiler rpath: runpath
> >               C compiler: gcc
> >      C compiler absolute: /usr/bin/gcc
> >   C compiler family name: GNU
> >       C compiler version: 4.8.5
> >             C++ compiler: g++
> >    C++ compiler absolute: /usr/bin/g++
> >            Fort compiler: gfortran
> >        Fort compiler abs: /usr/bin/gfortran
> >
> > …..

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
