Lawrence,

this is a known issue (the --cpu_bind option was removed from SLURM in
favor of the --cpu-bind option) and the fix will be available in Open
MPI 4.0.2

Meanwhile, you can manually download and apply the patch at
https://github.com/open-mpi/ompi/pull/6445.patch or use a nightly
build of the v4.0.x branch.
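
If you go the patch route, something along these lines should do it (the
source path below is only a placeholder, and it assumes the tree is still
configured from your original 4.0.1 build):

  cd /path/to/openmpi-4.0.1      # the 4.0.1 source tree you built from (example path)
  curl -LO https://github.com/open-mpi/ompi/pull/6445.patch
  patch -p1 < 6445.patch
  make all install               # rebuild and reinstall into your existing prefix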

Cheers,

Gilles

On Sat, Jul 6, 2019 at 5:35 AM Sorrillo, Lawrence via devel
<devel@lists.open-mpi.org> wrote:
>
> Hello,
>
>
>
> My attempt to run and troubleshoot an OMPI job under a SLURM allocation
> does not work as I would expect.
>
> The result below has led me to believe that, under the hood in this setup
> (SLURM with OMPI), the correct srun options are not being used when I call
> mpirun directly.
>
>
>
> Specifically, the “--cpu_bind=none” option is breaking, but it also looks
> like the nodelist is incorrect.
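>
> As a quick sanity check (just a sketch of what I plan to run inside the
> same allocation, not actual output from my system), I can compare the two
> option spellings and the allocated nodelist directly with srun:
>
>   srun --version                                 # SLURM release in use
>   echo "$SLURM_JOB_NODELIST"                     # nodes SLURM actually allocated
>   srun --cpu_bind=none -N2 -n2 /bin/hostname     # old spelling, the one mpirun passes
>   srun --cpu-bind=none -N2 -n2 /bin/hostname     # new spelling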
>
>
>
> The job script:
>
>
>
> #!/bin/bash
>
> #SBATCH --job-name=mpi-hostname
> #SBATCH --partition=dev
> #SBATCH --account=Account1
> #SBATCH --time=01:00:00
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
> #SBATCH --begin=now+10
> #SBATCH --output="%x-%u-%N-%j.txt"      # jobName-userId-hostName-jobId.txt
>
> # ---------------------------------------------------------------------- #
> #module load DefApps
> #module purge >/dev/null 2>&1
> ##module load staging/slurm >/dev/null 2>&1
> #module load  gcc/4.8.5 openmpi >/dev/null 2>&1
> #module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
> #
> # ---------------------------------------------------------------------- #
> #
> MPI_RUN=$(which orterun)
> if [[ -z "${MPI_RUN:+x}" ]]; then
>   echo "ERROR: Cannot find 'mpirun' executable..exiting"
>   exit 1
> fi
>
> echo
> #CMD="orterun  -npernode 1 -np 2  /bin/hostname"
> #CMD="srun /bin/hostname"
> #CMD="srun -N2 -n2 --mpi=pmi2  /bin/hostname"
> #MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
> MCMD="/sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname"
> echo "INFO: Executing the command: $MCMD"
> $MCMD
> sync
>
>
>
> Here is the output:
>
>
>
> user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt
>
> INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun  --report-bindings  -mca btl openib,self -mca plm_base_verbose 10  /bin/hostname
> [node513:32514] mca: base: components_register: registering framework plm components
> [node513:32514] mca: base: components_register: found loaded component isolated
> [node513:32514] mca: base: components_register: component isolated has no register or open function
> [node513:32514] mca: base: components_register: found loaded component rsh
> [node513:32514] mca: base: components_register: component rsh register function successful
> [node513:32514] mca: base: components_register: found loaded component slurm
> [node513:32514] mca: base: components_register: component slurm register function successful
> [node513:32514] mca: base: components_register: found loaded component tm
> [node513:32514] mca: base: components_register: component tm register function successful
> [node513:32514] mca: base: components_open: opening plm components
> [node513:32514] mca: base: components_open: found loaded component isolated
> [node513:32514] mca: base: components_open: component isolated open function successful
> [node513:32514] mca: base: components_open: found loaded component rsh
> [node513:32514] mca: base: components_open: component rsh open function successful
> [node513:32514] mca: base: components_open: found loaded component slurm
> [node513:32514] mca: base: components_open: component slurm open function successful
> [node513:32514] mca: base: components_open: found loaded component tm
> [node513:32514] mca: base: components_open: component tm open function successful
> [node513:32514] mca:base:select: Auto-selecting plm components
> [node513:32514] mca:base:select:(  plm) Querying component [isolated]
> [node513:32514] mca:base:select:(  plm) Query of component [isolated] set priority to 0
> [node513:32514] mca:base:select:(  plm) Querying component [rsh]
> [node513:32514] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [node513:32514] mca:base:select:(  plm) Querying component [slurm]
> [node513:32514] mca:base:select:(  plm) Query of component [slurm] set priority to 75
> [node513:32514] mca:base:select:(  plm) Querying component [tm]
> [node513:32514] mca:base:select:(  plm) Selected component [slurm]
> [node513:32514] mca: base: close: component isolated closed
> [node513:32514] mca: base: close: unloading component isolated
> [node513:32514] mca: base: close: component rsh closed
> [node513:32514] mca: base: close: unloading component rsh
> [node513:32514] mca: base: close: component tm closed
> [node513:32514] mca: base: close: unloading component tm
> [node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
>         srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
> 10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
> srun: unrecognized option '--cpu_bind=none'
> srun: unrecognized option '--cpu_bind=none'
> Try "srun --help" for more information
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [node513:32514] mca: base: close: component slurm closed
> [node513:32514] mca: base: close: unloading component slurm
> user1@node-login7g:~/git/slurm-jobs$
>
>
>
> Here is some ompi_info output:
>
>
>
>                  Package: Open MPI user1@node-login7g Distribution
>                 Open MPI: 4.0.1
>   Open MPI repo revision: v4.0.1
>    Open MPI release date: Mar 26, 2019
>                 Open RTE: 4.0.1
>   Open RTE repo revision: v4.0.1
>    Open RTE release date: Mar 26, 2019
>                     OPAL: 4.0.1
>       OPAL repo revision: v4.0.1
>        OPAL release date: Mar 26, 2019
>                  MPI API: 3.1.0
>             Ident string: 4.0.1
>                   Prefix: /sw/dev/openmpi401
>  Configured architecture: x86_64-unknown-linux-gnu
>           Configure host: node-login7g
>            Configured by: user1
>            Configured on: Wed Jul  3 12:13:45 EDT 2019
>           Configure host: node-login7g
>   Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' '--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' '--without-load-leveler' '--disable-memchecker' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
>                 Built by: user1
>                 Built on: Wed Jul  3 12:24:11 EDT 2019
>               Built host: node-login7g
>               C bindings: yes
>             C++ bindings: yes
>              Fort mpif.h: yes (all)
>             Fort use mpi: yes (limited: overloading)
>        Fort use mpi size: deprecated-ompi-info-value
>         Fort use mpi_f08: no
>  Fort mpi_f08 compliance: The mpi_f08 module was not built
>   Fort mpi_f08 subarrays: no
>            Java bindings: no
>   Wrapper compiler rpath: runpath
>               C compiler: gcc
>      C compiler absolute: /usr/bin/gcc
>   C compiler family name: GNU
>       C compiler version: 4.8.5
>             C++ compiler: g++
>    C++ compiler absolute: /usr/bin/g++
>            Fort compiler: gfortran
>        Fort compiler abs: /usr/bin/gfortran
>
> …..
>
>
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
