Lawrence, this is a known issue (the --cpu_bind option was removed from SLURM in favor of the --cpu-bind option) and the fix will be available in Open MPI 4.0.2.
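A quick way to confirm the mismatch on a given cluster is to try both spellings directly with srun from inside an allocation; this is only an illustrative check, not part of the fix:

    srun --cpu_bind=none /bin/hostname   # old spelling: on recent SLURM this fails with "srun: unrecognized option '--cpu_bind=none'"
    srun --cpu-bind=none /bin/hostname   # new spelling: accepted, prints the allocated hostnames

The patch essentially makes mpirun generate the second spelling instead of the first.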
Meanwhile, you can manually download and apply the patch at https://github.com/open-mpi/ompi/pull/6445.patch or use a nightly build of the v4.0.x branch.
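If you go the patch route, the steps are roughly as follows; this is only a sketch, the source directory is a hypothetical placeholder, and you should rebuild with the same configure options as your existing /sw/dev/openmpi401 install (they are listed in the ompi_info output below):

    # sketch: fetch the patch from the pull request and apply it to the 4.0.1 source tree
    cd /path/to/openmpi-4.0.1            # hypothetical: wherever the 4.0.1 sources were built
    wget https://github.com/open-mpi/ompi/pull/6445.patch
    patch -p1 < 6445.patch
    # rebuild and reinstall into the existing prefix with the original configure options
    ./configure --prefix=/sw/dev/openmpi401 --with-slurm --with-verbs   # plus the rest of your original options
    make -j 8 && make install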
Cheers,

Gilles

On Sat, Jul 6, 2019 at 5:35 AM Sorrillo, Lawrence via devel <[email protected]> wrote:
>
> Hello,
>
> My attempt to run and troubleshoot an OMPI job under a Slurm allocation
> does not work as I would expect. The result below has led me to believe
> that under the hood, in this setup (SLURM with OMPI), the correct srun
> options are not being used when I call mpirun directly.
>
> Specifically, the "cpu-bind=none" option is breaking, but it also looks
> like the nodelist is incorrect.
>
> The job script:
>
> #!/bin/bash
>
> #SBATCH --job-name=mpi-hostname
> #SBATCH --partition=dev
> #SBATCH --account=Account1
> #SBATCH --time=01:00:00
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
> #SBATCH --begin=now+10
> #SBATCH --output="%x-%u-%N-%j.txt" # jobName-userId-hostName-jobId.txt
>
> # ---------------------------------------------------------------------- #
> #module load DefApps
> #module purge >/dev/null 2>&1
> ##module load staging/slurm >/dev/null 2>&1
> #module load gcc/4.8.5 openmpi >/dev/null 2>&1
> #module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
> #
> # ---------------------------------------------------------------------- #
> #
> MPI_RUN=$(which orterun)
> if [[ -z "${MPI_RUN:+x}" ]]; then
>     echo "ERROR: Cannot find 'mpirun' executable..exiting"
>     exit 1
> fi
>
> echo
> #CMD="orterun -npernode 1 -np 2 /bin/hostname"
> #CMD="srun /bin/hostname"
> #CMD="srun -N2 -n2 --mpi=pmi2 /bin/hostname"
> #MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
> MCMD="/sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
> echo "INFO: Executing the command: $MCMD"
> $MCMD
> sync
>
> Here is the output:
>
> user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt
>
> INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname
> [node513:32514] mca: base: components_register: registering framework plm components
> [node513:32514] mca: base: components_register: found loaded component isolated
> [node513:32514] mca: base: components_register: component isolated has no register or open function
> [node513:32514] mca: base: components_register: found loaded component rsh
> [node513:32514] mca: base: components_register: component rsh register function successful
> [node513:32514] mca: base: components_register: found loaded component slurm
> [node513:32514] mca: base: components_register: component slurm register function successful
> [node513:32514] mca: base: components_register: found loaded component tm
> [node513:32514] mca: base: components_register: component tm register function successful
> [node513:32514] mca: base: components_open: opening plm components
> [node513:32514] mca: base: components_open: found loaded component isolated
> [node513:32514] mca: base: components_open: component isolated open function successful
> [node513:32514] mca: base: components_open: found loaded component rsh
> [node513:32514] mca: base: components_open: component rsh open function successful
> [node513:32514] mca: base: components_open: found loaded component slurm
> [node513:32514] mca: base: components_open: component slurm open function successful
> [node513:32514] mca: base: components_open: found loaded component tm
> [node513:32514] mca: base: components_open: component tm open function successful
> [node513:32514] mca:base:select: Auto-selecting plm components
> [node513:32514] mca:base:select:( plm) Querying component [isolated]
> [node513:32514] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [node513:32514] mca:base:select:( plm) Querying component [rsh]
> [node513:32514] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [node513:32514] mca:base:select:( plm) Querying component [slurm]
> [node513:32514] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [node513:32514] mca:base:select:( plm) Querying component [tm]
> [node513:32514] mca:base:select:( plm) Selected component [slurm]
> [node513:32514] mca: base: close: component isolated closed
> [node513:32514] mca: base: close: unloading component isolated
> [node513:32514] mca: base: close: component rsh closed
> [node513:32514] mca: base: close: unloading component rsh
> [node513:32514] mca: base: close: component tm closed
> [node513:32514] mca: base: close: unloading component tm
> [node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
>     srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
> 10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
> srun: unrecognized option '--cpu_bind=none'
> srun: unrecognized option '--cpu_bind=none'
> Try "srun --help" for more information
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [node513:32514] mca: base: close: component slurm closed
> [node513:32514] mca: base: close: unloading component slurm
> user1@node-login7g:~/git/slurm-jobs$
>
> Here is some ompi_info output:
>
> Package: Open MPI user1@node-login7g Distribution
> Open MPI: 4.0.1
> Open MPI repo revision: v4.0.1
> Open MPI release date: Mar 26, 2019
> Open RTE: 4.0.1
> Open RTE repo revision: v4.0.1
> Open RTE release date: Mar 26, 2019
> OPAL: 4.0.1
> OPAL repo revision: v4.0.1
> OPAL release date: Mar 26, 2019
> MPI API: 3.1.0
> Ident string: 4.0.1
> Prefix: /sw/dev/openmpi401
> Configured architecture: x86_64-unknown-linux-gnu
> Configure host: node-login7g
> Configured by: user1
> Configured on: Wed Jul 3 12:13:45 EDT 2019
> Configure host: node-login7g
> Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' '--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' '--without-load-leveler' '--disable-memchecker' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
> Built by: user1
> Built on: Wed Jul 3 12:24:11 EDT 2019
> Built host: node-login7g
> C bindings: yes
> C++ bindings: yes
> Fort mpif.h: yes (all)
> Fort use mpi: yes (limited: overloading)
> Fort use mpi size: deprecated-ompi-info-value
> Fort use mpi_f08: no
> Fort mpi_f08 compliance: The mpi_f08 module was not built
> Fort mpi_f08 subarrays: no
> Java bindings: no
> Wrapper compiler rpath: runpath
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C compiler family name: GNU
> C compiler version: 4.8.5
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fort compiler: gfortran
> Fort compiler abs: /usr/bin/gfortran
>
> …..

_______________________________________________
devel mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/devel
