Lawrence,

My apologies for the incorrect link. The patch is at
https://github.com/open-mpi/ompi/pull/6672.patch
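If it helps, here is a rough sketch of one way to confirm the srun option rename and then apply the patch to an Open MPI source tree. The source directory ($HOME/src/openmpi-4.0.x), the patch file name and the -j value below are only placeholders, so adjust them for your environment:

    # check what the installed srun actually accepts
    # (newer SLURM releases only know --cpu-bind, not --cpu_bind)
    srun --version
    srun --help 2>&1 | grep -i 'cpu.bind'

    # from the top of the Open MPI source tree ($HOME/src/openmpi-4.0.x is a placeholder)
    cd $HOME/src/openmpi-4.0.x

    # download the patch (curl -L follows GitHub's redirect)
    curl -L -o cpu-bind.patch https://github.com/open-mpi/ompi/pull/6672.patch

    # the .patch file is in git format-patch style, so strip the a/ and b/ prefixes with -p1
    patch -p1 < cpu-bind.patch

    # rebuild and reinstall into the same prefix you configured before
    make -j 8 && make install

If your tree is a git clone of the v4.0.x branch rather than a release tarball, "git am cpu-bind.patch" should also work.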
Cheers,

Gilles

On Sat, Jul 6, 2019 at 12:21 PM Gilles Gouaillardet
<[email protected]> wrote:
>
> Lawrence,
>
> this is a known issue (the --cpu_bind option was removed from SLURM in
> favor of the --cpu-bind option) and the fix will be available in Open
> MPI 4.0.1
>
> Meanwhile, you can manually download and apply the patch at
> https://github.com/open-mpi/ompi/pull/6445.patch or use a nightly
> build of the v4.0.x branch.
>
> Cheers,
>
> Gilles
>
> On Sat, Jul 6, 2019 at 5:35 AM Sorrillo, Lawrence via devel
> <[email protected]> wrote:
> >
> > Hello,
> >
> > My attempt to run and troubleshoot an OMPI job under a SLURM allocation
> > does not work as I would expect.
> >
> > The result below has led me to believe that, under the hood, in this
> > setup (SLURM with OMPI) the correct srun options are not being used
> > when I call mpirun directly.
> >
> > Specifically, the "cpu-bind=none" option is breaking, but it also looks
> > like the nodelist is incorrect.
> >
> > The job script:
> >
> > #!/bin/bash
> >
> > #SBATCH --job-name=mpi-hostname
> > #SBATCH --partition=dev
> > #SBATCH --account=Account1
> > #SBATCH --time=01:00:00
> > #SBATCH --nodes=2
> > #SBATCH --ntasks-per-node=1
> > #SBATCH --begin=now+10
> > #SBATCH --output="%x-%u-%N-%j.txt"   # jobName-userId-hostName-jobId.txt
> >
> > # ---------------------------------------------------------------------- #
> > #module load DefApps
> > #module purge >/dev/null 2>&1
> > ##module load staging/slurm >/dev/null 2>&1
> > #module load gcc/4.8.5 openmpi >/dev/null 2>&1
> > #module --ignore_cache spider openmpi/3.1.3 >/dev/null 2>&1
> > #
> > # ---------------------------------------------------------------------- #
> > #
> > MPI_RUN=$(which orterun)
> > if [[ -z "${MPI_RUN:+x}" ]]; then
> >     echo "ERROR: Cannot find 'mpirun' executable..exiting"
> >     exit 1
> > fi
> >
> > echo
> > #CMD="orterun -npernode 1 -np 2 /bin/hostname"
> > #CMD="srun /bin/hostname"
> > #CMD="srun -N2 -n2 --mpi=pmi2 /bin/hostname"
> > #MCMD="/sw/dev/openmpi401/bin/mpirun --bind-to-core --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
> > MCMD="/sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname"
> > echo "INFO: Executing the command: $MCMD"
> > $MCMD
> > sync
> >
> > Here is the output:
> >
> > user1@node-login7g:~/git/slurm-jobs$ more mpi-hostname-user1-node513-835.txt
> >
> > INFO: Executing the command: /sw/dev/openmpi401/bin/mpirun --report-bindings -mca btl openib,self -mca plm_base_verbose 10 /bin/hostname
> > [node513:32514] mca: base: components_register: registering framework plm components
> > [node513:32514] mca: base: components_register: found loaded component isolated
> > [node513:32514] mca: base: components_register: component isolated has no register or open function
> > [node513:32514] mca: base: components_register: found loaded component rsh
> > [node513:32514] mca: base: components_register: component rsh register function successful
> > [node513:32514] mca: base: components_register: found loaded component slurm
> > [node513:32514] mca: base: components_register: component slurm register function successful
> > [node513:32514] mca: base: components_register: found loaded component tm
> > [node513:32514] mca: base: components_register: component tm register function successful
> > [node513:32514] mca: base: components_open: opening plm components
> > [node513:32514] mca: base: components_open: found loaded component isolated
> > [node513:32514] mca: base: components_open: component isolated open function successful
> > [node513:32514] mca: base: components_open: found loaded component rsh
> > [node513:32514] mca: base: components_open: component rsh open function successful
> > [node513:32514] mca: base: components_open: found loaded component slurm
> > [node513:32514] mca: base: components_open: component slurm open function successful
> > [node513:32514] mca: base: components_open: found loaded component tm
> > [node513:32514] mca: base: components_open: component tm open function successful
> > [node513:32514] mca:base:select: Auto-selecting plm components
> > [node513:32514] mca:base:select:( plm) Querying component [isolated]
> > [node513:32514] mca:base:select:( plm) Query of component [isolated] set priority to 0
> > [node513:32514] mca:base:select:( plm) Querying component [rsh]
> > [node513:32514] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [node513:32514] mca:base:select:( plm) Querying component [slurm]
> > [node513:32514] mca:base:select:( plm) Query of component [slurm] set priority to 75
> > [node513:32514] mca:base:select:( plm) Querying component [tm]
> > [node513:32514] mca:base:select:( plm) Selected component [slurm]
> > [node513:32514] mca: base: close: component isolated closed
> > [node513:32514] mca: base: close: unloading component isolated
> > [node513:32514] mca: base: close: component rsh closed
> > [node513:32514] mca: base: close: unloading component rsh
> > [node513:32514] mca: base: close: component tm closed
> > [node513:32514] mca: base: close: unloading component tm
> > [node513:32514] [[4367,0],0] plm:slurm: final top-level argv:
> >     srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=node514 --ntasks=1 orted -mca orte_report_bindings "1" -mca ess "slurm" -mca ess_base_jobid "286195712" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[3:513-514]@0(2)" -mca orte_hnp_uri "286195712.0;tcp://172.30.146
> > 10.38.146.45:43031" -mca btl "openib,self" -mca plm_base_verbose "10"
> > srun: unrecognized option '--cpu_bind=none'
> > srun: unrecognized option '--cpu_bind=none'
> > Try "srun --help" for more information
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> > [node513:32514] mca: base: close: component slurm closed
> > [node513:32514] mca: base: close: unloading component slurm
> > user1@node-login7g:~/git/slurm-jobs$
> >
> > Here is some ompi_info output:
> >
> > Package: Open MPI user1@node-login7g Distribution
> > Open MPI: 4.0.1
> > Open MPI repo revision: v4.0.1
> > Open MPI release date: Mar 26, 2019
> > Open RTE: 4.0.1
> > Open RTE repo revision: v4.0.1
> > Open RTE release date: Mar 26, 2019
> > OPAL: 4.0.1
> > OPAL repo revision: v4.0.1
> > OPAL release date: Mar 26, 2019
> > MPI API: 3.1.0
> > Ident string: 4.0.1
> > Prefix: /sw/dev/openmpi401
> > Configured architecture: x86_64-unknown-linux-gnu
> > Configure host: node-login7g
> > Configured by: user1
> > Configured on: Wed Jul 3 12:13:45 EDT 2019
> > Configure host: node-login7g
> > Configure command line: '--prefix=/sw/dev/openmpi401' '--enable-shared' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/usr' '--without-psm' '--without-libfabric' '--without-mxm' '--with-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--with-tm' '--without-load-leveler' '--disable-memchecker' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
> > Built by: user1
> > Built on: Wed Jul 3 12:24:11 EDT 2019
> > Built host: node-login7g
> > C bindings: yes
> > C++ bindings: yes
> > Fort mpif.h: yes (all)
> > Fort use mpi: yes (limited: overloading)
> > Fort use mpi size: deprecated-ompi-info-value
> > Fort use mpi_f08: no
> > Fort mpi_f08 compliance: The mpi_f08 module was not built
> > Fort mpi_f08 subarrays: no
> > Java bindings: no
> > Wrapper compiler rpath: runpath
> > C compiler: gcc
> > C compiler absolute: /usr/bin/gcc
> > C compiler family name: GNU
> > C compiler version: 4.8.5
> > C++ compiler: g++
> > C++ compiler absolute: /usr/bin/g++
> > Fort compiler: gfortran
> > Fort compiler abs: /usr/bin/gfortran
> >
> > …..

_______________________________________________
devel mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/devel
