I'm launching via qsub; the line from my previous message is from the corresponding qsub job submission script. FWIW, here is the whole script:
--------------------
#!/bin/bash
#PBS -N FOO
#PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
#PBS -l walltime=00:01:00
#PBS -o "foo.out"
#PBS -e "foo.err"
#PBS -q myqueue
#PBS -V

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE >nodes.txt

mpipath=/usr/mpi/gcc/openmpi-4.1.2rc2
mpibinpath=$mpipath/bin
mpilibpath=$mpipath/lib64
export PATH=$mpibinpath:$PATH
export LD_LIBRARY_PATH=$mpilibpath:$LD_LIBRARY_PATH

mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
--------------------

Here, "foo" is a small MPI program that just prints OMPI_COMM_WORLD_LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_SIZE, and exits.

On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
<users@lists.open-mpi.org> wrote:
>
> Are you launching the job with "mpirun"? I'm not familiar with that cmd line
> and don't know what it does.
>
> The most likely explanation is that the mpirun from the prebuilt versions
> doesn't have TM support, and therefore doesn't understand the 1ppn directive
> in your cmd line. My guess is that you are using the ssh launcher - what is
> odd is that you should wind up with two procs on the first node, in which
> case those envars are correct. If you are seeing one proc on each node, then
> something is wrong.
>
> > On Jan 18, 2022, at 1:33 PM, Crni Gorac via users
> > <users@lists.open-mpi.org> wrote:
> >
> > I have one process per node; here is the corresponding line from my job
> > submission script (with compute nodes named "node1" and "node2"):
> >
> > #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >
> > On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> > <users@lists.open-mpi.org> wrote:
> >>
> >> Afraid I can't understand your scenario - when you say you "submit a job"
> >> to run on two nodes, how many processes are you running on each node?
> >>
> >>
> >>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users
> >>> <users@lists.open-mpi.org> wrote:
> >>>
> >>> I'm using OpenMPI 4.1.2 from the MLNX_OFED_LINUX-5.5-1.0.3.2
> >>> distribution, and have PBS 18.1.4 installed on my cluster (cluster
> >>> nodes are running CentOS 7.9). When I submit a job that runs on two
> >>> nodes in the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set
> >>> to 2, instead of 1, and their OMPI_COMM_WORLD_LOCAL_RANK values are
> >>> set to 0 and 1, instead of both being 0. At the same time, the
> >>> hostfile generated by PBS ($PBS_NODEFILE) correctly lists the two
> >>> nodes.
> >>>
> >>> I've tried OpenMPI 3 from HPC-X, and the same thing happens there too.
> >>> However, when I build OpenMPI myself (the notable difference from the
> >>> above-mentioned pre-built MPI versions is that I use the "--with-tm"
> >>> option to point to my PBS installation), then
> >>> OMPI_COMM_WORLD_LOCAL_SIZE and OMPI_COMM_WORLD_LOCAL_RANK are set
> >>> properly.
> >>>
> >>> I'm not sure how to debug the problem, or whether it can be fixed at
> >>> all with a pre-built OpenMPI version, so any suggestion is welcome.
> >>>
> >>> Thanks.
> >>
> >>
> >
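For reference, the behavior that "foo" checks can be reproduced without compiling anything, since mpirun exports those variables into each rank's environment. A minimal stand-in (a sketch, not the original MPI program) is just:

```shell
#!/bin/bash
# Sketch of a stand-in for "foo": print the per-node rank/size variables
# that Open MPI's mpirun exports to every launched process.
# Outside of mpirun the variables don't exist, so "unset" is printed.
line="host=$(hostname) local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-unset} local_size=${OMPI_COMM_WORLD_LOCAL_SIZE:-unset}"
echo "$line"
```

Launched as `mpirun -n 2 -hostfile $PBS_NODEFILE ./foo.sh`, a correct one-process-per-node placement should print local_rank=0 and local_size=1 on both nodes; the reported misbehavior would show local_rank=0/1 and local_size=2 instead.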
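Since the difference Ralph points at is TM support in the prebuilt mpirun, one way to confirm is to ask ompi_info whether the tm launch/allocation components are present. A sketch (the /opt/pbs prefix in the rebuild comment is an assumption — point --with-tm at your actual PBS installation):

```shell
# Check whether the Open MPI build on PATH contains the PBS/TM components.
# If the grep finds nothing, the build has no TM support and mpirun falls
# back to the ssh launcher, ignoring the PBS placement directives.
if command -v ompi_info >/dev/null 2>&1; then
    msg=$(ompi_info | grep -E 'MCA (plm|ras): tm' || echo "no TM support in this build")
else
    msg="ompi_info not found on PATH"
fi
echo "$msg"

# Rebuilding with TM support (the /opt/pbs prefix is an assumption):
#   ./configure --with-tm=/opt/pbs ... && make && make install
```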