I'm launching using qsub; the line from my previous message is from
the corresponding qsub job submission script.  FWIW, here is the whole
script:

--------------------
#!/bin/bash
#PBS -N FOO
#PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
#PBS -l walltime=00:01:00
#PBS -o "foo.out"
#PBS -e "foo.err"
#PBS -q myqueue
#PBS -V

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE >nodes.txt

mpipath=/usr/mpi/gcc/openmpi-4.1.2rc2
mpibinpath=$mpipath/bin
mpilibpath=$mpipath/lib64
export PATH=$mpibinpath:$PATH
export LD_LIBRARY_PATH=$mpilibpath:$LD_LIBRARY_PATH

mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
--------------------

Here, "foo" is a small MPI program that just prints
OMPI_COMM_WORLD_LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_SIZE, and exits.
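
For reference, a minimal program of that kind might look roughly like the
sketch below (my reconstruction, assuming the two values are simply read
from the environment and printed; the actual foo may differ in detail):

--------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Open MPI places these variables in each launched process's environment. */
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    const char *lsize = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");

    printf("OMPI_COMM_WORLD_LOCAL_RANK=%s OMPI_COMM_WORLD_LOCAL_SIZE=%s\n",
           lrank ? lrank : "(unset)", lsize ? lsize : "(unset)");

    MPI_Finalize();
    return 0;
}
--------------------

(Compiled with mpicc and launched exactly as in the script above.)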

On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
<users@lists.open-mpi.org> wrote:
>
> Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
> and don't know what it does.
>
> Most likely explanation is that the mpirun from the prebuilt versions doesn't 
> have TM support, and therefore doesn't understand the 1ppn directive in your 
> cmd line. My guess is that you are using the ssh launcher - what is odd is 
> that you should wind up with two procs on the first node, in which case those 
> envars are correct. If you are seeing one proc on each node, then something 
> is wrong.
>
>
> > On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
> > <users@lists.open-mpi.org> wrote:
> >
> > I have one process per node; here is the corresponding line from my job
> > submission script (with compute nodes named "node1" and "node2"):
> >
> > #PBS -l 
> > select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >
> > On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> > <users@lists.open-mpi.org> wrote:
> >>
> >> Afraid I can't understand your scenario - when you say you "submit a job" 
> >> to run on two nodes, how many processes are you running on each node??
> >>
> >>
> >>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
> >>> <users@lists.open-mpi.org> wrote:
> >>>
> >>> I'm using OpenMPI 4.1.2 from the MLNX_OFED_LINUX-5.5-1.0.3.2 distribution,
> >>> and have PBS 18.1.4 installed on my cluster (cluster nodes are running
> >>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> >>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> >>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK set to 0 and 1,
> >>> instead of both being 0.  At the same time, the hostfile generated by
> >>> PBS ($PBS_NODEFILE) correctly lists both nodes.
> >>>
> >>> I've tried OpenMPI 3 from HPC-X, and the same thing happens.
> >>> However, when I build OpenMPI myself (the notable difference from the
> >>> above-mentioned pre-built MPI versions is that I use the "--with-tm"
> >>> option to point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE
> >>> and OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >>>
> >>> I'm not sure how to debug the problem, or whether it is possible to
> >>> fix it at all with a pre-built OpenMPI version, so any suggestion is
> >>> welcome.
> >>>
> >>> Thanks.
> >>
> >>
>
>
