To checkpoint Open MPI jobs with BLCR, we have configured Open MPI as
follows:

./configure --prefix=/data1/packages/openmpi/1.5.1-blcr-without-tm \
    --disable-openib-connectx-xrc --disable-openib-rdmacm --with-ft=cr \
    --enable-mpi-threads --enable-ft-thread --with-blcr=/usr \
    --with-blcr-libdir=/usr/include --without-tm

When used in conjunction with Torque 2.5.3, we are able to start the
following job with 8 cores on one node. If we try to start the same job
with 4 cores on each of two nodes, it starts 4 ranks on the primary
node but never starts the remaining 4 ranks on the second node.

$ cat PBStest
#!/bin/sh
#PBS -c enabled
#PBS -l walltime=25:00:00
#PBS -l nodes=2:ppn=4
#PBS -m ae
#PBS -M g...@ansto.gov.au
#PBS -N Prob8
#PBS -r n
#PBS -q blcrq
source /etc/profile.d/00-modules.sh
module load mpi/openmpi_1.5-blcr-without-tm
NN=`cat $PBS_NODEFILE | wc -l`
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > hostfile
cat $PBS_NODEFILE
pwd
echo "NN = $NN "
date
which mpirun
cd $PBS_O_WORKDIR
mpirun -am ft-enable-cr  -machinefile hostfile  ex5mpi  testData   
 --------------------------------------------------------------
The hostfile correctly lists the primary node 4 times, and then the
second node 4 times.
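
For reference, the per-host slot count in the hostfile can be checked
with a short sketch like the one below. The function name count_slots
is only illustrative and is not part of Open MPI or Torque; it assumes
the hostfile has one host name per line, as $PBS_NODEFILE does. For the
job above it should report 4 slots for each of the two nodes.

```shell
#!/bin/sh
# Count how many slots each host contributes in a machinefile.
# count_slots is a hypothetical helper introduced for illustration.
count_slots() {
    # sort groups identical host names together;
    # uniq -c prefixes each distinct name with its number of occurrences;
    # awk reorders the output to "<host> slots=<count>".
    sort "$1" | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Only run against the hostfile if the job script has already written it.
if [ -f hostfile ]; then
    count_slots hostfile
fi
```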

When Open MPI is built --with-tm, which is the default if --without-tm
is not specified, the job correctly starts on the 8 cores spread across
the two nodes.

To checkpoint the MPI job correctly, BLCR needs cr_mpirun to start the
job without Torque support.

My question is whether the script above can be modified to start on
multiple nodes when Open MPI has been built with --without-tm and, if
so, what needs to be added to or deleted from the script?
I have also tried -mca plm ^tm with Open MPI built --with-tm; that
likewise does not start the second 4 MPI ranks.

Any suggestions gratefully accepted.
Greg Doherty
ANSTO 


