To checkpoint Open MPI jobs with BLCR, we have configured Open MPI as follows:

./configure --prefix=/data1/packages/openmpi/1.5.1-blcr-without-tm \
    --disable-openib-connectx-xrc --disable-openib-rdmacm \
    --with-ft=cr --enable-mpi-threads --enable-ft-thread \
    --with-blcr=/usr --with-blcr-libdir=/usr/include --without-tm

When this build is used with Torque 2.5.3, we can start the following job with 8 cores on one node. If we try to start the same job with 4 cores on each of two nodes, the job starts 4 processes on the primary node but does not start the remaining 4 on the second node.

$ cat PBStest
#!/bin/sh
#PBS -c enabled
#PBS -l walltime=25:00:00
#PBS -l nodes=2:ppn=4
#PBS -m ae
#PBS -M g...@ansto.gov.au
#PBS -N Prob8
#PBS -r n
#PBS -q blcrq
source /etc/profile.d/00-modules.sh
module load mpi/openmpi_1.5-blcr-without-tm
NN=`cat $PBS_NODEFILE | wc -l`
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > hostfile
cat $PBS_NODEFILE
pwd
echo "NN = $NN "
date
which mpirun
cd $PBS_O_WORKDIR
mpirun -am ft-enable-cr -machinefile hostfile ex5mpi testData
--------------------------------------------------------------

The hostfile correctly lists the primary node 4 times, and then the second node 4 times.

When Open MPI is built --with-tm, which is the default if --without-tm is not specified, the job correctly starts on the 8 cores spread across the two nodes. However, BLCR needs cr_mpirun to start the job without Torque (tm) support in order to checkpoint the MPI job correctly.

My question is whether the script above can be modified to start on multiple nodes when Open MPI has been built with --without-tm and, if so, what needs to be added to or deleted from the script.

I have also tried -mca plm ^tm with Open MPI built --with-tm; that does not start the second 4 MPI ranks either.

Any suggestions gratefully accepted.

Greg Doherty
ANSTO
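
For reference, the hostfile described above has one line per core. A minimal sketch of an alternative layout follows: it condenses a node list into Open MPI's "host slots=N" hostfile form so that an explicit process count can be passed with -np. The helper name condense_nodefile is hypothetical, and the commented mpirun line assumes the rsh/ssh launcher can reach the second node without a password prompt. Whether this resolves the problem in a --without-tm build has not been verified.

```shell
#!/bin/sh
# Sketch only. condense_nodefile is a hypothetical helper, not part of
# Torque or Open MPI. It reads a node list with one line per core and
# prints one "host slots=N" line per host, in first-seen order.
condense_nodefile() {
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++)
                   printf "%s slots=%d\n", order[i], n[order[i]] }' "$1"
}

# Possible use inside the job script (assumption: passwordless rsh/ssh
# between the allocated nodes, since a --without-tm build cannot use
# Torque to launch remote processes):
#   condense_nodefile "$PBS_NODEFILE" > hostfile
#   NN=`cat $PBS_NODEFILE | wc -l`
#   mpirun -np $NN -am ft-enable-cr -machinefile hostfile ex5mpi testData
```

With the 2 x 4 allocation from the script, the generated hostfile would contain two lines, one per node, each with slots=4.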