Hi all,
I submitted a parallel ORCA (quantum chemistry program) job across multiple
nodes under SGE on Rocks, and got the following error message:
--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:
hostfile: test.nodes
node: compute-0-67
If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------
The ORCA program was compiled with Open MPI. Here I used the orte parallel
environment in Rocks SGE:
$ qconf -sp orte
pe_name orte
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
The SGE script I submitted:
#!/bin/bash
# Job submission script:
# Usage: qsub <this_script>
#
#$ -cwd
#$ -j y
#$ -o test.sge.o$JOB_ID
#$ -S /bin/bash
#$ -N test
#$ -pe orte 24
#$ -l h_vmem=3.67G
#$ -l h_rt=1240:00:00
# go to work dir
cd $SGE_O_WORKDIR
# load the module env for ORCA
source /usr/share/Modules/init/sh
module load intel/compiler/2011.7.256
source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
export orcapath=/share/apps/orca4.0.0
export RSH_COMMAND="ssh"
# create scratch dir on NFS
tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
mkdir -p $tdir
#cat $PE_HOSTFILE
PeHostfile2MachineFile()
{
    cat $1 | while read line; do
        # echo $line
        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
        nslots=`echo $line|cut -f2 -d" "`
        i=1
        while [ $i -le $nslots ]; do
            # add here code to map regular hostnames into ATM hostnames
            echo $host
            i=`expr $i + 1`
        done
    done
}
PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes
cp ${SGE_O_WORKDIR}/test.inp $tdir
cd $tdir
echo "ORCA job start at" `date`
time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log
rm ${tdir}/test.inp
rm ${tdir}/test.*tmp 2>/dev/null
rm ${tdir}/test.*tmp.* 2>/dev/null
mv ${tdir}/test.* $SGE_O_WORKDIR
echo "ORCA job finished at" `date`
echo "Work Dir is : $SGE_O_WORKDIR"
rm -rf $tdir
rm $SGE_O_WORKDIR/test.sge
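
To show what ends up in test.nodes, here is a minimal standalone sketch of the same conversion, run against a made-up hostfile rather than a real $PE_HOSTFILE. The sample contents (the ".local" domain, the all.q queue name, the second host, and the slot counts) and the function name pe_hostfile_to_machinefile are assumptions for illustration only; it is a rewrite of the idea, not the exact function from the script above.

```shell
#!/bin/bash
# Standalone sketch: expand a pe_hostfile-style file into one short
# hostname per slot. All sample data below is hypothetical.

pe_hostfile_to_machinefile() {
    # Each input line is assumed to look like:
    #   <hostname> <slots> <queue> <processor range>
    while read -r host nslots _; do
        short=${host%%.*}      # drop the domain part, like cut -f1 -d"."
        i=1
        while [ "$i" -le "$nslots" ]; do
            echo "$short"      # one line per slot
            i=$((i + 1))
        done
    done < "$1"
}

# Made-up stand-in for $PE_HOSTFILE
sample=$(mktemp)
cat > "$sample" <<'EOF'
compute-0-67.local 2 all.q@compute-0-67.local UNDEFINED
compute-0-12.local 1 all.q@compute-0-12.local UNDEFINED
EOF

machinefile=$(pe_hostfile_to_machinefile "$sample")
echo "$machinefile"
rm -f "$sample"
```

With this sample input the names written out are the short ones (compute-0-67, not compute-0-67.local). Whether that difference from the names in $PE_HOSTFILE has anything to do with the error above is only a guess on my part.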
However, the same job runs normally on multiple nodes under Torque.
Can someone help me? Thanks very much!
Best regards!
Yong Wu