Hi,

> On 07.04.2017 at 09:42, Yong Wu <[email protected]> wrote:
>
> Hi all,
> I submit a parallel ORCA (Quantum Chemistry Program) job on multiple nodes
> in Rocks SGE, and get the following error information:
> --------------------------------------------------------------------------
> A hostfile was provided that contains at least one node not
> present in the allocation:
>
>   hostfile: test.nodes
>   node:     compute-0-67
>
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names; see the orte_hosts man page for
> further information.
> --------------------------------------------------------------------------
Although a nodefile is not necessary (see below on how to get rid of it), the
error might point to a bug in Open MPI. Can you please post the content of the
$PE_HOSTFILE and the converted test.nodes for one run, together with the
allocation you got:
qstat -g t
(You can limit the output to your user account and all the lines belonging to
the job in question.)
> The ORCA program was compiled with Open MPI; here I used the orte parallel
> environment in Rocks SGE.
Well, you can decide whether I answer here or on the ORCA list ;-)
> $ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
This is fine.
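Just for completeness: the PE must also appear in the pe_list of the queue(s)
it is used with, which apparently is already the case for you since the job
starts at all. A sketch, assuming the queue is named all.q:

$ qconf -aattr queue pe_list orte all.q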
> The submitted sge script:
> #!/bin/bash
> # Job submission script:
> # Usage: qsub <this_script>
> #
> #$ -cwd
> #$ -j y
> #$ -o test.sge.o$JOB_ID
> #$ -S /bin/bash
> #$ -N test
> #$ -pe orte 24
> #$ -l h_vmem=3.67G
> #$ -l h_rt=1240:00:00
>
> # go to work dir
> cd $SGE_O_WORKDIR
There is a switch for this, which you already set above:
#$ -cwd
hence the explicit `cd` is not necessary.
>
> # load the module env for ORCA
> source /usr/share/Modules/init/sh
> module load intel/compiler/2011.7.256
> source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
The "mpivars.sh" seems not to be in the default Open MPI compilation. Where is
it coming from, what's inside?
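Open MPI itself doesn't ship such a script; if it was added by hand, it
presumably just prepends the installation to the search paths, roughly like
this (a sketch, the actual content of your file is unknown to me):

export PATH=/share/apps/mpi/openmpi2.0.2-ifort/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/mpi/openmpi2.0.2-ifort/lib:$LD_LIBRARY_PATH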
Did you compile Open MPI with "--with-sge" in the ./configure step? In case you
didn't compile it on your own, you can check whether SGE support was included:
$ ompi_info | grep grid
MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.1.0)
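In case the gridengine component is missing and you have to rebuild Open MPI, a
sketch of the configure step (the prefix and the Intel compilers are assumed
from your module setup):

$ ./configure --prefix=/share/apps/mpi/openmpi2.0.2-ifort --with-sge \
    CC=icc CXX=icpc FC=ifort
$ make -j 8 && make install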
> export orcapath=/share/apps/orca4.0.0
> export RSH_COMMAND="ssh"
>
> #create scratch dir on nfs dir
> tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
> mkdir -p $tdir
>
> #cat $PE_HOSTFILE
>
> PeHostfile2MachineFile()
> {
> cat $1 | while read line; do
> # echo $line
> host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> nslots=`echo $line|cut -f2 -d" "`
> i=1
> while [ $i -le $nslots ]; do
> # add here code to map regular hostnames into ATM hostnames
> echo $host
> i=`expr $i + 1`
> done
> done
> }
>
> PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes
In former times, this conversion was done in the start_proc_args. Nowadays you
need neither this conversion, nor any "machines" file, nor the "test.nodes"
file. Open MPI will detect the correct number of slots to use on each node on
its own.
Only some multi-serial computations in ORCA need rsh/ssh and a nodefile (I have
to check whether they don't just pull the information out of an `mpiexec`
instead).
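To see the tight integration at work, you can put a quick test into the job
script; with a gridengine-aware Open MPI, mpirun reads the granted slots
directly from SGE, so neither -np nor a hostfile is needed (a sketch):

$ mpirun hostname

This should print each granted node once per slot; ORCA will later invoke
mpirun on its own in the same way.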
> cp ${SGE_O_WORKDIR}/test.inp $tdir
>
> cd $tdir
Side note:
Several types of jobs seem to exist in ORCA:
- some types of ORCA jobs can compute happily in $TMPDIR, using the scratch
  directory on the nodes (even if the job needs more than one machine)
- some need a shared scratch directory, like the one you create here in the
  shared /home
- some will start several serial processes on the granted nodes via the defined
  $RSH_COMMAND
-- Reuti
>
> echo "ORCA job start at" `date`
>
> time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log
>
> rm ${tdir}/test.inp
> rm ${tdir}/test.*tmp 2>/dev/null
> rm ${tdir}/test.*tmp.* 2>/dev/null
> mv ${tdir}/test.* $SGE_O_WORKDIR
>
> echo "ORCA job finished at" `date`
>
> echo "Work Dir is : $SGE_O_WORKDIR"
>
> rm -rf $tdir
> rm $SGE_O_WORKDIR/test.sge
>
>
> However, the job can run normally on multiple nodes in Torque.
>
> Can someone help me? Thanks very much!
>
> Best regards!
> Yong Wu
