Hi all,

Firstly, hello to the mailing list for the first time!  Secondly, sorry for the 
nondescript subject line, but I couldn't think how to be more specific!

Anyway, I am having a problem getting Open MPI to work within my 
installation of SGE 6.2u5.  I compiled Open MPI 1.4.2 from source and installed 
it under /usr/local/packages/openmpi-1.4.2.  Software on my system is managed 
by the Modules framework, which adds the bin and lib directories to PATH and 
LD_LIBRARY_PATH respectively when a user connects to an execution node.  I 
configured a parallel environment in which Open MPI is to be used: 

pe_name            mpi
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

I then tried a simple job submission script:

#!/bin/bash
#
#$ -S /bin/bash
. /etc/profile
module add ompi gcc
mpirun hostname

If the job runs within a single execution host (8 slots per host), all is 
fine.  However, if it is scheduled across several nodes, I get an error:

execv: No such file or directory
execv: No such file or directory
execv: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 1629) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished


I'm at a loss as to how to start debugging this, and I don't seem to get 
anything useful from the mpirun '-d' and '-v' switches.  The SGE logs don't 
record anything either.  Can anyone suggest what is wrong, or how I might go 
about getting more information?
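
In case it helps frame the question, here is a rough sketch of the diagnostic 
lines I could add to the job script after "module add ompi gcc".  It is only a 
guess at where to look: I am assuming the execv error means some executable 
could not be found on PATH, and that qrsh (which I believe Open MPI uses to 
start its daemons under SGE) is one candidate.  The check_cmd helper is a name 
I made up for illustration.

```shell
# Diagnostic lines for the job script, placed after "module add ompi gcc".

# Hypothetical helper: report where a command resolves on the current PATH,
# or say that it does not resolve at all.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: NOT FOUND on PATH"
    fi
}

echo "host: $(hostname)"
echo "PATH=$PATH"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"

# Assumption: these are the executables involved in a multi-node launch.
for cmd in qrsh mpirun orted; do
    check_cmd "$cmd"
done

# Show the hosts and slots SGE granted, if running inside a parallel environment.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    cat "$PE_HOSTFILE"
fi

# List any gridengine components in this Open MPI build.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep gridengine || echo "no gridengine components listed"
fi
```

This only reports the environment on the node where the job script itself 
runs, so it would not by itself show what the remote nodes see.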

Many thanks,


Chris



--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778





