I am in the process of setting up a Grid Engine (SGE) cluster for running Open MPI applications. I'll detail the setup below, but my current problem is that this call to Spawn_multiple never seems to return.
    // Spawn all of the children processes.
    _intercomm = MPI::COMM_WORLD.Spawn_multiple( _nProc,
        const_cast<const char **>(_command),
        const_cast<const char ***>(_arg),
        _maxProc, _info, 0, errCode );

I'm new to both SGE and MPI, which is making this problem difficult for me to troubleshoot.

I can schedule simple (non-MPI) jobs on the SGE grid with qsub. I can also use qsub to schedule multiple copies of a simple Hello World type of application, using mpirun to spawn the processes in a script like this:

    #!/bin/sh
    #
    #$ -S /bin/sh
    #$ -V
    #$ -pe orte 4
    #$ -cwd
    #$ -j yes
    export LD_LIBRARY_PATH=/${VXR_STATIC}/openmpi-1.5.4/lib
    mpirun -np 4 ./mpihello $*

That seems to work. The processes report the hostname where they were run, and they appear to be scheduled on different machines in my SGE grid.

The problem is with a program, mpitest, that tries to use Spawn_multiple to launch multiple child processes. The script that I submit to the SGE grid looks like this:

    #!/bin/sh
    #
    #$ -S /bin/sh
    #$ -V
    #$ -pe orte 1-
    #$ -cwd
    #$ -j yes
    export LD_LIBRARY_PATH=/${VXR_STATIC}/openmpi-1.5.4/lib
    ./mpitest $*

The mpitest program is the one that calls Spawn_multiple. In this case, it just tries to run multiple copies of itself.

If I restrict my SGE configuration so that the orte parallel environment has to run all jobs on a single host, then mpitest runs to completion, spawning 4 "child" processes that are scheduled via SGE to run on the same host as the root process. The processes Send and Recv some messages, and the program exits.

If I permit SGE to schedule jobs on multiple hosts, then the child processes appear to be scheduled and launched. (That is, I can see them as children of the sge_execd and sge_shepherd processes on various machines.) But the original call to Spawn_multiple doesn't appear to return in the root mpitest. I assume that there's some problem setting up the communications channel among the different processes, but it's possible that my mpitest code is just buggy.
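One way to confirm which hosts the orte parallel environment actually granted in the multi-host case might be to log the contents of the PE hostfile from the job script. The following is only a sketch: summarize_pe_hosts is a hypothetical helper name (it is not part of the scripts above), and it assumes the usual hostfile layout of one "host slots queue processor-range" entry per line, with SGE setting PE_HOSTFILE and NSLOTS inside a parallel job.

```shell
#!/bin/sh
# Hypothetical helper: print one "host slots" pair per line from an SGE
# PE hostfile. Assumes each line looks like: host slots queue processor-range
summarize_pe_hosts() {
    hostfile="$1"
    if [ -z "$hostfile" ] || [ ! -r "$hostfile" ]; then
        echo "no readable PE hostfile: ${hostfile:-<unset>}" >&2
        return 1
    fi
    # Skip blank or malformed lines; keep only the host and slot columns.
    awk 'NF >= 2 { print $1, $2 }' "$hostfile"
}

# Inside a parallel job, log the allocation before launching mpitest.
# Outside of SGE, PE_HOSTFILE is unset and this block does nothing.
if [ -n "${PE_HOSTFILE:-}" ]; then
    echo "NSLOTS=${NSLOTS:-unset}"
    summarize_pe_hosts "$PE_HOSTFILE"
fi
```

With -j yes in the job script, this output would end up in the same file as the output from mpitest, so the granted hosts could be compared against the machines where the child processes show up.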
I already tried disabling the firewall on all of the machines. I'm not sure how else to get useful debug information at this stage of the troubleshooting.

It would be great if someone could look at the attached code and just let me know whether what I'm doing is horribly incorrect. If it should work, then I can focus on systems and SGE configuration issues. If the code is broken and really shouldn't work, then I'd like to fix that first, of course.

Thanks,
---Tom
Attachment: mpitest.tgz