On 02/02/09 06:12, Reuti wrote:
On 02.02.2009 at 11:31, Sangamesh B wrote:

On Mon, Feb 2, 2009 at 12:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 02.02.2009 at 05:44, Sangamesh B wrote:

On Sun, Feb 1, 2009 at 10:37 PM, Reuti <re...@staff.uni-marburg.de> wrote:

On 01.02.2009 at 16:00, Sangamesh B wrote:

On Sat, Jan 31, 2009 at 6:27 PM, Reuti <re...@staff.uni-marburg.de>

On 31.01.2009 at 08:49, Sangamesh B wrote:

On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de>

On 30.01.2009 at 15:02, Sangamesh B wrote:

Dear Open MPI,

Do you have a solution for the following problem with Open MPI (1.3)
when it is run through Grid Engine?

I changed the global execd params to H_MEMORYLOCKED=infinity and
restarted sgeexecd on all nodes.

But still the problem persists:

$ cat err.77.CPMD-OMPI
ssh_exchange_identification: Connection closed by remote host

I think this might already be the reason why it's not working. Is a
program running fine through SGE?


Any Open MPI parallel job through SGE runs only if it's running on a
single node (i.e. 8 processes on 8 cores of a single node). If the number
of processes is more than 8, SGE schedules it on 2 nodes and
the job fails with the above error.

Now I did a loose integration of Open MPI 1.3 with SGE. The job runs,
but all 16 processes run on a single node.

What are the entries in `qconf -sconf` for:


$ qconf -sconf
execd_spool_dir              /opt/gridengine/default/spool
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd
qrsh_daemon                  /usr/sbin/sshd
reprioritize                 0

Do you have to use ssh? In a private cluster the rsh-based method is often fine,
or, with SGE 6.2, the built-in mechanism of SGE. Otherwise please follow this:


I think it's better to check once with Open MPI 1.2.8

What is your mpirun command in the jobscript? Are you getting
mpirun from Open MPI there? According to the output below, it's not a loose
integration, but you already prepare a machinefile, which is
Open MPI.

No. I've not prepared the machinefile for Open MPI.
For the Tight Integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS
$CPMDBIN/cpmd311-ompi-mkl.x  wf1.in $PP_LIBRARY >

For the loose integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS -hostfile
$TMPDIR/machines  $CPMDBIN/cpmd311-ompi-mkl.x  wf1.in $PP_LIBRARY >

a) you compiled Open MPI with "--with-sge"?

Yes, but ompi_info shows only one SGE component:

$ /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

b) when the $SGE_ROOT variable is set, Open MPI will use a Tight

In the SGE job submission script, I set SGE_ROOT= <nothing>

This will set the variable to an empty string. You need to use:

unset SGE_ROOT
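
The difference between an empty and an unset variable can be seen with a small POSIX shell sketch. The helper name `var_state` is made up for this illustration and is not part of SGE or Open MPI; the sketch only shows shell behaviour, not how Open MPI itself inspects the variable.

```shell
#!/bin/sh
# Hypothetical helper: report whether the named variable is
# unset, set but empty, or set to a non-empty value.
var_state() {
    # ${VAR+x} expands to "x" whenever VAR is set, even to an empty string.
    eval "is_set=\${$1+x}"
    eval "value=\${$1-}"
    if [ -z "$is_set" ]; then
        echo "unset"
    elif [ -z "$value" ]; then
        echo "set-but-empty"
    else
        echo "set"
    fi
}

SGE_ROOT=            # what "SGE_ROOT=" in a job script does
export SGE_ROOT
state_after_assign=$(var_state SGE_ROOT)

unset SGE_ROOT       # removes the variable from the environment entirely
state_after_unset=$(var_state SGE_ROOT)

echo "after 'SGE_ROOT=':      $state_after_assign"
echo "after 'unset SGE_ROOT': $state_after_unset"
```

A program that only tests whether the variable exists in its environment would still see it after the plain assignment, but not after `unset`.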

I used 'unset SGE_ROOT' in the job submission script. It's working now.
Hello world jobs now run on both single and multiple nodes.

Thank you for the help.

What can be the problem with tight integration?

There are obviously two issues for now with the Tight Integration for SGE:

- Some processes might throw an "err=2" for an unknown reason, and only from time to time, but run fine.

- Processes vanish into daemons although SGE's qrsh is used automatically (successive `ps -e f` calls show that it's called with "... orted --daemonize ..." for a short while). I overlooked this in my last post when I stated it's working, as my process allocation was fine; only the processes weren't bound to any sge_shepherd.
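
One way to check the second point on an execution node is to walk the parent chain of every orted process and see whether an sge_shepherd appears in it. The following is only a sketch: the function name `check_orted_parents` is invented here, and it assumes the command names show up as "orted" and "sge_shepherd" in the `ps` output.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID PPID COMMAND" lines on stdin and prints,
# for every orted process, whether one of its ancestors is an sge_shepherd.
check_orted_parents() {
    awk '
    { ppid[$1] = $2; comm[$1] = $3 }
    END {
        for (p in comm) {
            if (comm[p] !~ /^orted/) continue
            state = "unbound"
            q = ppid[p]
            steps = 0
            # Follow parent links until the chain leaves the table;
            # the step limit guards against malformed input.
            while ((q in comm) && (steps < 64)) {
                if (comm[q] ~ /^sge_shepherd/) { state = "bound"; break }
                q = ppid[q]
                steps++
            }
            print p, state
        }
    }'
}

# Usage on a node while the job is running:
#   ps -e -o pid= -o ppid= -o comm= | check_orted_parents
```

A daemonized orted is re-parented (typically to init), so it would be reported as "unbound" here.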

It seems the SGE integration is broken, and it would indeed be better to stay with 1.2.8 for now :-/

-- Reuti

I still do not know what is going on with the errno=2 issue. However, the use of --daemonize does seem wrong and we will fix that. I have created a ticket to track it.


Also, I would not say that SGE integration is completely broken in 1.3. Rather, assuming you do not run into the errno=2 issues, the main issue is that Open MPI does not properly account for the MPI job. It does gather up the allocation and run the job.



