On 02/02/09 06:12, Reuti wrote:
On 02.02.2009 at 11:31, Sangamesh B wrote:

On Mon, Feb 2, 2009 at 12:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 02.02.2009 at 05:44, Sangamesh B wrote:

On Sun, Feb 1, 2009 at 10:37 PM, Reuti <re...@staff.uni-marburg.de> wrote:

On 01.02.2009 at 16:00, Sangamesh B wrote:

On Sat, Jan 31, 2009 at 6:27 PM, Reuti <re...@staff.uni-marburg.de>
wrote:

On 31.01.2009 at 08:49, Sangamesh B wrote:

On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de>
wrote:

On 30.01.2009 at 15:02, Sangamesh B wrote:

Dear Open MPI,

Do you have a solution for the following problem with Open MPI (1.3)
when it is run through Grid Engine?

I changed the global execd_params with H_MEMORYLOCKED=infinity and
restarted the sgeexecd on all nodes.
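For reference, the change was roughly along these lines (the init-script
path is an assumption and may differ per installation):

  # Edit the global cluster configuration as the SGE admin and set/extend:
  qconf -mconf
  #   execd_params   H_MEMORYLOCKED=infinity

  # Then restart the execd on every compute node (path is an assumption):
  /etc/init.d/sgeexecd stop && /etc/init.d/sgeexecd start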

But still the problem persists:

$ cat err.77.CPMD-OMPI
ssh_exchange_identification: Connection closed by remote host

I think this might already be the reason why it's not working. Does an
mpihello program run fine through SGE?

No.

Any Open MPI parallel job through SGE runs only if it runs on a
single node (i.e. 8 processes on 8 cores of a single node). If the number
of processes is more than 8, then SGE will schedule it on 2 nodes, and
the job fails with the above error.

Now I did a loose integration of Open MPI 1.3 with SGE. The job runs,
but all 16 processes run on a single node.

What are the entries in `qconf -sconf` for:

rsh_command
rsh_daemon

$ qconf -sconf
global:
execd_spool_dir              /opt/gridengine/default/spool
...
.....
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd
qrsh_daemon                  /usr/sbin/sshd
reprioritize                 0

Do you have to use ssh? Often in a private cluster the rsh-based method
is fine, or with SGE 6.2 the built-in mechanism of SGE (see the sketch
after the link below). Otherwise please follow this:

http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
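As a rough sketch, switching to the built-in mechanism of SGE 6.2 would mean
changing these entries in the global configuration (to be done as the SGE
admin; please check the SGE documentation for the exact list):

  # Use SGE 6.2's built-in startup mechanism instead of ssh:
  qconf -mconf
  #   qrsh_command     builtin
  #   rsh_command      builtin
  #   rlogin_command   builtin
  #   qrsh_daemon      builtin
  #   rsh_daemon       builtin
  #   rlogin_daemon    builtin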


I think it's better to check once with Open MPI 1.2.8

What is your mpirun command in the job script - are you getting the
mpirun from Open MPI there? According to the output below, it's not a
loose integration, but you already prepare a machinefile, which is
superfluous for Open MPI.

No. I've not prepared the machinefile for Open MPI.
For the tight integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS
$CPMDBIN/cpmd311-ompi-mkl.x  wf1.in $PP_LIBRARY >
wf1.out_OMPI$NSLOTS.$JOB_ID

For the loose integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS -hostfile
$TMPDIR/machines  $CPMDBIN/cpmd311-ompi-mkl.x  wf1.in $PP_LIBRARY >
wf1.out_OMPI_$JOB_ID.$NSLOTS
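For completeness, the full tight integration job script looks roughly like
this (the PE name "orte" and the #$ directives are assumptions; the mpirun
line is the one shown above):

  #!/bin/bash
  #$ -N cpmd_ompi
  #$ -cwd
  #$ -pe orte 16            # hypothetical PE name; request 16 slots

  # With a tight integration (Open MPI built --with-sge) no machinefile is
  # needed: mpirun picks up the allocation that SGE granted to the job.
  /opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS \
      $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID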

a) Did you compile Open MPI with "--with-sge"?

Yes. But ompi_info shows only one SGE component:

$ /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
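For reference, the build was roughly along these lines (the Intel compiler
variables are assumptions; the prefix is the one used above):

  # Sketch of a build with SGE support; compiler settings are assumptions.
  ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge \
              CC=icc CXX=icpc F77=ifort FC=ifort
  make all install

  # Afterwards check for the gridengine component:
  /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine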

b) When the $SGE_ROOT variable is set, Open MPI will use a Tight
Integration automatically.

In the SGE job submission script, I set SGE_ROOT= <nothing>

This will set the variable to an empty string. You need to use:

unset SGE_ROOT
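The difference is easy to check in the shell (illustrative only):

  SGE_ROOT=                 # the variable still exists, with an empty value
  echo ${SGE_ROOT+set}      # prints "set" - the variable counts as set

  unset SGE_ROOT            # the variable is removed from the environment
  echo ${SGE_ROOT+set}      # prints nothing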

Right.
I used 'unset SGE_ROOT' in the job submission script. It's working now:
hello world jobs run on single and multiple nodes.

Thank you for the help.

What can be the problem with tight integration?

There are obviously two issues for now with the Tight Integration for SGE:

- Some processes might throw an "err=2" for an unknown reason, and only from time to time, but they run fine.

- Processes vanish into a daemon although SGE's qrsh is used automatically (successive `ps -e f` calls show that it's called with "... orted --daemonize ..." for a short while). I overlooked this in my last post when I stated it's working, as my process allocation was fine; only the processes weren't bound to any sge_shepherd.
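A crude way to watch for this (illustrative only; run on a compute node while
the job starts):

  # Sample the process tree repeatedly to catch the short-lived
  # "orted --daemonize" call and to check whether the MPI processes
  # stay attached to an sge_shepherd:
  for i in $(seq 1 30); do
      ps -e f | grep -E 'orted|sge_shepherd' | grep -v grep
      sleep 1
  done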

It seems the SGE integration is broken, and it would indeed be better to stay with 1.2.8 for now :-/

-- Reuti

I still do not know what is going on with the errno=2 issue. However, the use of --daemonize does seem wrong and we will fix that. I have created a ticket to track it.

https://svn.open-mpi.org/trac/ompi/ticket/1783

Also, I would not say that SGE integration is completely broken in 1.3. Rather, assuming you do not run into the errno=2 issues, the main issue is that Open MPI does not properly account for the MPI job. It does gather up the allocation and run the job.

Rolf

--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
