
the daemons will fork into daemon land - no accounting, no control by SGE via qdel (nevertheless it runs, just not tightly integrated):


-- Reuti

On 26.02.2009 at 06:13, Sangamesh B wrote:

Hello Reuti,

   I'm sorry for the late response.

On Mon, Jan 26, 2009 at 7:11 PM, Reuti <re...@staff.uni-marburg.de> wrote:
On 25.01.2009 at 06:16, Sangamesh B wrote:

Thanks Reuti for the reply.

On Sun, Jan 25, 2009 at 2:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:

On 24.01.2009 at 17:12, Jeremy Stout wrote:

The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
Engine. You can find more information and several remedies here:

I usually resolve this problem by adding "ulimit -l unlimited" near
the top of the SGE startup script on the computation nodes and
restarting SGE on every node.
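
A minimal sketch of the check behind that advice, assuming a Bourne-style shell on the compute node. The helper name memlock_status is hypothetical; it only reports whether a locked-memory limit (by default the soft limit of the current shell) is unlimited. Where the "ulimit -l unlimited" line itself goes depends on the installation; the execd startup script is often $SGE_ROOT/$SGE_CELL/common/sgeexecd or a copy of it under /etc/init.d, but treat that path as an assumption.

```shell
#!/bin/bash
# Report a locked-memory limit. With no argument the soft limit of the
# current shell is used; `ulimit -l` prints it in kilobytes, or the word
# "unlimited". Returns 0 only when the limit is unlimited.
memlock_status() {
    local limit="${1:-$(ulimit -l)}"
    if [ "$limit" = "unlimited" ]; then
        echo "memlock: unlimited"
        return 0
    fi
    echo "memlock: ${limit} kB"
    return 1
}

# In the execd startup script, before sge_execd is started (needs root):
#   ulimit -l unlimited
```

Calling memlock_status from inside a small test job shows which limit the job actually inherits from sge_execd, which can differ from what an interactive login shell on the same node reports.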

Did you request/set any limits with SGE's h_vmem/h_stack resource

Was this also your problem:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442

I've not posted that mail. But the same setting is not working for me:

$ qconf -sconf
execd_params                 H_MEMORYLOCKED=infinity

But I'm using "unset SGE_ROOT" (as you suggested) inside the SGE job
submission script, with a loose integration of Open MPI with SGE. It's
working fine.
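
For readers trying the same workaround, here is a sketch of what such a loosely integrated job script could look like. With SGE_ROOT unset, Open MPI should no longer detect SGE, so the host list has to be passed explicitly; here it is built from $PE_HOSTFILE, whose lines start with a host name and a slot count. The helper name pe_to_hostfile, the machines.$JOB_ID file name and the ./hello binary are made up for illustration; the PE name orte and the mpirun path are the ones from this thread.

```shell
#!/bin/bash
#$ -pe orte 8
#$ -cwd

# Convert an SGE PE hostfile ("host slots queue processor-range" per line)
# into Open MPI hostfile syntax ("host slots=N" per line).
pe_to_hostfile() {
    awk '{ print $1 " slots=" $2 }' "$1"
}

# Inside the job (kept as comments so the helper can be tried on its own):
#   pe_to_hostfile "$PE_HOSTFILE" > machines.$JOB_ID
#   unset SGE_ROOT    # assumption: Open MPI then does not detect SGE
#   /opt/mpi/openmpi/1.3/intel/bin/mpirun -np "$NSLOTS" \
#       -hostfile machines.$JOB_ID ./hello
```

Processes started this way are launched over ssh, so they run outside SGE's accounting and are not controlled by qdel.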

I'm curious to know why Open MPI 1.3 is not working with tight
integration into SGE 6.0U8 on a Rocks 4.3 cluster.

On another cluster, Open MPI 1.3 works well with tight integration into SGE.

Thanks a lot,


-- Reuti


The queue in use is as follows:
# qconf -sq ib.q
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# qconf -sp orte
pe_name           orte
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local

The hostnames for the IB interfaces are of the form ibc0, ibc1, ..., ibc22.

Is this difference causing the problem?

ssh issues:
between master and node: works fine, but with some delay.

between nodes: works fine, no delay.

From the command line, the Open MPI jobs ran with no errors, even though
the master node is not used in the hostfile.


-- Reuti

Jeremy Stout

On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:

Hello all,

Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE support,
i.e. it was configured using --with-sge.
But ompi_info shows only one gridengine component:
# /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

Is this right? I ask because the SGE qmaster daemon was not running
during the Open MPI installation.

Now the problem is that Open MPI parallel jobs submitted through
Grid Engine fail (when run on multiple nodes) with the error:

$ cat err.26.Helloworld-PRL
ssh_exchange_identification: Connection closed by remote host

--------------------------------------------------------------------------
A daemon (pid 8462) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

When the job runs on a single node, it runs fine and produces the
output, but with an error:
$ cat err.23.Helloworld-PRL
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   node-0-4.local
 Local device: mthca0

--------------------------------------------------------------------------
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
 This will severely limit memory registrations.
[node-0-4.local:07869] 7 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
[node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages

What could be causing this behavior?

users mailing list
