>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>> Engine. You can find more information and several remedies here:
>> I usually resolve this problem by adding "ulimit -l unlimited" near
>> the top of the SGE startup script on the computation nodes and
>> restarting SGE on every node.
> Did you request/set any limits with SGE's h_vmem/h_stack resource request?

The queue in use is configured as follows:
# qconf -sq ib.q
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# qconf -sp orte
pe_name           orte
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local

The hostnames for the IB interfaces are different: ibc0, ibc1, ..., ibc22.

Could this difference be causing the problem?

ssh status:
between master and nodes: works fine, but with some delay.
between nodes: works fine, no delay.
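
For anyone who wants to repeat the ssh check over the whole host group, here is a minimal sketch, not something taken from my setup as-is. It assumes bash and awk on the master node. expand_hostlist is a made-up helper name; it turns the `qconf -shgrp` output shown above into one host name per line.

```shell
#!/bin/bash
# expand_hostlist (hypothetical helper): read `qconf -shgrp <group>` output
# on stdin and print one host name per line.
expand_hostlist() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # skip the two keywords, the group name itself (@...) and the
            # trailing backslash used for line continuation
            if ($i != "hostlist" && $i != "group_name" && $i != "\\" && $i !~ /^@/)
                print $i
        }
    }'
}

# Possible use (assumption: passwordless ssh is already set up):
#   qconf -shgrp @ibhosts | expand_hostlist | while read -r h; do
#       echo "== $h"
#       ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$h" 'ulimit -l'
#   done
```

The -n option keeps ssh from reading the host list on stdin inside the loop. A host that hangs or asks for a password should show up as a timeout or failure instead of blocking the run.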

From the command line, the Open MPI jobs ran with no errors, even
though the master node is not listed in the hostfile.
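
To compare the locked-memory limit seen from the command line with the one seen inside an SGE job, a small script like the following could be run both ways (directly in a login shell on a node, and submitted through qsub to ib.q). This is only a sketch assuming bash; report_memlock is a name I made up for illustration.

```shell
#!/bin/bash
# report_memlock (hypothetical helper): print the soft and hard
# locked-memory limits (RLIMIT_MEMLOCK) of the current shell.
# bash reports these in kbytes, or as "unlimited".
report_memlock() {
    soft=$(ulimit -S -l)
    hard=$(ulimit -H -l)
    echo "host=$(uname -n) memlock_soft=${soft} memlock_hard=${hard}"
}

report_memlock
```

If the value printed inside the SGE job is small (the warnings quoted below report 32768 bytes) while the login shell prints "unlimited", that would suggest the limit is inherited from the SGE execution daemon, which is what the "ulimit -l unlimited" suggestion above is aimed at.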


>>> Hello all,
>>>  Open MPI 1.3 is installed on Rocks 4.3 Linux cluster with support of
>>> SGE i.e using --with-sge.
>>> But the ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>                MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>> Is this right? Because during ompi installation SGE qmaster daemon was
>>> not working.
>>> Now the problem is that the Open MPI parallel jobs submitted through
>>> gridengine are failing (when run on multiple nodes) with the error:
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>> There may be more information reported by the environment (see above).
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>> When the job runs on a single node, it runs and produces the
>>> output, but with an error:
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>  Local host:   node-0-4.local
>>>  Local device: mthca0
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message
>>> help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>> What may be the problem for this behavior?
>>> Thanks,
>>> Sangamesh
