Thanks Reuti for the reply.

On Sun, Jan 25, 2009 at 2:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> Am 24.01.2009 um 17:12 schrieb Jeremy Stout:
>
>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>> Engine. You can find more information and several remedies here:
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> I usually resolve this problem by adding "ulimit -l unlimited" near
>> the top of the SGE startup script on the computation nodes and
>> restarting SGE on every node.
>
> Did you request/set any limits with SGE's h_vmem/h_stack resource request?
>
No.
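
(For reference, Jeremy's remedy above would look roughly like this on
each compute node. Just a sketch, assuming the execd startup script is
/etc/init.d/sgeexecd, as on a stock Rocks/SGE install:

# add near the top of /etc/init.d/sgeexecd, before sge_execd is started
ulimit -l unlimited

# then restart the execd on every node so new jobs inherit the limit
/etc/init.d/sgeexecd stop
/etc/init.d/sgeexecd start
)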

The queue in use is as follows:
qconf -sq ib.q
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# qconf -sp orte
pe_name           orte
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local
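
(To check what locked-memory limit jobs going through this queue
actually see, one could submit a quick test job. Just a sketch,
assuming qsub is in the PATH and using the ib.q queue shown above:

$ echo 'ulimit -l' | qsub -q ib.q -cwd -j y -o memlock_check.out
$ cat memlock_check.out   # after it has run; should say "unlimited" once the fix is in
)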

The hostnames on the IB interfaces are different: ibc0, ibc1, ..., ibc22.

Is this mismatch causing the problem?

ssh behaviour:
between master & node: works fine, but with some delay.
between nodes: works fine, no delay.

From the command line the Open MPI jobs run with no error, even when
the master node is not listed in the hostfile.
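
(To rule out ssh itself, a quick loop like this could be run from the
master node. Just a sketch, assuming bash and the node names from
@ibhosts above:

for n in node-0-{0..14}.local node-0-{16..22}.local; do
    ssh -o BatchMode=yes "$n" hostname || echo "FAILED: $n"
done
)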

Thanks,
Sangamesh

> -- Reuti
>
>
>> Jeremy Stout
>>
>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:
>>>
>>> Hello all,
>>>
>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>> support, i.e. built using --with-sge.
>>> But the ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>                MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>
>>> Is this right? The SGE qmaster daemon was not running during the
>>> Open MPI installation.
>>>
>>> Now the problem is that the Open MPI parallel jobs submitted through
>>> gridengine fail (when run on multiple nodes) with the error:
>>>
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> When the job runs on a single node, it runs fine and produces the
>>> output, but with an error:
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>  Local host:   node-0-4.local
>>>  Local device: mthca0
>>>
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>   This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message
>>> help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>>
>>> What could be causing this behavior?
>>>
>>> Thanks,
>>> Sangamesh
