Thanks Reuti for the reply.

On Sun, Jan 25, 2009 at 2:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 24.01.2009 at 17:12, Jeremy Stout wrote:
>
>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>> Engine. You can find more information and several remedies here:
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> I usually resolve this problem by adding "ulimit -l unlimited" near
>> the top of the SGE startup script on the computation nodes and
>> restarting SGE on every node.
>
> Did you request/set any limits with SGE's h_vmem/h_stack resource request?
>
No.

The queue in use is configured as follows:

qconf -sq ib.q
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# qconf -sp orte
pe_name               orte
slots                 999
user_lists            NONE
xuser_lists           NONE
start_proc_args       /bin/true
stop_proc_args        /bin/true
allocation_rule       $fill_up
control_slaves        TRUE
job_is_first_task     FALSE
urgency_slots         min

# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local

The hostnames for the IB interfaces are different: ibc0, ibc1, ... ibc22.
Is this difference causing the problem?

ssh issues:
between master & node: works fine, but with some delay.
between nodes: works fine, no delay.

From the command line, the Open MPI jobs run with no error, even though
the master node is not used in the hostfile.

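Since the jobs behave differently from the command line and under SGE, one way to compare the two environments is to print the locked-memory limit from inside each. The following is only a minimal sketch in Python, not something taken from the cluster setup above: the function name describe_memlock and the WARN_THRESHOLD constant are made up for illustration, and the threshold simply reuses the 32768-byte value that libibverbs reports in the quoted log.

```python
import resource

# Assumption: reuse the 32768-byte value that libibverbs complains about
# in the quoted log as the "low" mark. This is not an official threshold.
WARN_THRESHOLD = 32768


def describe_memlock(soft, hard):
    """Summarize a (soft, hard) RLIMIT_MEMLOCK pair as one line of text."""
    def fmt(value):
        # RLIM_INFINITY is what "ulimit -l unlimited" shows up as.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value} bytes"

    low = soft != resource.RLIM_INFINITY and soft <= WARN_THRESHOLD
    status = "low" if low else "ok"
    return f"RLIMIT_MEMLOCK soft={fmt(soft)} hard={fmt(hard)} ({status})"


if __name__ == "__main__":
    # Query the limit of the current process, i.e. whatever the shell or
    # the SGE execution daemon handed down to it.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(describe_memlock(soft, hard))
```

Running this once from an interactive shell on a compute node and once as a job submitted through SGE should show whether the limit is being lowered only for jobs started by the SGE daemons.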
Thanks,
Sangamesh

> -- Reuti
>
>
>> Jeremy Stout
>>
>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:
>>>
>>> Hello all,
>>>
>>> Open MPI 1.3 is installed on Rocks 4.3 Linux cluster with support of
>>> SGE i.e using --with-sge.
>>> But the ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>
>>> Is this right? Because during ompi installation SGE qmaster daemon was
>>> not working.
>>>
>>> Now the problem is, the open mpi parallel jobs submitted thru
>>> gridengine are failing (when run on multiple nodes) with the error:
>>>
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> When the job runs on single node, it runs well with producing the
>>> output but with an error:
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: node-0-4.local
>>> Local device: mthca0
>>>
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message
>>> help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>>
>>> What may be the problem for this behavior?
>>>
>>> Thanks,
>>> Sangamesh
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users