Hello Reuti, I'm sorry for the late response.
On Mon, Jan 26, 2009 at 7:11 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 25.01.2009, at 06:16, Sangamesh B wrote:
>
>> Thanks Reuti for the reply.
>>
>> On Sun, Jan 25, 2009 at 2:22 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>> On 24.01.2009, at 17:12, Jeremy Stout wrote:
>>>
>>>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>>>> Engine. You can find more information and several remedies here:
>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>
>>>> I usually resolve this problem by adding "ulimit -l unlimited" near
>>>> the top of the SGE startup script on the computation nodes and
>>>> restarting SGE on every node.
>>>
>>> Did you request/set any limits with SGE's h_vmem/h_stack resource
>>> request?
>
> Was this also your problem:
>
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442

I did not post that mail, but the same setting is not working for me:

$ qconf -sconf
global:
..
execd_params             H_MEMORYLOCKED=infinity
..

However, with "unset SGE_ROOT" (as you suggested) inside the SGE job
submission script and a loose integration of Open MPI with SGE, it works
fine (a rough sketch of that job script is included below, after the quoted
configuration). I'm curious to know why Open MPI 1.3 does not work with
tight integration to SGE 6.0u8 on this Rocks 4.3 cluster, while on another
cluster Open MPI 1.3 works well with tight integration to SGE.

Thanks a lot,
Sangamesh

> -- Reuti
>
>>>
>> No.
>>
>> The used queue is as follows:
>> qconf -sq ib.q
>> qname                 ib.q
>> hostlist              @ibhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               orte
>> rerun                 FALSE
>> slots                 8
>> tmpdir                /tmp
>> shell                 /bin/bash
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      unix_behavior
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> # qconf -sp orte
>> pe_name               orte
>> slots                 999
>> user_lists            NONE
>> xuser_lists           NONE
>> start_proc_args       /bin/true
>> stop_proc_args        /bin/true
>> allocation_rule       $fill_up
>> control_slaves        TRUE
>> job_is_first_task     FALSE
>> urgency_slots         min
>>
>> # qconf -shgrp @ibhosts
>> group_name @ibhosts
>> hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
>>          node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
>>          node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
>>          node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
>>          node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
>>          node-0-21.local node-0-22.local
>>
>> The hostnames for the IB interface are like ibc0, ibc1, ..., ibc22.
>> Is this difference causing the problem?
>>
>> ssh issues:
>> Between master and node: works fine, but with some delay.
>> Between nodes: works fine, no delay.
>>
>> From the command line the Open MPI jobs ran with no error, even when the
>> master node is not used in the hostfile.
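For reference, a minimal sketch of the loose-integration job script I
described above could look like the following (the application name
"helloworld" is a placeholder, the slot count is just an example, and the
$PE_HOSTFILE handling assumes its usual "host slots queue processors"
format; the mpirun path is the one from the quoted post):

#!/bin/bash
#$ -S /bin/bash
#$ -N Helloworld-PRL
#$ -cwd
#$ -pe orte 8

# Loose integration: hide SGE from Open MPI so mpirun launches its daemons
# via ssh instead of "qrsh -inherit".
unset SGE_ROOT

# Build an Open MPI hostfile from the nodes SGE allocated to this job.
# Each line of $PE_HOSTFILE is: <hostname> <slots> <queue> <processor range>
awk '{print $1 " slots=" $2}' "$PE_HOSTFILE" > hosts.$JOB_ID

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS --hostfile hosts.$JOB_ID \
    ./helloworld

rm -f hosts.$JOB_ID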
>> Thanks,
>> Sangamesh
>>
>>> -- Reuti
>>>
>>>> Jeremy Stout
>>>>
>>>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum....@gmail.com> wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>>>> support, i.e. using --with-sge.
>>>>> But ompi_info shows only one component:
>>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>>>          MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>>>
>>>>> Is this right? Because during the Open MPI installation the SGE
>>>>> qmaster daemon was not running.
>>>>>
>>>>> Now the problem is that the Open MPI parallel jobs submitted through
>>>>> gridengine fail (when run on multiple nodes) with the error:
>>>>>
>>>>> $ cat err.26.Helloworld-PRL
>>>>> ssh_exchange_identification: Connection closed by remote host
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>>>> to launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>> the location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>> When the job runs on a single node, it runs well and produces the
>>>>> output, but with an error:
>>>>> $ cat err.23.Helloworld-PRL
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>     This will severely limit memory registrations.
>>>>> [the warning above is repeated several times, once per local process]
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>   Local host:   node-0-4.local
>>>>>   Local device: mthca0
>>>>> --------------------------------------------------------------------------
>>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>>> help-mpi-btl-openib.txt / error in device init
>>>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>>>> 0 to see all help / error messages
>>>>>
>>>>> What may be the problem for this behavior?
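Regarding the RLIMIT_MEMLOCK warnings quoted above: as I understand the FAQ
and Jeremy's suggestion, the fix on the compute nodes would roughly look
like the following (a sketch only; the exact path of the sgeexecd startup
script on our Rocks nodes is an assumption):

# On every compute node (as root): allow unlimited locked memory for
# ssh/PAM logins.
echo "* soft memlock unlimited" >> /etc/security/limits.conf
echo "* hard memlock unlimited" >> /etc/security/limits.conf

# For processes started by SGE, either add "ulimit -l unlimited" near the
# top of the sgeexecd startup script (e.g. /etc/init.d/sgeexecd) and
# restart sgeexecd on every node, or set it once cluster-wide:
qconf -mconf
# and add/extend the line:
#   execd_params   H_MEMORYLOCKED=infinity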
>>>>> Thanks,
>>>>> Sangamesh
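P.S.: One quick check (a sketch, not something I have run yet) to see
whether the H_MEMORYLOCKED setting actually reaches processes started by
SGE, compared to a plain ssh login:

# Locked-memory limit seen by a job started by sge_execd (qsub reads the
# one-line script from stdin):
echo 'ulimit -l' | qsub -cwd -o memlock_batch.out -j y

# Locked-memory limit seen over plain ssh (the path the loose integration
# uses):
ssh node-0-4 'ulimit -l'

If the batch job still reports 32768, the limit set via execd_params is not
reaching the execd-started processes.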