On Sat, Jan 31, 2009 at 6:27 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Am 31.01.2009 um 08:49 schrieb Sangamesh B:
>
>> On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>> Am 30.01.2009 um 15:02 schrieb Sangamesh B:
>>>
>>>> Dear Open MPI,
>>>>
>>>> Do you have a solution for the following problem of Open MPI (1.3)
>>>> when run through Grid Engine?
>>>>
>>>> I changed the global execd params with H_MEMORYLOCKED=infinity and
>>>> restarted sgeexecd on all nodes.
>>>>
>>>> But still the problem persists:
>>>>
>>>> $ cat err.77.CPMD-OMPI
>>>> ssh_exchange_identification: Connection closed by remote host
>>>
>>> I think this might already be the reason why it's not working. An mpihello
>>> program is running fine through SGE?
>>>
>> No.
>>
>> Any Open MPI parallel job through SGE runs only if it's running on a
>> single node (i.e. 8 processes on 8 cores of a single node). If the number
>> of processes is more than 8, then SGE will schedule it on 2 nodes -
>> and the job will fail with the above error.
>>
>> Now I did a loose integration of Open MPI 1.3 with SGE. The job runs,
>> but all 16 processes run on a single node.
>
> What are the entries in `qconf -sconf` for:
>
> rsh_command
> rsh_daemon
>
$ qconf -sconf
global:
execd_spool_dir              /opt/gridengine/default/spool
...
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd
qrsh_daemon                  /usr/sbin/sshd
reprioritize                 0
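Given that qrsh_command/rsh_command point at ssh and rsh_daemon at sshd, the
"ssh_exchange_identification: Connection closed by remote host" message usually
means the remote sshd dropped the connection before authentication - for example
because it is unreachable, blocked by TCP wrappers, or hitting a connection limit.
A quick sanity check from the submit host could look like the sketch below; the
node names are only examples taken from the error output further down and should
be replaced with the hosts of the failing job:

$ for h in node-0-19 node-0-20 node-0-21 node-0-22; do ssh -o BatchMode=yes $h true && echo "$h: ssh OK" || echo "$h: ssh FAILED"; done

If plain ssh between the nodes already fails here, the Open MPI/SGE side is
probably not the culprit.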
I think it's better to check once with Open MPI 1.2.8.

> What is your mpirun command in the jobscript - you are getting there the
> mpirun from Open MPI? According to the output below, it's not a loose
> integration, but you already prepare a machinefile, which is superfluous for
> Open MPI.
>
No. I've not prepared the machinefile for Open MPI.

For the tight integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI$NSLOTS.$JOB_ID

For the loose integration job:

/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines $CPMDBIN/cpmd311-ompi-mkl.x wf1.in $PP_LIBRARY > wf1.out_OMPI_$JOB_ID.$NSLOTS

I think I should check with Open MPI 1.2.8. That may work.

Thanks,
Sangamesh
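For reference, with the gridengine support compiled in, a tight-integration
jobscript needs neither a machinefile nor a -hostfile option; mpirun picks up
the host list and slot counts from SGE itself. A minimal sketch - the PE name
"orte" and the installation prefix are taken from this thread, while the lib
directory and the ./mpihello test binary are only assumed placeholders:

#!/bin/bash
#$ -N Hello-OMPI
#$ -pe orte 16
#$ -cwd -j y
# assumed library location; make sure it is also valid on the remote nodes
export LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3/intel/lib:$LD_LIBRARY_PATH
# no machinefile here - Open MPI reads the allocation from SGE
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS ./mpihello

If such a hello job already misbehaves as soon as two nodes are involved, the
problem is likely in the ssh/sshd startup path rather than in the application.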
>> $ cat out.83.Hello-OMPI
>> /opt/gridengine/default/spool/node-0-17/active_jobs/83.1/pe_hostfile
>> ibc17
>> ibc17
>> ibc17
>> ibc17
>> ibc17
>> ibc17
>> ibc17
>> ibc17
>> ibc12
>> ibc12
>> ibc12
>> ibc12
>> ibc12
>> ibc12
>> ibc12
>> ibc12
>> Greetings: 1 of 16 from the node node-0-17.local
>> Greetings: 10 of 16 from the node node-0-17.local
>> Greetings: 15 of 16 from the node node-0-17.local
>> Greetings: 9 of 16 from the node node-0-17.local
>> Greetings: 14 of 16 from the node node-0-17.local
>> Greetings: 8 of 16 from the node node-0-17.local
>> Greetings: 11 of 16 from the node node-0-17.local
>> Greetings: 12 of 16 from the node node-0-17.local
>> Greetings: 6 of 16 from the node node-0-17.local
>> Greetings: 0 of 16 from the node node-0-17.local
>> Greetings: 5 of 16 from the node node-0-17.local
>> Greetings: 3 of 16 from the node node-0-17.local
>> Greetings: 13 of 16 from the node node-0-17.local
>> Greetings: 4 of 16 from the node node-0-17.local
>> Greetings: 7 of 16 from the node node-0-17.local
>> Greetings: 2 of 16 from the node node-0-17.local
>>
>> But qhost -u <user name> shows that it is scheduled/running on two nodes.
>>
>> Anybody successful in running Open MPI 1.3 tightly integrated with SGE?
>
> For a Tight Integration there's a FAQ:
>
> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
>
> -- Reuti
>
>> Thanks,
>> Sangamesh
>>
>>> -- Reuti
>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 31947) died unexpectedly with status 129 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> ssh_exchange_identification: Connection closed by remote host
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> node-0-19.local - daemon did not report back when launched
>>>> node-0-20.local - daemon did not report back when launched
>>>> node-0-21.local - daemon did not report back when launched
>>>> node-0-22.local - daemon did not report back when launched
>>>>
>>>> The hostnames for the InfiniBand interfaces are ibc0, ibc1, ibc2 .. ibc23.
>>>> Maybe Open MPI is not able to identify the hosts, as it is using node-0-.. .
>>>> Is this causing Open MPI to fail?
>>>>
>>>> Thanks,
>>>> Sangamesh
>>>>
>>>> On Mon, Jan 26, 2009 at 5:09 PM, mihlon <vacl...@fel.cvut.cz> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> Hello SGE users,
>>>>>>
>>>>>> The cluster is installed with Rocks-4.3, SGE 6.0 & Open MPI 1.3.
>>>>>> Open MPI is configured with "--with-sge".
>>>>>> ompi_info shows only one component:
>>>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>>>>
>>>>>> Is this acceptable?
>>>>>
>>>>> maybe yes
>>>>>
>>>>> see: http://www.open-mpi.org/faq/?category=building#build-rte-sge
>>>>>
>>>>> shell$ ompi_info | grep gridengine
>>>>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.3)
>>>>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.3)
>>>>>
>>>>> (Specific frameworks and version numbers may vary, depending on your
>>>>> version of Open MPI.)
>>>>>
>>>>>> The Open MPI parallel jobs run successfully through the command line, but
>>>>>> fail when run through SGE (with -pe orte <slots>).
>>>>>>
>>>>>> The error is:
>>>>>>
>>>>>> $ cat err.26.Helloworld-PRL
>>>>>> ssh_exchange_identification: Connection closed by remote host
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>>>>> to launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>>
>>>>>> But the same job runs well if it runs on a single node, although with an
>>>>>> error:
>>>>>>
>>>>>> $ cat err.23.Helloworld-PRL
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>>
>>>>>> Local host:   node-0-4.local
>>>>>> Local device: mthca0
>>>>>> --------------------------------------------------------------------------
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> [node-0-4.local:07869] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
>>>>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>
>>>>>> The following link explains the same problem:
>>>>>>
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=72398
>>>>>>
>>>>>> With this reference, I put 'ulimit -l unlimited' into
>>>>>> /etc/init.d/sgeexecd on all nodes and restarted the services.
>>>>>
>>>>> Do not set 'ulimit -l unlimited' in /etc/init.d/sgeexecd,
>>>>> but set it in SGE instead:
>>>>>
>>>>> Run qconf -mconf and set execd_params
>>>>>
>>>>> frontend$> qconf -sconf
>>>>> ...
>>>>> execd_params                 H_MEMORYLOCKED=infinity
>>>>> ...
>>>>>
>>>>> Then restart all your sgeexecd hosts.
>>>>>
>>>>> Milan
>>>>>
>>>>>> But still the problem persists.
>>>>>>
>>>>>> What could be the way out for this?
>>>>>>
>>>>>> Thanks,
>>>>>> Sangamesh
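To confirm that the H_MEMORYLOCKED setting really reaches the execution
daemons, something along these lines can be used - a sketch only, assuming
qrsh jobs are allowed on the cluster and that sgeexecd has been restarted on
every node after the change:

$ qconf -mconf                      # global config: add  execd_params  H_MEMORYLOCKED=infinity
$ qconf -sconf | grep execd_params
execd_params                 H_MEMORYLOCKED=infinity
$ qrsh bash -c 'ulimit -l'          # check the limit as a job on a compute node sees it

With the parameter in effect, the last command should report "unlimited"
instead of 32768, and the libibverbs RLIMIT_MEMLOCK warnings should disappear.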