Hi,

On 21.05.2010, at 17:19, Eloi Gaudry wrote:
> Hi Reuti,
>
> Yes, the openmpi binaries used were built after having used the --with-sge
> option during configure, and we only use those binaries on our cluster.
>
> [eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
>                MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

Ok. As you have Tight Integration as your goal and set "control_slaves TRUE" in your PE, SGE wouldn't allow `qrsh -inherit ...` to nodes which are not in the list of granted nodes. So it looks like your job is running outside of this Tight Integration, with its own `rsh` or `ssh`. Do you reset $JOB_ID or other environment variables in your jobscript, which could trigger Open MPI to assume that it's not running inside SGE? (A quick check is sketched after the quoted message below.)

-- Reuti

>
> On Friday 21 May 2010 16:01:54 Reuti wrote:
>> Hi,
>>
>> On 21.05.2010, at 14:11, Eloi Gaudry wrote:
>>> Hi there,
>>>
>>> I'm observing something strange on our cluster managed by SGE 6.2u4 when
>>> launching a parallel computation on several nodes, using the OpenMPI/SGE
>>> tight-integration mode (OpenMPI-1.3.3). It seems that the SGE-allocated
>>> slots are not used by OpenMPI, as if OpenMPI was doing its own
>>> round-robin allocation based on the allocated node hostnames.
>>
>> you compiled Open MPI with --with-sge (and recompiled your applications)?
>> You are using the correct mpiexec?
>>
>> -- Reuti
>>
>>> Here is what I'm doing:
>>> - launch a parallel computation involving 8 processes, each of them
>>>   using 14GB of memory. I'm using a qsub command where I request the
>>>   memory_free resource and use tight integration with openmpi
>>> - 3 servers are available:
>>>   . barney with 4 cores (4 slots) and 32GB
>>>   . carl with 4 cores (4 slots) and 32GB
>>>   . charlie with 8 cores (8 slots) and 64GB
>>>
>>> Here is the output of the allocated nodes (OpenMPI output):
>>> ====================== ALLOCATED NODES ======================
>>>
>>> Data for node: Name: charlie     Launch id: -1  Arch: ffc91200  State: 2
>>>   Daemon: [[44332,0],0]  Daemon launched: True
>>>   Num slots: 4           Slots in use: 0
>>>   Num slots allocated: 4 Max slots: 0
>>>   Username on node: NULL
>>>   Num procs: 0           Next node_rank: 0
>>>
>>> Data for node: Name: carl.fft    Launch id: -1  Arch: 0  State: 2
>>>   Daemon: Not defined    Daemon launched: False
>>>   Num slots: 2           Slots in use: 0
>>>   Num slots allocated: 2 Max slots: 0
>>>   Username on node: NULL
>>>   Num procs: 0           Next node_rank: 0
>>>
>>> Data for node: Name: barney.fft  Launch id: -1  Arch: 0  State: 2
>>>   Daemon: Not defined    Daemon launched: False
>>>   Num slots: 2           Slots in use: 0
>>>   Num slots allocated: 2 Max slots: 0
>>>   Username on node: NULL
>>>   Num procs: 0           Next node_rank: 0
>>>
>>> =================================================================
>>>
>>> Here is what I see when my computation is running on the cluster:
>>> # rank   pid     hostname
>>>   0      28112   charlie
>>>   1      11417   carl
>>>   2      11808   barney
>>>   3      28113   charlie
>>>   4      11418   carl
>>>   5      11809   barney
>>>   6      28114   charlie
>>>   7      11419   carl
>>>
>>> Note that the parallel environment used under SGE is defined as:
>>> [eg@moe:~]$ qconf -sp round_robin
>>> pe_name            round_robin
>>> slots              32
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $round_robin
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>>
>>> I'm wondering why OpenMPI didn't use the allocated nodes chosen by SGE
>>> (cf. the "ALLOCATED NODES" report) but instead placed each process of the
>>> parallel computation in turn, using a round-robin method.
>>>
>>> Note that I'm using the '--bynode' option on the orterun command line. If
>>> the behavior I'm observing is simply a consequence of using this option,
>>> please let me know. That would mean that SGE tight integration has a
>>> lower priority for orterun's behavior than the various command-line
>>> options.
>>>
>>> Any help would be appreciated,
>>> Thanks,
>>> Eloi
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
>
> Eloi Gaudry
>
> Free Field Technologies
> Axis Park Louvain-la-Neuve
> Rue Emile Francqui, 1
> B-1435 Mont-Saint Guibert
> BELGIUM
>
> Company Phone: +32 10 487 959
> Company Fax:   +32 10 454 626
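For reference, a minimal jobscript sketch to check whether the SGE environment that Open MPI's gridengine components look for is still intact when mpirun is called (a sketch only: the exact set of variables Open MPI checks can differ between versions, and "my_app" is a placeholder for the real binary):

  #!/bin/sh
  #$ -pe round_robin 8
  # Print the SGE variables before launching; do not unset or overwrite them,
  # otherwise Open MPI may fall back to its own rsh/ssh launch.
  echo "JOB_ID      = $JOB_ID"
  echo "SGE_ROOT    = $SGE_ROOT"
  echo "PE_HOSTFILE = $PE_HOSTFILE"
  cat "$PE_HOSTFILE"        # granted hosts and slot counts from SGE
  # Launch once without --bynode, so the mapping policy and the allocation
  # question can be told apart; under Tight Integration the ranks should
  # then stay on the granted nodes only.
  /opt/openmpi-1.3.3/bin/mpirun -np $NSLOTS ./my_app

If the hosts listed in $PE_HOSTFILE match the "ALLOCATED NODES" report but the ranks still end up elsewhere, the mapping step is the place to look; if $JOB_ID or $PE_HOSTFILE are empty at that point, Open MPI will not see the SGE allocation at all.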