Hi,

On 21.05.2010 at 14:11, Eloi Gaudry wrote:
> Hi there,
>
> I'm observing something strange on our cluster managed by SGE 6.2u4
> when launching a parallel computation on several nodes, using the
> OpenMPI/SGE tight-integration mode (OpenMPI 1.3.3). It seems that the
> SGE-allocated slots are not used by OpenMPI, as if OpenMPI were doing
> its own round-robin allocation based on the allocated node hostnames.

Did you compile Open MPI with --with-sge (and recompile your
applications)? Are you using the correct mpiexec? [1]

-- Reuti

> Here is what I'm doing:
> - launch a parallel computation involving 8 processors, each of them
>   using 14 GB of memory. I'm using a qsub command where I request the
>   memory_free resource and use tight integration with OpenMPI [2]
> - 3 servers are available:
>   . barney with 4 cores (4 slots) and 32 GB
>   . carl with 4 cores (4 slots) and 32 GB
>   . charlie with 8 cores (8 slots) and 64 GB
>
> Here is the output of the allocated nodes (OpenMPI output):
>
> ======================  ALLOCATED NODES  ======================
>
>  Data for node: Name: charlie     Launch id: -1  Arch: ffc91200  State: 2
>         Daemon: [[44332,0],0]  Daemon launched: True
>         Num slots: 4  Slots in use: 0
>         Num slots allocated: 4  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>  Data for node: Name: carl.fft    Launch id: -1  Arch: 0  State: 2
>         Daemon: Not defined  Daemon launched: False
>         Num slots: 2  Slots in use: 0
>         Num slots allocated: 2  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>  Data for node: Name: barney.fft  Launch id: -1  Arch: 0  State: 2
>         Daemon: Not defined  Daemon launched: False
>         Num slots: 2  Slots in use: 0
>         Num slots allocated: 2  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>
> =================================================================
>
> Here is what I see when my computation is running on the cluster:
>
>  # rank  pid    hostname
>    0     28112  charlie
>    1     11417  carl
>    2     11808  barney
>    3     28113  charlie
>    4     11418  carl
>    5     11809  barney
>    6     28114  charlie
>    7     11419  carl
>
> Note that the parallel environment used under SGE is defined as
> follows [3]:
>
> [eg@moe:~]$ qconf -sp round_robin
> pe_name            round_robin
> slots              32
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> I'm wondering why OpenMPI didn't use the allocated nodes chosen by SGE
> (cf. the "ALLOCATED NODES" report above) but instead placed the
> processes of the parallel computation on the nodes one at a time,
> using a round-robin method.
>
> Note that I'm using the '--bynode' option on the orterun command
> line. [4] If the behavior I'm observing is simply a consequence of
> using this option, please let me know. That would then mean that the
> command-line options take precedence over the SGE tight integration
> as far as orterun's placement behavior is concerned.
>
> Any help would be appreciated.
> Thanks,
> Eloi
>
> --
>
> Eloi Gaudry
>
> Free Field Technologies
> Axis Park Louvain-la-Neuve
> Rue Emile Francqui, 1
> B-1435 Mont-Saint Guibert
> BELGIUM
>
> Company Phone: +32 10 487 959
> Company Fax:   +32 10 454 626
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
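
[1] A quick way to verify both points from the shell; a minimal sketch
assuming Open MPI 1.3.x, where the gridengine components are listed by
ompi_info:

  # make sure the mpiexec in PATH belongs to the Open MPI build you expect
  which mpiexec
  mpiexec --version

  # with --with-sge the gridengine components should show up, e.g.
  # a line like "MCA ras: gridengine (...)"
  ompi_info | grep gridengine

If the grep prints nothing, Open MPI was built without SGE support and
cannot pick up the SGE allocation at all.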
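
[2] For illustration, a submission along these lines; the resource name
mem_free and the job script name job.sh are assumptions, not taken from
the original post:

  # 8 slots in the round_robin PE, 14 GB of free memory requested
  # per slot (adjust the complex name to your cluster's configuration)
  qsub -pe round_robin 8 -l mem_free=14G job.sh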
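
[3] With allocation_rule $round_robin, SGE itself already distributes
the granted slots across the hosts one by one. For comparison, a PE
that packs the slots onto as few hosts as possible would use $fill_up;
a minimal sketch, with the hypothetical name fill_up and otherwise the
same settings as above:

  pe_name            fill_up
  allocation_rule    $fill_up
  # remaining entries identical to the round_robin PE shown above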
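
[4] In Open MPI 1.3.x, --bynode and --byslot control how the ranks are
mapped onto the list of nodes the job starts with: --bynode places one
rank per node in turn, wrapping around, while --byslot fills the slots
of one node before moving to the next. The rank-to-host table above
matches the --bynode pattern. A sketch, with a hypothetical executable
./app:

  # one rank per node in turn, wrapping around
  mpiexec --bynode -np 8 ./app

  # fill each node's slots before moving to the next node
  mpiexec --byslot -np 8 ./app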