Hi,

On 21.05.2010, at 14:11, Eloi Gaudry wrote:

> Hi there,
> 
> I'm observing something strange on our cluster managed by SGE 6.2u4 when
> launching a parallel computation on several nodes, using the OpenMPI/SGE
> tight-integration mode (OpenMPI-1.3.3). It seems that the SGE-allocated
> slots are not used by OpenMPI, as if OpenMPI were doing its own round-robin
> allocation based on the allocated node hostnames.

Did you compile Open MPI with --with-sge (and recompile your applications)?
And are you using the matching mpiexec?
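
A quick way to check is to look for the gridengine component in ompi_info
(the exact output line below is just what I'd expect for a 1.3.3 build;
it may look slightly different on your installation):

  $ ompi_info | grep gridengine
    MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

If no gridengine component shows up, Open MPI was built without SGE
support and won't see the slot allocation at all.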

-- Reuti


> Here is what I'm doing:
> - launch a parallel computation involving 8 processors, each of them using
> 14GB of memory. I'm using a qsub command where I request the memory_free
> resource and use tight integration with openmpi (a sketch of the submission
> follows the node list below)
> - 3 servers are available:
> . barney with 4 cores (4 slots) and 32GB
> . carl with 4 cores (4 slots) and 32GB
> . charlie with 8 cores (8 slots) and 64GB
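> 
> The submission looks roughly like this (script and binary names are
> placeholders, not the actual ones):
> 
>   qsub -pe round_robin 8 -l memory_free=14G run_solver.sh
> 
> where run_solver.sh essentially just runs
> 
>   orterun --bynode ./my_solver
> 
> relying on the tight integration to pick up the allocated slots.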
> 
> Here is the output of the allocated nodes (OpenMPI output):
> ======================   ALLOCATED NODES   ======================
> 
> Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>  Daemon: [[44332,0],0] Daemon launched: True
>  Num slots: 4  Slots in use: 0
>  Num slots allocated: 4  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
>  Daemon: Not defined Daemon launched: False
>  Num slots: 2  Slots in use: 0
>  Num slots allocated: 2  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> Data for node: Name: barney.fft    Launch id: -1 Arch: 0 State: 2
>  Daemon: Not defined Daemon launched: False
>  Num slots: 2  Slots in use: 0
>  Num slots allocated: 2  Max slots: 0
>  Username on node: NULL
>  Num procs: 0  Next node_rank: 0
> 
> =================================================================
> 
> Here is what I see when my computation is running on the cluster:
> #     rank       pid          hostname
>         0     28112          charlie
>         1     11417          carl
>         2     11808          barney
>         3     28113          charlie
>         4     11418          carl
>         5     11809          barney
>         6     28114          charlie
>         7     11419          carl
> 
> Note that the parallel environment used under SGE is defined as:
> [eg@moe:~]$ qconf -sp round_robin
> pe_name            round_robin
> slots              32
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
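> 
> My understanding of the allocation_rule (please correct me if I'm wrong):
> with $round_robin, SGE hands out the slots one at a time across the
> hosts, while an otherwise identical PE with
> 
>   allocation_rule    $fill_up
> 
> would fill up each host's slots before moving on to the next one.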
> 
> I'm wondering why OpenMPI didn't use the allocation chosen by SGE (cf. the
> "ALLOCATED NODES" report above), but instead placed the processes of the
> parallel computation one at a time across the hosts, in a round-robin
> fashion.
> 
> Note that I'm using the '--bynode' option on the orterun command line. If
> the behavior I'm observing is simply the consequence of using this option,
> please let me know. That would mean that orterun's command-line options
> take precedence over the SGE tight-integration allocation.
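> 
> For reference, my understanding of the two mapping options (the binary
> name is again a placeholder):
> 
>   orterun --byslot -np 8 ./my_solver   # fill each node's slots before
>                                        # moving on (the default)
>   orterun --bynode -np 8 ./my_solver   # place one process per node in
>                                        # turn until all ranks are placed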
> 
> Any help would be appreciated,
> Thanks,
> Eloi
> 
> 
> -- 
> 
> 
> Eloi Gaudry
> 
> Free Field Technologies
> Axis Park Louvain-la-Neuve
> Rue Emile Francqui, 1
> B-1435 Mont-Saint Guibert
> BELGIUM
> 
> Company Phone: +32 10 487 959
> Company Fax:   +32 10 454 626

