Re: [OMPI users] nodes are oversubscribed in 1.1.1

Pak Lui Wed, 24 Jan 2007 13:02:51 -0500

Geoff Galitz wrote:

On Jan 24, 2007, at 7:03 AM, Pak Lui wrote:
Geoff Galitz wrote:
Hello,
On the following system:
OpenMPI 1.1.1
SGE 6.0 (with tight integration)
Scientific Linux 4.3
Dual Dual-Core Opterons
MPI jobs are oversubscribing the nodes. No matter where jobs are launched by the scheduler, they always stack up on the first node (node00) and continue to stack even though the system load exceeds 6 (on a 4-processor box). Each node is defined as 4 slots with 4 max slots. The MPI jobs launch via "mpirun -np (some-number-of-processors)" from within the scheduler.
Hi Geoff,
I think SGE support first appeared in 1.2, not in 1.1.1. Unless you modified your build to include the gridengine ras/pls modules from v1.2, you are probably not using the SGE tight integration. So even though you start mpirun in the SGE parallel environment, ORTE does not have the gridengine modules for allocating and launching the jobs, which could be why all processes are launched on the same node (no node list is available from gridengine, so it defaults to a single node).
I have used the backport instructions provided by Olli-Pekka Lehto. Whether it is running properly in my case, I can't say: I am certainly not getting the expected behavior, although the jobs do run.

There are a few things you can try to verify whether the gridengine plugin is being used. First, check ompi_info to see if 'gridengine' is in the list. Another option is the mpirun -d flag, which shows which ras and pls components mpirun is running through. There are also a few MCA params you can use to show when the gridengine ras+pls modules are in action. Here are a couple:

"-mca ras_gridengine_verbose 1" would show the output from SGE as if you were passing the -verbose flag to qrsh.

"-mca pls_gridengine_debug 1" would show you the qrsh -inherit command
that is used to send the job to SGE.

If you do not see that output, chances are mpirun is not using the gridengine modules, and that might be a problem with your setup. I would suggest that you either verify the build process or try v1.2 instead.
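
The checks described above could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, the sample text is an assumed approximation of what ompi_info prints (one "MCA <framework>: <component>" entry per line), not captured output, and only the mpirun flags named in this message are used.

```python
import re


def gridengine_frameworks(ompi_info_text):
    """Return the MCA frameworks (e.g. 'ras', 'pls') that list a gridengine component.

    Assumes ompi_info prints entries shaped like 'MCA ras: gridengine (...)'.
    """
    pattern = re.compile(r"MCA\s+(\w+):\s+gridengine\b")
    return {match.group(1) for match in pattern.finditer(ompi_info_text)}


def debug_mpirun_command(nprocs, program):
    """Build an mpirun command line using the diagnostic flags from this thread."""
    return [
        "mpirun", "-d",
        "-mca", "ras_gridengine_verbose", "1",
        "-mca", "pls_gridengine_debug", "1",
        "-np", str(nprocs), program,
    ]


# Illustrative text only (assumed format); paste real ompi_info output here instead.
SAMPLE_OMPI_INFO = """\
                 MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.2)
"""

found = gridengine_frameworks(SAMPLE_OMPI_INFO)
missing = {"ras", "pls"} - found
print("gridengine components found in:", sorted(found))
print("gridengine components missing from:", sorted(missing))
print("debug command:", " ".join(debug_mpirun_command(4, "./a.out")))
```

If either ras or pls ends up in the missing set for the real ompi_info output, the build most likely does not contain the gridengine modules, and the debug flags will print nothing.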

On a related note, SGE has a way to allocate and assign slots for launching tasks: the allocation rule in the parallel environment (PE). If all of the slots are allocated on the same node, it sounds like the allocation rule has been set to $fill_up. Maybe you can try $round_robin instead?
If I use $round_robin, one MPI process starts up per node and then wraps around the cluster. So if I have a 4-process MPI job, it starts 1 process on each of 4 nodes, which is certainly not the most efficient method.
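
To make the difference between the two allocation rules concrete, here is a toy model of how slots get handed out. It is not SGE's scheduler code; the function name and host list are invented for illustration, and it only mimics the behavior described in this thread ($fill_up packs one host before moving on, $round_robin deals one slot per host and wraps).

```python
def allocate(nprocs, hosts, rule):
    """Distribute nprocs slots over hosts under a simplified allocation rule.

    hosts is a list of (hostname, slots) pairs; returns {hostname: slots_granted}.
    """
    counts = {name: 0 for name, _ in hosts}
    remaining = nprocs
    if rule == "$fill_up":
        # Use up every slot on one host before touching the next one.
        for name, slots in hosts:
            take = min(slots, remaining)
            counts[name] = take
            remaining -= take
    elif rule == "$round_robin":
        # Hand out one slot per host per pass, wrapping until done or full.
        progressed = True
        while remaining > 0 and progressed:
            progressed = False
            for name, slots in hosts:
                if remaining > 0 and counts[name] < slots:
                    counts[name] += 1
                    remaining -= 1
                    progressed = True
    else:
        raise ValueError("unknown allocation rule: %s" % rule)
    if remaining > 0:
        raise ValueError("not enough slots for %d processes" % nprocs)
    return counts


# Four hypothetical nodes with 4 slots each, as in the cluster described above.
HOSTS = [("node00", 4), ("node01", 4), ("node02", 4), ("node03", 4)]

print(allocate(4, HOSTS, "$fill_up"))
print(allocate(4, HOSTS, "$round_robin"))
```

For a 4-process job the first call places everything on node00 and the second places one process on each node, which matches the two behaviors reported here.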
It seems to me that MPI is not detecting that the nodes are overloaded, and that this is due to the way the job slots are defined and how mpirun is being called. If I read the documentation correctly, a single mpirun run consumes one job slot no matter how many processes are launched. We can change the number of job slots, but then we expect to waste processors, since only one mpirun job will run on any node, even if the job is only a two-processor job.
As for oversubscription, I recall that the -nooversubscribe option was introduced in v1.2. You would use it if you want to stop ORTE from oversubscribing, because oversubscription is allowed by default.
So it seems the real story for me is that there is no logic that detects the oversubscription condition and re-schedules the job for another node in the MPI nodelist in OpenMPI 1.1.1? If so, that would certainly explain what I am seeing. Is that correct?

Actually, I take back the comment that v1.1.1 doesn't have the -nooversubscribe option. I just checked the source and it's there in orterun, so that option is available to prevent oversubscription.

The behavior you are seeing is probably due to the resource allocation subsystem (RAS) not getting the node list (from SGE), so RMAPS does not have a node list to map the processes to, and therefore it uses the same node to launch all the user processes.
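
A toy model of that failure mode, for illustration only. This is not ORTE's RAS/RMAPS code: the function names, the node00 fallback, and the nooversubscribe parameter are stand-ins, and the hostfile format (hostname, slot count, queue, processor range per line, as SGE writes to the file named by $PE_HOSTFILE) is stated from memory of the SGE docs rather than taken from this thread.

```python
def parse_pe_hostfile(text):
    """Parse 'host slots queue processor-range' lines into [(host, slots)]."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            nodes.append((fields[0], int(fields[1])))
    return nodes


def map_processes(nprocs, nodelist, local_host="node00", local_slots=4,
                  nooversubscribe=False):
    """Place nprocs processes on nodes, filling each node's slots in order.

    With an empty nodelist (nothing received from the resource manager) the
    only known node is the local one, so every process lands there.
    """
    if not nodelist:
        nodelist = [(local_host, local_slots)]
    total_slots = sum(slots for _, slots in nodelist)
    if nooversubscribe and nprocs > total_slots:
        raise RuntimeError("%d processes requested, only %d slots available"
                           % (nprocs, total_slots))
    placement = []
    for host, slots in nodelist:
        placement.extend([host] * min(slots, nprocs - len(placement)))
    # Oversubscription allowed: wrap the leftover processes around the nodes.
    index = 0
    while len(placement) < nprocs:
        placement.append(nodelist[index % len(nodelist)][0])
        index += 1
    return placement


# Hypothetical allocation of two 4-slot nodes, in the assumed hostfile format.
SAMPLE_HOSTFILE = """\
node00 4 all.q@node00 UNDEFINED
node01 4 all.q@node01 UNDEFINED
"""

print(map_processes(6, []))
print(map_processes(6, parse_pe_hostfile(SAMPLE_HOSTFILE)))
```

With no node list, all six processes pile onto node00 even though it has only four slots; with the parsed list, the last two spill over to node01. Setting nooversubscribe=True in the first case raises an error instead of stacking processes.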


-geoff
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

Thanks,

- Pak Lui
pak....@sun.com
