Thank you, Ralph, for the advice. I will move on to try 1.8.4 as soon as I can.

The first torque job asks for nodes=1:ppn=16:whatever
The second job asks for nodes=1:ppn=16:whatever

Both jobs happen to finish up on the same 64 core node. Each is running on its own set of 16 cores, 0-15 and 16-31 respectively. As soon as the second one starts, core utilisation reported by top drops from 100% to 50% (on both). If I qdel it, the first one recovers immediately to 100%. The behaviour reported by top is an accurate reflection of the progress of the calculations.

Greg

-------------------------------------------------------------------------------------------------------
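One way to check whether the two jobs really sit on disjoint cores is to compare the affinity lists the kernel reports for their ranks. The following is only a sketch, assuming a Linux node where /proc/<pid>/status carries a Cpus_allowed_list entry; the helper names are hypothetical and the parser handles only plain "a-b,c" lists, not stride syntax.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-15,32" into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return cores


def allowed_cores(pid):
    """Read the cores a process may run on from /proc/<pid>/status (Linux only)."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpulist(line.split(":", 1)[1])
    raise RuntimeError(f"no Cpus_allowed_list entry for pid {pid}")


def shared_cores(cores_a, cores_b):
    """Return the cores that appear in both affinity sets."""
    return cores_a & cores_b
```

Calling allowed_cores() on one rank from each job and passing the results to shared_cores() shows whether they overlap. A non-empty result would be consistent with two ranks sharing each core at 50%, although it would not by itself show where the binding went wrong.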
Message: 1
List-Post: users@lists.open-mpi.org
Date: Wed, 28 Jan 2015 05:39:49 +0000
From: "DOHERTY, Greg" <g...@ansto.gov.au>
To: "us...@open-mpi.org" <us...@open-mpi.org>
Subject: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]
Message-ID: <31af19c9c3a1af4fa8fbe7a0e8f3deb81b08a...@exmbs1-b51.ansto.gov.au>
Content-Type: text/plain; charset="us-ascii"

This might or might not be related to openmpi 1.8.1. I have not seen the problem with the same program and previous versions of openmpi.

We have 64 core AMD nodes. I have recently recompiled a large Monte Carlo program using 1.8.1 version of openmpi. Users start this program using maui/torque asking for a number of cores, usually on only one node. One run of the program asking for any number of cores up to 64 runs with full cpu utilisation on each core.

A user might start a run asking for 16 cores - fine. Then he starts a second run on the same node, asking for another 16 cores. Immediately the cpu utilisation on all cores of the first job drops to 50%, as it is for the newly starting job. If a different program were using the remaining 32 cores on the same node at the same time, the cpu utilisation of its cores is unaffected. If we qdel the second 16 core job, the cpu utilisation of each core of the first job immediately climbs back to 100%.

Any suggestions please, on where I might start looking for the solution to this problem?

Greg Doherty
ANSTO

------------------------------

Message: 2
List-Post: users@lists.open-mpi.org
Date: Wed, 28 Jan 2015 06:16:33 -0600
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]
Message-ID: <CAMD57oeZpQzQX_WZ3B8X5AzdGUG3+RE1nD==8hgpw3_ra28...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I'm not entirely clear on the sequence of commands here. Is the user requesting a new allocation from maui/torque for each run?
In this case, it's possible we aren't correctly picking up the external binding from Torque. This would likely be a bug we would need to fix.

Or is the user obtaining a single allocation of the entire node, and then using mpirun to start multiple jobs in parallel? In this case, the issue is that the user needs to tell mpirun which cpus to confine itself to, or else it will always assume that all cpus belong to it. This will lead to overloading the lower core numbers. The problem here can be resolved by adding --cpuset 0,1,2 (or whatever pattern you like) to each cmd line.

You might also consider updating to 1.8.4 as we did fix some integration bugs. I don't recall something specific to this question, but it could be my memory at fault.

Ralph

On Tue, Jan 27, 2015 at 11:39 PM, DOHERTY, Greg <g...@ansto.gov.au> wrote:

> This might or might not be related to openmpi 1.8.1. I have not seen
> the problem with the same program and previous versions of openmpi
>
> We have 64 core AMD nodes. I have recently recompiled a large Monte
> Carlo program using 1.8.1 version of openmpi. Users start this program
> using maui/torque asking for a number of cores, usually on only one
> node. One run of the program asking for any number of cores up to 64
> runs with full cpu utilisation on each core. A user might start a run asking
> for 16 cores - fine. Then he starts a second run on the same node, asking
> for another 16 cores. Immediately the cpu utilisation on all cores of the first
> job drops to 50%, as it is for the newly starting job. If a different
> program were using the remaining 32 cores on the same node at the same
> time, the cpu utilisation of its cores is unaffected. If we qdel the
> second 16 core job, the cpu utilisation of each core of the first job
> immediately climbs back to 100%. Any suggestions please, on where I
> might start looking for the solution to this problem?
>
> Greg Doherty
>
> ANSTO
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/01/26239.php

------------------------------

End of users Digest, Vol 3106, Issue 1
**************************************
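
For the second scenario in Ralph's reply (one allocation of the whole node, several mpirun invocations inside it), each command line needs its own core list. Below is a rough sketch of generating such command lines. The --cpuset spelling is copied from Ralph's message and should be checked against the mpirun help output of the installed version; the program name ./montecarlo and the helper names are placeholders, not anything from the thread.

```python
def cpuset_arg(first_core, count):
    """Build a comma-separated core list, e.g. first_core=16, count=4 -> "16,17,18,19"."""
    return ",".join(str(core) for core in range(first_core, first_core + count))


def mpirun_command(program, nprocs, first_core):
    """Assemble an mpirun argument list confined to nprocs consecutive cores.

    The --cpuset flag name is taken as written in the thread (an assumption
    about the exact spelling accepted by a given Open MPI release).
    """
    return [
        "mpirun",
        "-np", str(nprocs),
        "--cpuset", cpuset_arg(first_core, nprocs),
        program,
    ]


if __name__ == "__main__":
    # Two 16-rank runs on the same node, on cores 0-15 and 16-31.
    for first_core in (0, 16):
        print(" ".join(mpirun_command("./montecarlo", 16, first_core)))
```

This does not apply if each run gets its own torque allocation, which is the case Greg describes in his follow-up; there the binding is expected to come from Torque.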