Ah, indeed - it sounds like we are not correctly picking up the cpuset. Can
you pass me the environ from the procs, and the contents of the
$PBS_HOSTFILE? IIRC, Torque isn't going to bind us to those cores itself,
but instead puts something into the environ or the allocation that we need
to parse correctly.
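
Something like the following, run from inside each job script (or pointed at
the running procs with taskset -cp <pid>), should capture what I'm after.
This is just a sketch - on many Torque installs the hostfile is pointed to
by $PBS_NODEFILE rather than $PBS_HOSTFILE, so check which one is set:

  env | grep '^PBS_'                          # Torque-related environment
  cat "$PBS_HOSTFILE"                         # or "$PBS_NODEFILE", whichever exists
  grep Cpus_allowed_list /proc/self/status    # the cpuset/affinity the job sees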

Thanks
Ralph


On Wed, Jan 28, 2015 at 3:52 PM, DOHERTY, Greg <g...@ansto.gov.au> wrote:

> Thank you, Ralph, for the advice. I will move on to try 1.8.4 as soon as I
> can.
> The first Torque job asks for nodes=1:ppn=16:whatever
> The second job asks for nodes=1:ppn=16:whatever
> Both jobs happen to end up on the same 64-core node. Each is running on its
> own set of 16 cores, 0-15 and 16-31 respectively.
> As soon as the second one starts, the core utilisation reported by top drops
> from 100% to 50% on both jobs. If I qdel the second job, the first one
> recovers immediately to 100%.
> The behaviour reported by top is an accurate reflection of the progress of
> the calculations.
> Greg
>
> -------------------------------------------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 28 Jan 2015 05:39:49 +0000
> From: "DOHERTY, Greg" <g...@ansto.gov.au>
> To: "us...@open-mpi.org" <us...@open-mpi.org>
> Subject: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]
>
> This might or might not be related to openmpi 1.8.1; I have not seen the
> problem with the same program and previous versions of openmpi. We have
> 64-core AMD nodes, and I have recently recompiled a large Monte Carlo
> program using version 1.8.1 of openmpi. Users start this program via
> maui/torque, asking for a number of cores, usually on only one node.
>
> A single run of the program asking for any number of cores up to 64 runs
> with full cpu utilisation on each core. A user might start a run asking for
> 16 cores - fine. Then he starts a second run on the same node, asking for
> another 16 cores. Immediately the cpu utilisation on all cores of the first
> job drops to 50%, the same as for the newly started job. If a different
> program is using the remaining 32 cores on the same node at the same time,
> the cpu utilisation of its cores is unaffected. If we qdel the second
> 16-core job, the cpu utilisation of each core of the first job immediately
> climbs back to 100%. Any suggestions, please, on where I might start
> looking for a solution to this problem?
>
> Greg Doherty
> ANSTO
>
> ------------------------------
>
> Message: 2
> Date: Wed, 28 Jan 2015 06:16:33 -0600
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]
>
> I'm not entirely clear on the sequence of commands here. Is the user
> requesting a new allocation from maui/torque for each run? In this case,
> it's possible we aren't correctly picking up the external binding from
> Torque. This would likely be a bug we would need to fix.
>
> Or is the user obtaining a single allocation of the entire node, and then
> using mpirun to start multiple jobs in parallel? In that case, the issue is
> that the user needs to tell mpirun which cpus to confine itself to;
> otherwise each mpirun assumes that all cpus belong to it, which leads to
> overloading the lower core numbers. This can be resolved by adding
> --cpuset 0,1,2 (or whatever pattern you like) to each command line.
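>
> For example, with two runs sharing one node, something along these lines
> (just a sketch - ./my_prog is a placeholder, the lists are shortened to
> four cores each for readability, and the real lists must match whatever
> cores each job was actually allocated):
>
>   mpirun --cpuset 0,1,2,3 ./my_prog    # first run, confined to cores 0-3
>   mpirun --cpuset 4,5,6,7 ./my_prog    # second run, confined to cores 4-7
>
> With each mpirun confined to a disjoint cpu list, the two jobs no longer
> compete for the same lower-numbered cores.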
>
> You might also consider updating to 1.8.4, as we did fix some integration
> bugs. I don't recall a fix specific to this issue, but my memory could be
> at fault.
>
> Ralph
