I am able to run all 15 of my jobs simultaneously; 16 MPI processes per job; 
mapping by socket and binding to socket. On a given socket, 6 MPI processes 
from 6 separate mpiruns share the 6 cores, or at least I assume they are 
sharing. The load for all CPUs and all processes is 100%. I understand that I 
am loading the system to its limits, but is what I am doing OK? My jobs are 
running, and the only problem seems to be that some jobs are hanging at random 
times. This is a new cluster I am shaking down, and I am guessing that the 
message passing traffic is causing packet losses. I am working with the vendor 
to sort this out, but I am curious whether I am using Open MPI appropriately.

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
<executable file name>

The bindings are:

[burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.]
[burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]:   [B/B/B/B/B/B][./././././.]
and so on.


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, August 29, 2014 3:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] How does binding option affect network traffic?


On Aug 29, 2014, at 10:51 AM, McGrattan, Kevin B. Dr. 
<kevin.mcgrat...@nist.gov> wrote:


Thanks for the tip. I understand how using the --cpuset option would help me in 
the example I described. However, suppose I have multiple users submitting MPI 
jobs of various sizes. I wouldn't know a priori which cores were in use and 
which weren't. I always assumed that this is what these various schedulers did. 
Is there a way to map by socket but not allow a single core to be used by more 
than one process? At first glance, I thought that --map-by socket and --bind-to 
core would do this. Would one of these "NOOVERSUBSCRIBE" options help?

I'm afraid not - the issue here is that the mpiruns don't know about each 
other. What you'd need to do is have your scheduler assign cores for our use - 
we'll pick that up and stay inside that envelope. The exact scheduler command 
depends on the scheduler, of course, but the scheduler would then have the more 
global picture and could keep things separated.



Also, in my test case, I have exactly the right number of cores (240) to run 15 
jobs using 16 MPI processes. I am shaking down a new cluster we just bought. 
This is an extreme case, but not atypical of the way we use our clusters.

Well, you do, but not exactly in the way you showed you were trying to use 
them. If you try to run as you described, with 2 ppn for each mpirun and 12 
cores/node, you can run a maximum of 6 mpiruns at a time across a given set of 
nodes. So you'd need to stage your allocations correctly to make it work.
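
Spelling out the arithmetic behind that limit (my restatement, not part of the 
original reply):

12 cores/node / 2 procs per node per job = at most 6 jobs sharing a node
15 jobs x 8 nodes each = 120 node-slots = 20 nodes x 6 jobs per node

So the capacity is there, but only if the 15 allocations are spread so that no 
node ends up hosting more than 6 jobs; otherwise some cores are shared.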





Date: Thu, 28 Aug 2014 13:27:12 -0700
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] How does binding option affect network traffic?


On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. 
<kevin.mcgrat...@nist.gov> wrote:


My institute recently purchased a Linux cluster with 20 nodes; 2 sockets per 
node; 6 cores per socket. Open MPI v1.8.1 is installed. I want to run 15 jobs. 
Each job requires 16 MPI processes. For each job, I want to use two cores on 
each node, mapping by socket. If I use these options:

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16
<executable file name>

The reported bindings are:

[burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
and so on.

These bindings appear to be OK, but when I do a "top -H" on each node, I see 
that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
that I am only using 1/6 of my resources.

That is correct. The problem is that each mpirun execution has no idea what the 
others are doing, or even that they exist. Thus, they will each independently 
bind to cores 0 and 6, as you observe. You can get around this by 
submitting each with a separate --cpuset argument telling it which CPUs it is 
allowed to use - something like this (note that there is no value to having 
pe=1 as that is automatically what happens with bind-to core):

mpirun --cpuset 0,6 --bind-to core  ....
mpirun --cpuset 1,7 --bind-to core  ...

etc. You specified only two procs/node with your PBS request, so we'll only map 
two on each node. This command line tells the first mpirun to only use cores 0 
and 6, and to bind each proc to one of those cores. The second uses only cores 
1 and 7, and thus is separated from the first command.
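
A quick way to verify the separation (my addition, not from the thread) is to 
check the affinity mask of a running rank on one of the nodes; on Linux this 
can be done with taskset, where <pid> is a placeholder for a rank's process ID:

taskset -cp <pid>    # prints the list of cores the process may run on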

However, you should note that you can't run 15 jobs at the same time in the 
manner you describe without overloading some cores as you only have 12 
cores/node. This will create a poor-performance situation.



I want to use 100%. So I try this:

#PBS -l nodes=8:ppn=2
mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16
<executable file name>

Now it appears that I am getting 100% usage of all cores on all nodes. The 
bindings are:

[burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
and so on.

The problem now is that some of my jobs are hanging. They all start running 
fine, and produce output. But at some point I lose about 4 out of 15 jobs due 
to hanging. I suspect that an MPI message is passed and not received. The 
number of jobs that hang and the time when they hang varies from test to test. 
We have run these cases successfully on our old cluster dozens of times - they 
are part of our benchmark suite.

Did you have more cores on your old cluster? I suspect the problem here is 
resource exhaustion, especially if you are using InfiniBand, as you are 
overloading some of the cores, as mentioned above.



When I run these jobs using a map-by-core strategy (that is, the MPI processes 
are just mapped by core, and each job only uses 16 cores on two nodes), I do 
not see as much hanging. It still occurs, but less often. This leads me to 
suspect that the increased network traffic from the map-by-socket approach is 
the cause of the problem. But I do not know what to do about it. I think that 
the map-by-socket approach is the right one, but I do not know if I have my 
Open MPI options just right.

Can you tell me what Open MPI options to use, and how I might debug the hanging 
issue?
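
As a generic approach to the hang itself (my suggestion, not from the thread): 
attach a debugger to a few of the stuck ranks and compare their stack traces; 
if one rank is waiting in an MPI receive that no other rank ever matches, the 
offending call usually shows up near the top of its trace. Here <pid> is a 
placeholder for the process ID of a hung rank.

gdb -p <pid>
(gdb) thread apply all bt    # stack traces for every thread in this rank
(gdb) detach
(gdb) quit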



Kevin McGrattan
National Institute of Standards and Technology
100 Bureau Drive, Mail Stop 8664
Gaithersburg, Maryland 20899

301 975 2712