Thanks for the tip. I understand how using the --cpu-set option would help in
the example I described. However, suppose I have multiple users submitting MPI
jobs of various sizes? I wouldn't know a priori which cores were in use and
which weren't; I had always assumed that this is what these schedulers do. Is
there a way to map by socket without allowing a single core to be used by more
than one process? At first glance, I thought that --map-by socket with
--bind-to core would do this. Would one of the "NOOVERSUBSCRIBE" options help?

Also, in my test case, I have exactly the right number of cores (240) to run
15 jobs of 16 MPI processes each. I am shaking down a new cluster we just
bought. This is an extreme case, but not atypical of the way we use our
clusters.

------------------------------

Date: Thu, 28 Aug 2014 13:27:12 -0700
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] How does binding option affect network traffic?


On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. 
<kevin.mcgrat...@nist.gov> wrote:

> My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
> node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 
> jobs. Each job requires 16 MPI processes.  For each job, I want to use two 
> cores on each node, mapping by socket. If I use these options:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 
> <executable file name>
>  
> The reported bindings are:
>  
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> [burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
> [burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
> and so on.
>  
> These bindings appear to be OK, but when I do a 'top -H' on each node, I see 
> that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
> that I am only using 1/6 of my resources.

That is correct. The problem is that each mpirun execution has no idea what the 
others are doing, or even that they exist. Thus, they will each independently 
bind to cores 0 and 6, as you observe. You can get around this by submitting 
each job with its own --cpu-set argument telling it which cpus it is allowed to 
use, something like this (note that there is no value in adding PE=1, as that 
is automatically what happens with --bind-to core):

mpirun --cpu-set 0,6 --bind-to core ...
mpirun --cpu-set 1,7 --bind-to core ...

etc. You specified only two procs/node in your PBS request, so we will only map 
two on each node. This command line tells the first mpirun to use only cores 0 
and 6, and to bind each proc to one of those cores. The second uses only cores 
1 and 7, and so does not overlap the first job.
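If you'd rather not hand-maintain those lists, the core pairs can be generated 
from a per-job index, along these lines (just a sketch; JOB_INDEX is a 
placeholder for however your PBS setup passes an index in, e.g. via qsub -v):

# sketch: derive a disjoint core pair from a job index 0..5
# (12 cores/node, 2 cores per job)
JOB_INDEX=${JOB_INDEX:-0}
CORES="${JOB_INDEX},$((JOB_INDEX + 6))"    # 0,6  1,7  ...  5,11
mpirun --cpu-set ${CORES} --bind-to core --report-bindings -np 16 <executable>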

However, note that you cannot run 15 jobs at the same time in the manner you 
describe without overloading some cores, since you only have 12 cores/node. 
That will create a poor-performance situation.


> I want to use 100%. So I try this:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
> <executable file name>
>  
> Now it appears that I am getting 100% usage of all cores on all nodes. The 
> bindings are:
>  
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> and so on.
>  
> The problem now is that some of my jobs are hanging. They all start running 
> fine and produce output, but at some point I lose about 4 out of 15 jobs to 
> hanging. I suspect that an MPI message is sent but never received. The number 
> of jobs that hang, and the time at which they hang, vary from test to test. 
> We have run these cases successfully on our old cluster dozens of times; 
> they are part of our benchmark suite.

Did you have more cores on your old cluster? I suspect the problem here is 
resource exhaustion, especially if you are using InfiniBand, since you are 
overloading some of the cores as mentioned above.
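A quick way to confirm that kind of overlap on a node (a sketch; the node and 
executable names are placeholders) is to look at which core each thread is 
actually sitting on:

# PSR is the core a thread last ran on; counting repeats shows how many
# of your processes are stacked on the same core
ssh burn001 "ps -eLo psr,comm | grep <executable> | sort -n | uniq -c"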

>  
> When I run these jobs using a map-by-core strategy (that is, the MPI 
> processes are simply mapped by core, and each job uses only 16 cores on two 
> nodes), I do not see as much hanging. It still occurs, but less often. This 
> leads me to suspect that the increased network traffic from the map-by-socket 
> approach is the cause of the problem, but I do not know what to do about it. 
> I think that the map-by-socket approach is the right one, but I do not know 
> whether I have my OpenMPI options just right.
>  
> Can you tell me what OpenMPI options to use, and how I might debug the 
> hanging issue?
>  
>  
>  
> Kevin McGrattan
> National Institute of Standards and Technology
> 100 Bureau Drive, Mail Stop 8664
> Gaithersburg, Maryland 20899
>  
> 301 975 2712
>  
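
On the debugging question: a minimal way to see where a hung rank is stuck (a 
sketch, assuming gdb is available on the compute nodes; the node and executable 
names are placeholders) is to attach to one of the stalled processes and dump 
its thread stacks:

ssh burn004
# attach to the oldest matching process, print backtraces for all of its
# threads, then detach
gdb -batch -ex "thread apply all bt" -p $(pgrep -o <executable>)

If most ranks are sitting in an MPI wait or receive while one rank is somewhere 
else entirely, that usually points at the rank holding things up.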
