Hi,

Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per 
node, running with OMP_NUM_THREADS=4. A common placement policy we see is to 
have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4 
cores, rank 2 on the first 4 cores of the second socket, and so on. In Open MPI 
3.1.2 this is easily accomplished with:

        $ mpirun --map-by ppr:2:socket:PE=4 --report-bindings
        [raijin1:07173] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
        [raijin1:07173] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
        [raijin1:07173] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
        [raijin1:07173] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
        <and similarly on subsequent nodes>
        <and similarly on subsequent nodes>
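
As a sanity check independent of --report-bindings, each rank can also print 
its Linux cpuset directly; this is only a sketch, assuming /proc is available 
and relying on Open MPI's OMPI_COMM_WORLD_RANK environment variable:

        $ mpirun --map-by ppr:2:socket:PE=4 bash -c \
            'echo "rank $OMPI_COMM_WORLD_RANK: $(grep Cpus_allowed_list /proc/self/status)"'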

although looking at the man page now it seems like this is an invalid construct 
(even though it worked).

However, it looks like this (mis)use no longer works in Open MPI 3.1.3:

        
        --------------------------------------------------------------------------
        An invalid value was given for the number of processes
        per resource (ppr) to be mapped on each node:
        
          PPR:  2:socket:PE=4
        
        The specification must be a comma-separated list containing
        combinations of number, followed by a colon, followed
        by the resource type. For example, a value of "1:socket" indicates that
        one process is to be mapped onto each socket. Values are supported
        for hwthread, core, L1-3 caches, socket, numa, and node. Note that
        enough characters must be provided to clearly specify the desired
        resource (e.g., "nu" for "numa").
        
        --------------------------------------------------------------------------

We’ve come up with an equivalent, but it needs both --map-by and --rank-by:

        $ mpirun --map-by node:PE=4 --rank-by core

(Without the --rank-by option it round-robins the ranks between nodes first, 
as expected, rather than numbering them consecutively within each node.) Is 
this the correct approach for getting this distribution?
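
For reference, a full invocation with this workaround would look something 
like the following, where ./hybrid_app is just a placeholder for the actual 
binary and -x exports OMP_NUM_THREADS to the ranks:

        $ OMP_NUM_THREADS=4 mpirun --map-by node:PE=4 --rank-by core \
              -x OMP_NUM_THREADS --report-bindings ./hybrid_app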

As an aside, I’m not sure if this is the expected behaviour, but using 
--map-by socket:PE=4 on its own fails: it tries to put rank 4 on the first 
socket of the first node even though no free cores are left there (because of 
the PE=4), instead of moving on to the next node (see the sketch below). But 
we’d still need the --rank-by option in that case anyway.

Cheers,
Ben
