Hi Ralph,

Thanks for this. However, --map-by ppr:32:socket:PE=2 --bind-to core
reports the same binding as --map-by ppr:32:socket:PE=4 --bind-to
hwthread:

[epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
/../../../../../../../..]
[epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
/../../../../../../../..]
[epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]],
socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
/../../../../../../../..]

And this is still different from the output produced using the rankfile.
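
To make the difference concrete, here is a minimal Python sketch (not part
of Open MPI; the name cores_per_rank is made up for illustration) that pulls
the bound cores per rank out of --report-bindings text. It assumes each
entry has the form "socket S[core C[hwt ...]]" as in the outputs in this
thread, and that the header of each rank has been joined onto one line.

```python
import re

# One "MCW rank N bound to" header per rank.
RANK_RE = re.compile(r"MCW rank (\d+) bound to")
# One "socket S[core C[" entry per bound core.
CORE_RE = re.compile(r"socket (\d+)\[core (\d+)\[")


def cores_per_rank(text):
    """Return {rank: [(socket, core), ...]} parsed from report-bindings text."""
    bindings = {}
    for line in text.splitlines():
        match = RANK_RE.search(line)
        if match is None:
            continue  # skip lines that only carry the [BB/../..] map
        rank = int(match.group(1))
        bindings[rank] = [(int(s), int(c)) for s, c in CORE_RE.findall(line)]
    return bindings


# Headers copied from the outputs in this thread (core maps omitted).
rankfile_out = """\
[epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
[epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
"""
mapby_out = """\
[epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:
[epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:
"""

if __name__ == "__main__":
    print(cores_per_rank(rankfile_out))
    print(cores_per_rank(mapby_out))
```

On these headers the rankfile run yields one (socket, core) pair per rank,
while the PE=2 run yields two adjacent pairs per rank. In other words, the
rankfile leaves every odd-numbered core unbound, whereas PE=2 binds each
rank to both cores of its pair.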

Cheers,
Luis

On 28/02/2021 14:06, Ralph Castain via users wrote:
> Your command line is incorrect:
>
> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>
> should be
>
> --map-by ppr:32:socket:PE=2 --bind-to core
>
>
>
>> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users
>> <users@lists.open-mpi.org> wrote:
>>
>> I should have said, "I would like to run 128 MPI processes on 2
>> nodes" and not 64 like I initially said...
>>
>> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
>>
>>     Hello OMPI users,
>>
>>     On 128-core nodes (2 sockets x 64 cores/socket, 2 hwthreads/core),
>>     I am trying to match the behavior of running with a rankfile by
>>     using manual mapping/ranking/binding.
>>
>>     I would like to run 64 MPI processes on 2 nodes, 1 MPI process
>>     every 2 cores. That is, I want to run 32 MPI processes per socket
>>     on 2 128-core nodes. My mapping should be something like:
>>
>>     Node 0
>>     =====
>>     rank 0  -  core 0
>>     rank 1  -  core 2
>>     rank 2  -  core 4
>>     ...
>>     rank 63 - core 126
>>
>>
>>     Node 1
>>     ====
>>     rank 64  -  core 0
>>     rank 65  -  core 2
>>     rank 66 -   core 4
>>     ...
>>     rank 127- core 126
>>
>>     If I use a rankfile:
>>     rank 0=epsilon102 slot=0
>>     rank 1=epsilon102 slot=2
>>     rank 2=epsilon102 slot=4
>>     rank 3=epsilon102 slot=6
>>     rank 4=epsilon102 slot=8
>>     rank 5=epsilon102 slot=10
>>     ....
>>     rank 123=epsilon103 slot=118
>>     rank 124=epsilon103 slot=120
>>     rank 125=epsilon103 slot=122
>>     rank 126=epsilon103 slot=124
>>     rank 127=epsilon103 slot=126
>>
>>     My --report-bindings output looks like:
>>
>>     [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>>     [BB/../../..
>>     
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     ../../../../../../../../../../../../../../../../../..]
>>     [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
>>     [../../BB/..
>>     
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     ../../../../../../../../../../../../../../../../../..]
>>     [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]:
>>     [../../../..
>>     
>> /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     ../../../../../../../../../../../../../../../../../..]
>>
>>
>>     However, I cannot match this --report-bindings output by manually
>>     using --map-by and --bind-to. I had the impression that this would
>>     give the same result:
>>
>>     mpirun -np $SLURM_NTASKS --report-bindings --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>
>>     But this output is not quite the same:
>>
>>     [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
>>     socket 0[core 1[hwt 0-1]]:
>>     [BB/BB/../../../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>     [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
>>     socket 0[core 3[hwt 0-1]]:
>>     [../../BB/BB/../../../../../../../../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>     
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>     
>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>
>>     What am I missing to match the rankfile behavior? In terms of
>>     performance, what difference is there between the first and the
>>     second bindings?
>>
>>     Thanks for your help!
>>     Luis
>>
>
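
The rankfile quoted above follows a regular pattern: 64 ranks per node, one
rank on every second slot. As a rough sketch only, it could be generated
rather than typed by hand. The function name make_rankfile and its
parameters are hypothetical, and the code assumes the "rank N=host slot=M"
syntax exactly as it appears in the quoted file.

```python
def make_rankfile(hosts, ranks_per_node=64, stride=2):
    """Build rankfile text placing one rank on every `stride`-th slot per host."""
    lines = []
    for node_index, host in enumerate(hosts):
        for local in range(ranks_per_node):
            # Ranks are numbered consecutively across hosts.
            rank = node_index * ranks_per_node + local
            lines.append(f"rank {rank}={host} slot={local * stride}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hostnames taken from the rankfile quoted in this thread.
    print(make_rankfile(["epsilon102", "epsilon103"]), end="")
```

With the two hostnames from the thread this produces 128 lines, from
"rank 0=epsilon102 slot=0" to "rank 127=epsilon103 slot=126".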
