Hi John,

I would be interested to know if that does what you are expecting...

On 01/03/2021 00:02, John R Cary via users wrote:
> I've been watching this exchange with interest, because it is the
> closest I have seen to what I want, but I want something slightly
> different: 2 processes per node, with the first one bound to one core,
> and the second bound to all the rest, with no use of hyperthreads.
>
> Would this be
>
> --map-by ppr:2:node --bind-to core --cpu-list 0,1-31
>
> ?
>
> Thx....
>
>
> On 2/28/21 5:44 PM, Ralph Castain via users wrote:
>> The only way I know of to do what you want is
>>
>> --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
>>
>> where you list out the exact cpus you want to use.
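
Writing that list out by hand is error-prone, so it can be generated instead. What follows is only a sketch: the helper name `even_core_list` is made up for illustration, and it assumes 128 cores per node, numbered 0 to 127 in the order that --cpu-list expects. That numbering is worth confirming with --report-bindings on the real machine.

```python
def even_core_list(total_cores: int, stride: int = 2) -> str:
    """Return a comma-separated list of core IDs: 0, stride, 2*stride, ...

    Hypothetical helper: total_cores and stride are assumptions about the
    node layout discussed in this thread (128 cores, every second core).
    """
    if total_cores <= 0 or stride <= 0:
        raise ValueError("total_cores and stride must be positive")
    return ",".join(str(core) for core in range(0, total_cores, stride))


if __name__ == "__main__":
    cpu_list = even_core_list(128)
    # Prints the option string suggested above, with the list filled in.
    print(f"--map-by ppr:32:socket --bind-to core --cpu-list {cpu_list}")
```

With 128 cores and a stride of 2 this yields 64 entries per node, ending at core 126.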
>>
>>
>>> On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below)
>>> but I cannot see a way of mapping to every second core (0,2,4,...).
>>>
>>>  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>>> [BB/../../..
>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../../../../../../../../../../../../..]
>>> [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
>>> [../BB/../..
>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../../../../../../../../../../../../..]
>>>
>>> On 28/02/2021 16:24, Ralph Castain via users wrote:
>>>> Did you read the documentation on rankfile? The "slot=N" directive
>>>> says to "put this proc on core N". In your file, you stipulate that
>>>>
>>>> rank 0 is to be placed solely on core 0
>>>> rank 1 is to be placed solely on core 2
>>>> etc.
>>>>
>>>> That is not what you asked for in your mpirun cmd. You asked that
>>>> each proc be mapped to TWO cores (PE=2) or FOUR threads (PE=4 with
>>>> bind-to HWT). If you wanted that same thing in a rankfile, it
>>>> should have said
>>>>
>>>> rank 0 slots=0-1
>>>> rank 1 slots=2-3
>>>> etc.
>>>>
>>>> Hence the difference. I was simply correcting your mpirun cmd line
>>>> as you said you wanted two CORES, and that isn't guaranteed if you
>>>> are stipulating things in terms of HWTs as not every machine has
>>>> two HWTs/core.
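
The point about PE=2 with --bind-to core versus PE=4 with --bind-to hwthread can be checked with a little arithmetic. This sketch is a simplified model taken from the explanation above (PE counts cores under --bind-to core and hardware threads under --bind-to hwthread); it is not Open MPI's actual mapping logic, and `cores_per_rank` is a hypothetical helper.

```python
def cores_per_rank(pe: int, bind_to: str, hwthreads_per_core: int) -> int:
    """Number of physical cores one rank ends up covering.

    Simplified model (an assumption, not Open MPI code):
      - bind_to == "core":     PE is a count of cores
      - bind_to == "hwthread": PE is a count of hardware threads
    """
    if pe <= 0 or hwthreads_per_core <= 0:
        raise ValueError("pe and hwthreads_per_core must be positive")
    if bind_to == "core":
        return pe
    if bind_to == "hwthread":
        # Ceiling division: a partly used core still counts as a core.
        return -(-pe // hwthreads_per_core)
    raise ValueError(f"unsupported bind_to value: {bind_to!r}")


if __name__ == "__main__":
    # Two hardware threads per core, as on the nodes in this thread.
    print(cores_per_rank(2, "core", 2))      # 2
    print(cores_per_rank(4, "hwthread", 2))  # 2
    # On a machine with one hardware thread per core the two differ.
    print(cores_per_rank(4, "hwthread", 1))  # 4
```

Under this model the two command lines agree only when every core has exactly two hardware threads, which is why specifying the request in cores is the safer choice.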
>>>>
>>>>
>>>>
>>>>> On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to
>>>>> core reports the same binding as --map-by ppr:32:socket:PE=4
>>>>> --bind-to hwthread:
>>>>>
>>>>> [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt
>>>>> 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>> [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt
>>>>> 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>> [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt
>>>>> 0-1]], socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>>
>>>>> And this is still different from the output produced using the
>>>>> rankfile.
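
Comparing these wrapped outputs by eye is tedious, so the core IDs can be pulled out of the header lines instead. This is a rough sketch that assumes the header format seen in this thread ("MCW rank N bound to socket S[core C[hwt ...]]:") and that the mail quote prefixes have already been stripped; the format may differ between Open MPI versions. The sample strings are header lines copied from the thread with the long core maps left out, and `bound_cores` is a hypothetical helper.

```python
import re

# Header of one --report-bindings entry, e.g.
# "MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:"
HEADER = re.compile(r"MCW rank (\d+) bound to ([^:]+):")
CORE = re.compile(r"core (\d+)\[")


def bound_cores(report: str) -> dict:
    """Map each rank to the sorted list of core IDs named in its header."""
    # Undo line wrapping by collapsing all whitespace to single spaces.
    flat = " ".join(report.split())
    return {
        int(rank): sorted(int(core) for core in CORE.findall(where))
        for rank, where in HEADER.findall(flat)
    }


rankfile_run = """
[epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt
0-1]]:
[epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt
0-1]]:
"""

pe2_run = """
[epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt
0-1]], socket 0[core 1[hwt 0-1]]:
[epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt
0-1]], socket 0[core 3[hwt 0-1]]:
"""

if __name__ == "__main__":
    print(bound_cores(rankfile_run))  # {0: [0], 1: [2]}
    print(bound_cores(pe2_run))       # {0: [0, 1], 1: [2, 3]}
```

The difference is then explicit: the rankfile run binds each rank to a single core, while the PE=2 run binds each rank to a pair of adjacent cores.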
>>>>>
>>>>> Cheers,
>>>>> Luis
>>>>>
>>>>> On 28/02/2021 14:06, Ralph Castain via users wrote:
>>>>>> Your command line is incorrect:
>>>>>>
>>>>>> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>>
>>>>>> should be
>>>>>>
>>>>>> --map-by ppr:32:socket:PE=2 --bind-to core
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users
>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>
>>>>>>> I should have said, "I would like to run 128 MPI processes on 2
>>>>>>> nodes" and not 64 like I initially said...
>>>>>>>
>>>>>>> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>     Hello OMPI users,
>>>>>>>
>>>>>>>     On 128-core nodes (2 sockets x 64 cores/socket, 2
>>>>>>>     hwthreads/core), I am trying to match the behavior of running
>>>>>>>     with a rankfile using manual mapping/ranking/binding.
>>>>>>>
>>>>>>>     I would like to run 64 MPI processes on 2 nodes, 1 MPI process
>>>>>>>     every 2 cores. That is, I want to run 32 MPI processes per
>>>>>>>     socket on 2 128-core nodes. My mapping should be something like:
>>>>>>>
>>>>>>>     Node 0
>>>>>>>     =====
>>>>>>>     rank 0  -  core 0
>>>>>>>     rank 1  -  core 2
>>>>>>>     rank 2  -  core 4
>>>>>>>     ...
>>>>>>>     rank 63 - core 126
>>>>>>>
>>>>>>>
>>>>>>>     Node 1
>>>>>>>     ====
>>>>>>>     rank 64  -  core 0
>>>>>>>     rank 65  -  core 2
>>>>>>>     rank 66 -   core 4
>>>>>>>     ...
>>>>>>>     rank 127- core 126
>>>>>>>
>>>>>>>     If I use a rankfile:
>>>>>>>     rank 0=epsilon102 slot=0
>>>>>>>     rank 1=epsilon102 slot=2
>>>>>>>     rank 2=epsilon102 slot=4
>>>>>>>     rank 3=epsilon102 slot=6
>>>>>>>     rank 4=epsilon102 slot=8
>>>>>>>     rank 5=epsilon102 slot=10
>>>>>>>     ....
>>>>>>>     rank 123=epsilon103 slot=118
>>>>>>>     rank 124=epsilon103 slot=120
>>>>>>>     rank 125=epsilon103 slot=122
>>>>>>>     rank 126=epsilon103 slot=124
>>>>>>>     rank 127=epsilon103 slot=126
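
A rankfile like this one is easy to generate rather than type. The sketch below reproduces the "rank N=host slot=C" lines shown above; the function name and its parameters (`stride`, `width`) are invented for illustration, and the host names and counts are simply those from this thread.

```python
def rankfile_lines(hosts, ranks_per_host, stride=2, width=1):
    """Build rankfile lines of the form 'rank N=host slot=C' or 'slot=A-B'.

    Hypothetical helper. Each rank on a host starts at core i * stride and
    covers `width` consecutive cores (width=1 gives a single core).
    """
    if width < 1 or stride < width:
        raise ValueError("need 1 <= width <= stride so ranks do not overlap")
    lines = []
    rank = 0
    for host in hosts:
        for i in range(ranks_per_host):
            first = i * stride
            last = first + width - 1
            slot = str(first) if width == 1 else f"{first}-{last}"
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return lines


if __name__ == "__main__":
    # 64 ranks per node on every second core, as in the rankfile above.
    for line in rankfile_lines(["epsilon102", "epsilon103"], 64):
        print(line)
```

Passing width=2 instead gives each rank a pair of cores (slot=0-1, slot=2-3, ...), which is the rankfile counterpart of the PE=2 --bind-to core mapping discussed elsewhere in this thread.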
>>>>>>>
>>>>>>>     My --report-bindings output looks like:
>>>>>>>
>>>>>>>     [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt
>>>>>>>     0-1]]:
>>>>>>>     [BB/../../..
>>>>>>>     /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>>     [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt
>>>>>>>     0-1]]:
>>>>>>>     [../../BB/..
>>>>>>>     /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>>     [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt
>>>>>>>     0-1]]:
>>>>>>>     [../../../..
>>>>>>>     /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>>
>>>>>>>
>>>>>>>     However, I cannot match this --report-bindings output by
>>>>>>>     manually using --map-by and --bind-to. I had the impression
>>>>>>>     that this would be the same:
>>>>>>>
>>>>>>>     mpirun -np $SLURM_NTASKS --report-bindings \
>>>>>>>         --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>>>
>>>>>>>     But this output is not quite the same:
>>>>>>>
>>>>>>>     [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt
>>>>>>>     0-1]], socket 0[core 1[hwt 0-1]]:
>>>>>>>     [BB/BB/../../../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>>     [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt
>>>>>>>     0-1]], socket 0[core 3[hwt 0-1]]:
>>>>>>>     [../../BB/BB/../../../../../../../../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>>     ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>>
>>>>>>>     What am I missing to match the rankfile behavior? Regarding
>>>>>>>     performance, what difference is there between the first and
>>>>>>>     the second outputs?
>>>>>>>
>>>>>>>     Thanks for your help!
>>>>>>>     Luis
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
