Hi Ralph,

> The "slot=N" directive says to "put this proc on core N". In your file,
> you stipulate that

>
> rank 0 is to be placed solely on core 0
> rank 1 is to be placed solely on core 2
> etc.
>

That is exactly what I want to achieve, but from the mpirun command line
instead of using a rankfile, and I am failing...
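For clarity, this is the full layout I am after. A small sketch that generates the rankfile (the epsilon102/epsilon103 hostnames are the ones from my cluster; 64 ranks per node and one rank every second core are the assumptions):

```shell
# Sketch: generate the single-core rankfile I am trying to reproduce from
# the mpirun command line. Ranks 0-63 go on epsilon102, ranks 64-127 on
# epsilon103; each rank is pinned to every second core (slot = 2 * local rank).
gen_rankfile() {
  for rank in $(seq 0 127); do
    if [ "$rank" -lt 64 ]; then
      host=epsilon102
    else
      host=epsilon103
    fi
    core=$(( (rank % 64) * 2 ))
    echo "rank ${rank}=${host} slot=${core}"
  done
}
gen_rankfile > my_rankfile
```

Generating the file this way also avoids hand-editing typos (a missing space between hostname and slot, for instance).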


> That is not what you asked for in your mpirun cmd. You asked that each
> proc be mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT).
>
> If you wanted that same thing in a rankfile, it should have said
>
> rank 0 slots=0-1
> rank 1 slots=2-3
> etc.
>
> Hence the difference. I was simply correcting your mpirun cmd line as you
> said you wanted two CORES, and that isn't guaranteed if you are stipulating
> things in terms of HWTs as not every machine has two HWTs/core.
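(Written out in full, the two-cores-per-rank layout Ralph describes would look like this. This is a sketch: it reuses the epsilon102 hostname from my original rankfile and adds it to each line, since a real rankfile line needs a host, and it covers one node's 64 ranks only:)

```shell
# Sketch of the PE=2-equivalent rankfile: each rank owns a PAIR of cores
# (slots=2N-2N+1) rather than a single core, which is what
# --map-by ppr:32:socket:PE=2 --bind-to core produces.
for rank in $(seq 0 63); do
  lo=$(( rank * 2 ))
  hi=$(( lo + 1 ))
  echo "rank ${rank}=epsilon102 slots=${lo}-${hi}"
done > pe2_rankfile
```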
>
>
>
> On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users <
> users@lists.open-mpi.org> wrote:
>
> Hi Ralph,
>
> Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to core
> reports the same binding as --map-by ppr:32:socket:PE=4 --bind-to
> hwthread:
>
> [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket
> 0[core 1[hwt 0-1]]: [BB/BB/../../../../
>
> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>
> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>
> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
> /../../../../../../../..]
> [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
> socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../
>
> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>
> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>
> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
> /../../../../../../../..]
> [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]],
> socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/
>
> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>
> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>
> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
> /../../../../../../../..]
>
> And this is still different from the output produced using the rankfile.
>
> Cheers,
> Luis
>
> On 28/02/2021 14:06, Ralph Castain via users wrote:
>
> Your command line is incorrect:
>
> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>
> should be
>
> --map-by ppr:32:socket:PE=2 --bind-to core
>
>
>
> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users <
> users@lists.open-mpi.org> wrote:
>
> I should have said, "I would like to run 128 MPI processes on 2 nodes"
> and not 64 like I initially said...
>
> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
>
>> Hello OMPI users,
>>
>> On 128 core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core) , I am
>> trying to match the behavior of running with a rankfile with manual
>> mapping/ranking/binding.
>>
>> I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2
>> cores. That is, I want to run 32 MPI processes per socket on 2 128-core
>> nodes. My mapping should be something like:
>>
>> Node 0
>> =====
>> rank 0  -  core 0
>> rank 1  -  core 2
>> rank 2  -  core 4
>> ...
>> rank 63 - core 126
>>
>>
>> Node 1
>> ====
>> rank 64  -  core 0
>> rank 65  -  core 2
>> rank 66 -   core 4
>> ...
>> rank 127 - core 126
>>
>> If I use a rankfile:
>> rank 0=epsilon102 slot=0
>> rank 1=epsilon102 slot=2
>> rank 2=epsilon102 slot=4
>> rank 3=epsilon102 slot=6
>> rank 4=epsilon102 slot=8
>> rank 5=epsilon102 slot=10
>> ....
>> rank 123=epsilon103 slot=118
>> rank 124=epsilon103 slot=120
>> rank 125=epsilon103 slot=122
>> rank 126=epsilon103 slot=124
>> rank 127=epsilon103 slot=126
>>
>> My --report-bindings output looks like:
>>
>> [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>> [BB/../../..
>>
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
>> [../../BB/..
>>
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]:
>> [../../../..
>>
>> /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../../../../../../../../../../../../..]
>>
>>
>> However, I cannot match this --report-bindings output by manually using
>> --map-by and --bind-to. I had the impression that this would be the same:
>>
>> mpirun -np $SLURM_NTASKS  --report-bindings --map-by ppr:32:socket:PE=4
>> --bind-to hwthread
>>
>> But this output is not quite the same:
>>
>> [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
>> socket 0[cor
>> e 1[hwt 0-1]]:
>> [BB/BB/../../../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>> [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
>> socket 0[cor
>> e 3[hwt 0-1]]:
>> [../../BB/BB/../../../../../../../../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>
>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>
>> What am I missing to match the rankfile behavior? Regarding performance,
>> what difference does it make between the first and the second outputs?
>>
>> Thanks for your help!
>> Luis
>>
>
>
>
>