Hi John, I would be interested to know if that does what you are expecting...
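One way to check is to run with --report-bindings and compare the cores each rank reports. Below is a minimal Python sketch, assuming each rank's report sits on a single unwrapped line in the format quoted further down. The function name `parse_bindings` is hypothetical, the core maps in the sample are shortened for illustration, and the report format may differ between Open MPI versions.

```python
import re

# Patterns based on the --report-bindings lines shown in this thread, e.g.
#   [host:pid] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [...]
RANK_RE = re.compile(r"MCW rank (\d+) bound to")
CORE_RE = re.compile(r"core (\d+)\[hwt")


def parse_bindings(text):
    """Return {rank: [core, ...]} for every binding line found in text.

    Assumes one (unwrapped) line per rank; lines without a rank are skipped.
    """
    bindings = {}
    for line in text.splitlines():
        rank_match = RANK_RE.search(line)
        if rank_match is None:
            continue
        cores = [int(c) for c in CORE_RE.findall(line)]
        bindings[int(rank_match.group(1))] = cores
    return bindings


# Sample lines modelled on the thread; the [BB/..] maps are shortened here.
sample = (
    "[epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../..]\n"
    "[epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], "
    "socket 0[core 3[hwt 0-1]]: [../../BB/BB]\n"
)

if __name__ == "__main__":
    for rank, cores in sorted(parse_bindings(sample).items()):
        print(f"rank {rank}: cores {cores}")
```

Running the same function over two captured reports and comparing the resulting dictionaries shows at a glance whether two command lines bind identically.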
On 01/03/2021 00:02, John R Cary via users wrote:
> I've been watching this exchange with interest, because it is the
> closest I have seen to what I want, but I want something slightly
> different: 2 processes per node, with the first one bound to one core,
> and the second bound to all the rest, with no use of hyperthreads.
>
> Would this be
>
> --map-by ppr:2:node --bind-to core --cpu-list 0,1-31
>
> ?
>
> Thx....
>
>
> On 2/28/21 5:44 PM, Ralph Castain via users wrote:
>> The only way I know of to do what you want is
>>
>> --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
>>
>> where you list out the exact cpus you want to use.
>>
>>
>>> On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below)
>>> but I cannot see the way of mapping every 2 cores 0,2,4,....
>>>
>>> [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>>> [BB/../../..
>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../../../../../../../../../../../../..]
>>> [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
>>> [../BB/../..
>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>> ../../../../../../../../../../../../../../../../../..]
>>>
>>> On 28/02/2021 16:24, Ralph Castain via users wrote:
>>>> Did you read the documentation on rankfile? The "slot=N" directive
>>>> says to "put this proc on core N". In your file, you stipulate that
>>>>
>>>> rank 0 is to be placed solely on core 0
>>>> rank 1 is to be placed solely on core 2
>>>> etc.
>>>>
>>>> That is not what you asked for in your mpirun cmd. You asked that
>>>> each proc be mapped to TWO cores (PE=2) or FOUR threads (PE=4 with
>>>> bind-to HWT). If you wanted that same thing in a rankfile, it
>>>> should have said
>>>>
>>>> rank 0 slots=0-1
>>>> rank 1 slots=2-3
>>>> etc.
>>>>
>>>> Hence the difference. I was simply correcting your mpirun cmd line
>>>> as you said you wanted two CORES, and that isn't guaranteed if you
>>>> are stipulating things in terms of HWTs as not every machine has
>>>> two HWTs/core.
>>>>
>>>>
>>>>
>>>>> On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to
>>>>> core reports the same binding as --map-by ppr:32:socket:PE=4
>>>>> --bind-to hwthread:
>>>>>
>>>>> [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:
>>>>> [BB/BB/../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>> [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:
>>>>> [../../BB/BB/../../
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>> [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]:
>>>>> [../../../../BB/BB/
>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>>> /../../../../../../../..]
>>>>>
>>>>> And this is still different from the output produced using the
>>>>> rankfile.
>>>>>
>>>>> Cheers,
>>>>> Luis
>>>>>
>>>>> On 28/02/2021 14:06, Ralph Castain via users wrote:
>>>>>> Your command line is incorrect:
>>>>>>
>>>>>> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>>
>>>>>> should be
>>>>>>
>>>>>> --map-by ppr:32:socket:PE=2 --bind-to core
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users
>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>
>>>>>>> I should have said, "I would like to run 128 MPI processes on 2
>>>>>>> nodes" and not 64 like I initially said...
>>>>>>>
>>>>>>> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello OMPI users,
>>>>>>>
>>>>>>> On 128 core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core),
>>>>>>> I am trying to match the behavior of running with a rankfile with
>>>>>>> manual mapping/ranking/binding.
>>>>>>>
>>>>>>> I would like to run 64 MPI processes on 2 nodes, 1 MPI process
>>>>>>> every 2 cores. That is, I want to run 32 MPI processes per socket
>>>>>>> on 2 128-core nodes. My mapping should be something like:
>>>>>>>
>>>>>>> Node 0
>>>>>>> ======
>>>>>>> rank 0 - core 0
>>>>>>> rank 1 - core 2
>>>>>>> rank 2 - core 4
>>>>>>> ...
>>>>>>> rank 63 - core 126
>>>>>>>
>>>>>>>
>>>>>>> Node 1
>>>>>>> ======
>>>>>>> rank 64 - core 0
>>>>>>> rank 65 - core 2
>>>>>>> rank 66 - core 4
>>>>>>> ...
>>>>>>> rank 127 - core 126
>>>>>>>
>>>>>>> If I use a rankfile:
>>>>>>> rank 0=epsilon102 slot=0
>>>>>>> rank 1=epsilon102 slot=2
>>>>>>> rank 2=epsilon102 slot=4
>>>>>>> rank 3=epsilon102 slot=6
>>>>>>> rank 4=epsilon102 slot=8
>>>>>>> rank 5=epsilon102 slot=10
>>>>>>> ....
>>>>>>> rank 123=epsilon103 slot=118
>>>>>>> rank 124=epsilon103 slot=120
>>>>>>> rank 125=epsilon103 slot=122
>>>>>>> rank 126=epsilon103 slot=124
>>>>>>> rank 127=epsilon103 slot=126
>>>>>>>
>>>>>>> My --report-bindings output looks like:
>>>>>>>
>>>>>>> [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>>>>>>> [BB/../../..
>>>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../../../..]
>>>>>>> [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
>>>>>>> [../../BB/..
>>>>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../../../..]
>>>>>>> [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]:
>>>>>>> [../../../..
>>>>>>> /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../../../..]
>>>>>>>
>>>>>>>
>>>>>>> However, I cannot match this --report-bindings output by manually
>>>>>>> using --map-by and --bind-to. I had the impression that this would
>>>>>>> be the same:
>>>>>>>
>>>>>>> mpirun -np $SLURM_NTASKS --report-bindings --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>>>
>>>>>>> But this output is not quite the same:
>>>>>>>
>>>>>>> [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:
>>>>>>> [BB/BB/../../../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>> [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:
>>>>>>> [../../BB/BB/../../../../../../../../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>>
>>>>>>> What am I missing to match the rankfile behavior? Regarding
>>>>>>> performance, what difference does it make between the first and
>>>>>>> the second outputs?
>>>>>>>
>>>>>>> Thanks for your help!
>>>>>>> Luis
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
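
For reference, the rankfile layout described in the original message (one rank on every second core, 64 ranks per node) can be generated instead of typed by hand. This is a minimal Python sketch under stated assumptions: the hostnames epsilon102 and epsilon103 and the "rank N=host slot=M" line format are taken from the thread, while the function name `make_rankfile` and its parameters are hypothetical.

```python
def make_rankfile(hosts, ranks_per_host, stride=2):
    """Return rankfile text placing consecutive ranks on every `stride`-th core.

    Ranks are numbered across hosts in the order given, so with 64 ranks per
    host the second host starts at rank 64 on slot 0.
    """
    lines = []
    rank = 0
    for host in hosts:
        for i in range(ranks_per_host):
            # "slot=N" asks for core N, as explained earlier in the thread
            lines.append(f"rank {rank}={host} slot={i * stride}")
            rank += 1
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hostnames as used in the thread; adjust for the actual allocation.
    print(make_rankfile(["epsilon102", "epsilon103"], 64), end="")
```

Changing `stride` or `ranks_per_host` gives other regular layouts. Irregular ones, such as one rank on a single core and another on all remaining cores, would still need hand-written entries.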