Correction: I should have said "I would like to run 128 MPI processes on 2 nodes", not 64 as I initially wrote.

On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:

> Hello OMPI users,
>
> On 128-core nodes (2 sockets x 64 cores/socket, 2 hwthreads/core), I am
> trying to match the behavior of running with a rankfile by using manual
> mapping/ranking/binding options instead.
>
> I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2
> cores. That is, I want to run 32 MPI processes per socket on 2 128-core
> nodes. My mapping should be something like:
>
> Node 0
> =====
> rank 0 - core 0
> rank 1 - core 2
> rank 2 - core 4
> ...
> rank 63 - core 126
>
> Node 1
> =====
> rank 64 - core 0
> rank 65 - core 2
> rank 66 - core 4
> ...
> rank 127 - core 126
>
> If I use a rankfile:
>
> rank 0=epsilon102 slot=0
> rank 1=epsilon102 slot=2
> rank 2=epsilon102 slot=4
> rank 3=epsilon102 slot=6
> rank 4=epsilon102 slot=8
> rank 5=epsilon102 slot=10
> ....
> rank 123=epsilon103 slot=118
> rank 124=epsilon103 slot=120
> rank 125=epsilon103 slot=122
> rank 126=epsilon103 slot=124
> rank 127=epsilon103 slot=126
>
> My --report-bindings output looks like:
>
> [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../..
> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../../../..]
> [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
> [../../BB/..
> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../../../..]
> [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]:
> [../../../..
> /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../../../..]
>
> However, I cannot match this --report-bindings output by manually using
> --map-by and --bind-to. I had the impression that this would give the same
> result:
>
> mpirun -np $SLURM_NTASKS --report-bindings --map-by ppr:32:socket:PE=4 --bind-to hwthread
>
> But this output is not quite the same:
>
> [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:
> [BB/BB/../../../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
> [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:
> [../../BB/BB/../../../../../../../../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
> ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>
> What am I missing to match the rankfile behavior? Regarding performance,
> what difference is there between the first and the second outputs?
>
> Thanks for your help!
> Luis
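
For reference, the rankfile quoted above follows a regular pattern (64 ranks per node, one rank on every second core), so it can be generated rather than typed by hand. Below is a minimal Python sketch of that pattern. The hostnames and the `rank N=host slot=M` line format are taken from the post; the function name `make_rankfile` and its parameters are hypothetical, and the sketch assumes that `slot` values map to core numbers the same way they do in the quoted output.

```python
def make_rankfile(hosts, ranks_per_node=64, stride=2):
    """Return rankfile lines placing one rank on every `stride`-th core.

    Assumption: ranks fill the first host completely before moving on to
    the next one, as in the rankfile shown in the post.
    """
    lines = []
    for rank in range(len(hosts) * ranks_per_node):
        host = hosts[rank // ranks_per_node]      # which node this rank lands on
        slot = (rank % ranks_per_node) * stride   # core index within that node
        lines.append(f"rank {rank}={host} slot={slot}")
    return lines


if __name__ == "__main__":
    # Hostnames as they appear in the post.
    print("\n".join(make_rankfile(["epsilon102", "epsilon103"])))
```

Redirecting the script's output to a file gives the full 128-line rankfile, which avoids hand-editing slips such as a missing space or a skipped rank number.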