Do -cpu-set or -cpu-list work? Or is there a better way to use rankfile?
I have a cluster with 24 cores and 1 GPU per node. I would like to have
one core drive the GPU and the other 23 run thread-parallel with
OpenMP. My setup is described in my previous email to this list:
CentOS-8.2, gcc-8.3, openmpi-4.0.5
$ which mpirun
~/ompi/contrib-gcc830/openmpi-4.0.5-nodl/bin/mpirun
As noted in that email, I cannot get OMPI_Affinity_str to return the
affinities, but I am now able to get --report-bindings to work, so I can
make progress. I have tried both -cpu-set and -cpu-list, but neither
seems to do any binding. However, I did get a rankfile to work:
$ cat rankfile
rank 0=vcloud slot=0:0
rank 1=vcloud slot=0:1-23
$ mpirun -np 2 --report-bindings -rf rankfile affinity
[vcloud:3277858] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
[BB/../../../../../../../../../../../../../../../../../../../../../../..]
[vcloud:3277858] MCW rank 1 bound to socket 0[core 1[hwt 0-1]], socket
0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt
0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket
0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket
0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt
0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket
0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt
0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket
0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]:
[../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
and OpenMP returns the correct number of threads on each process (2
logical CPUs for the first rank, 46 for the second).
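Since OMPI_Affinity_str is not returning anything here, the binding can also be cross-checked from inside each process. The following is only a minimal sketch in Python, assuming Linux (os.sched_getaffinity is not available elsewhere) and assuming the launcher exports OMPI_COMM_WORLD_RANK to each process; the function name describe_affinity is hypothetical.

```python
import os


def describe_affinity(pid=0):
    """Summarize the logical CPUs the given process may run on (0 = this process)."""
    # Linux-only call: returns the set of logical CPU ids allowed for the process.
    cpus = sorted(os.sched_getaffinity(pid))
    # Assumption: Open MPI sets this variable for launched ranks; "?" otherwise.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    return f"rank {rank}: {len(cpus)} logical CPUs: {cpus}"


print(describe_affinity())
```

Launched under mpirun in place of the affinity program, each rank would print the size of its own CPU set, which should agree with what --report-bindings shows.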
However, a rankfile is inconvenient for large runs, where I would have
to parse the host file and then create a corresponding rankfile.
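To make the workaround concrete, the parse-and-generate step could look roughly like the sketch below. It assumes a hostfile with one host per line (optionally followed by slots=N, with # comments), a single socket with 24 cores on every node, and two ranks per node laid out as in the rankfile above. The function names read_hosts and make_rankfile are hypothetical.

```python
def read_hosts(hostfile_text):
    """Return unique host names in order, skipping blank lines and # comments."""
    hosts = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        host = line.split()[0]  # ignore trailing fields such as slots=24
        if host not in hosts:
            hosts.append(host)
    return hosts


def make_rankfile(hosts, cores_per_node=24, socket=0):
    """Emit two ranks per host: one on core 0, one spanning the remaining cores."""
    lines = []
    for i, host in enumerate(hosts):
        # Even rank: the single core meant to drive the GPU.
        lines.append(f"rank {2 * i}={host} slot={socket}:0")
        # Odd rank: cores 1..N-1 for the OpenMP threads.
        lines.append(f"rank {2 * i + 1}={host} slot={socket}:1-{cores_per_node - 1}")
    return "\n".join(lines) + "\n"


# Example with an inline hostfile; a real run would read the file's contents instead.
print(make_rankfile(read_hosts("vcloud slots=24\n")), end="")
```

For a hostfile listing only vcloud, this reproduces the two-line rankfile shown above. It still has to be regenerated for every allocation, which is the inconvenience.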
Is there a better way to do this?
Thanks.....John Cary
PS: I tried both -cpu-list and -cpu-set, and got the following:
$ mpirun -np 2 --report-bindings -cpu-set 0,1-23 affinity
[vcloud:3849270] MCW rank 0 is not bound (or bound to all available
processors)
[vcloud:3849270] MCW rank 1 is not bound (or bound to all available
processors)