I'm sure nobody has looked at the rankfile docs in many a year - nor actually tested the code for some time, especially with the newer complex chips. I can try to take a look at it locally, but it may be a few days before I get around to it.
One disturbing thing in your note was: "Also, on the cluster hwloc is not available". In this release series, I believe we do allow mpirun to execute without hwloc support (we do not in later series - in fact, we won't even let you build OMPI without HWLOC) - but there is no way we could support rankfile without it. If it is true that hwloc is not available, then (a) you should have immediately failed (and the fact that you didn't is the error here), and (b) it might explain these odd and inconsistent results.

Ralph

On Feb 3, 2022, at 1:55 AM, David Perozzi <peroz...@ethz.ch> wrote:

No problem, giving a detailed explanation is the least I can do! Thank you for taking the time.

Yeah, to be honest I'm not completely sure I'm doing the right thing with the IDs, as I had some trouble understanding the manpages. Maybe you can help me and we'll end up seeing that that was indeed the problem.

From the manpage's section about rankfiles (https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#sect13) I understand that logical cpus can be indexed in the rankfile as, for example:

$ cat myrankfile
rank 0=aa slot=1:0-2
rank 1=bb slot=0:0,1
rank 2=cc slot=1-2
$ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out

which means that

Rank 0 runs on node aa, bound to logical socket 1, cores 0-2.
Rank 1 runs on node bb, bound to logical socket 0, cores 0 and 1.
Rank 2 runs on node cc, bound to logical cores 1 and 2.

while physical cpus can be referred to as:

$ cat myphysicalrankfile
rank 0=aa slot=1
rank 1=bb slot=8
rank 2=cc slot=6

This means that

Rank 0 will run on node aa, bound to the core that contains physical PU 1
Rank 1 will run on node bb, bound to the core that contains physical PU 8
Rank 2 will run on node cc, bound to the core that contains physical PU 6

Rankfiles are treated as logical by default, and the MCA parameter rmaps_rank_file_physical must be set to 1 to indicate that the rankfile is to be considered as physical.

However, at the end of the section it says:

Starting with Open MPI v1.7, all socket/core slot locations are specified as logical indexes (the Open MPI v1.6 series used physical indexes). You can use tools such as HWLOC's "lstopo" to find the logical indexes of sockets and cores.

This confirms what you just said (I indeed wasn't sure which of the two statements to trust).
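Just to be explicit about how I'm reading that slot syntax, here is the tiny illustrative parser I use to check my own rankfiles (this is only my interpretation of the manpage, not Open MPI's actual code):

def parse_slot(spec, physical=False):
    """Decompose a rankfile slot specification (my reading of the manpage, not OMPI's code).

    Logical form (default):   "1:0-2" -> socket 1, logical cores {0, 1, 2}
                              "1-2"   -> no socket given, logical cores {1, 2}
    Physical form (with rmaps_rank_file_physical=1):
                              "8"     -> the core containing physical PU 8
    """
    socket = None
    if ":" in spec:
        socket_part, spec = spec.split(":", 1)
        socket = int(socket_part)
    cpus = set()
    for item in spec.split(","):
        if "-" in item:
            lo, hi = item.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(item))
    return socket, cpus, ("physical PUs" if physical else "logical cores")

print(parse_slot("1:0-2"))             # -> (1, {0, 1, 2}, 'logical cores')
print(parse_slot("8", physical=True))  # -> (None, {8}, 'physical PUs')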
So, my local machine has the following topology (only the cpu part is reported here):

Package L#0 (P#0 total=32684312KB CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=158 CPUModel="Intel(R) Xeon(R) E-2246G CPU @ 3.60GHz" CPUStepping=10)
  NUMANode L#0 (P#0 local=32684312KB total=32684312KB)
  L3Cache L#0 (size=12288KB linesize=64 ways=16 Inclusive=1)
    L2Cache L#0 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#0 (P#0)
            PU L#0 (P#0)
            PU L#1 (P#6)
    L2Cache L#1 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#1 (P#1)
            PU L#2 (P#1)
            PU L#3 (P#7)
    L2Cache L#2 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#2 (P#2)
            PU L#4 (P#2)
            PU L#5 (P#8)
    L2Cache L#3 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#3 (P#3)
            PU L#6 (P#3)
            PU L#7 (P#9)
    L2Cache L#4 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#4 (P#4)
            PU L#8 (P#4)
            PU L#9 (P#10)
    L2Cache L#5 (size=256KB linesize=64 ways=4 Inclusive=0)
      L1dCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
        L1iCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
          Core L#5 (P#5)
            PU L#10 (P#5)
            PU L#11 (P#11)

Now, something interesting happens if I define my rankfile to be:

$ cat rankfile
rank 0=localhost slot=1,3
rank 1=localhost slot=6,9

and run:

$ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""
--------------------------------------------------------------------------
WARNING: Open MPI tried to bind a process but failed. This is a warning
only; your job will continue, though performance may be degraded.

  Local host:        4b9bc4c4f40b
  Application name:  /usr/bin/echo
  Error message:     failed to bind memory
  Location:          rtc_hwloc.c:447
--------------------------------------------------------------------------
[4b9bc4c4f40b:00041] MCW rank 0 bound to socket 0[core 1[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../BB/../BB/../..]
[4b9bc4c4f40b:00041] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/../../BB/../..]
[4b9bc4c4f40b:00041] 1 more process has sent help message help-orte-odls-default.txt / memory not bound
[4b9bc4c4f40b:00041] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

(Just ignore the error messages, as they seem to be caused by the fact that I'm inside a Docker container, as explained here: https://github.com/open-mpi/ompi/issues/7368)

If I omit the MCA parameter for physical indexes, however, I get nothing in return:

$ mpirun -rf rankfile -report-bindings echo ""
$

and if I create a rankfile with logical indexes that should be equivalent to the other one,

$ cat rankfile_logical
rank 0=localhost slot=0:2,6
rank 1=localhost slot=0:1,7

then I get the following error message:

user@4b9bc4c4f40b:~$ mpirun -rf rankfile_logical -report-bindings echo ""
[4b9bc4c4f40b:00058] [[15180,0],0] ORTE_ERROR_LOG: Not found in file rmaps_rank_file.c at line 333
[4b9bc4c4f40b:00058] [[15180,0],0] ORTE_ERROR_LOG: Not found in file base/rmaps_base_map_job.c at line 402

If I remove the socket index:

$ cat rankfile_logical
rank 0=localhost slot=2,6
rank 1=localhost slot=1,7
$ mpirun -rf rankfile_logical -report-bindings echo ""
$

I again get nothing in return. So I'm starting to think that I may indeed be missing something. Do you see where, by chance?

Also, on the cluster hwloc is not available (and so lstopo isn't either). numactl seems to only show physical locations. Do you know another tool that could help in getting the logical IDs of the allocated cores?

You then write:

It also appears from your output that you are using hwthreads as cpus, so the slot descriptions are being applied to threads and not cores. At least, it appears that way to me - was that expected?

No, that was actually not expected. All the AMD-based nodes on the cluster have 1 thread/core, while the Intel-based nodes have hyperthreading activated. The example I sent before was on an AMD node. Maybe a stupid question, but how can I then have the slot descriptions applied to cores rather than hwthreads? From the manpages I thought that was the default behaviour.

By the way, once I manage to understand everything correctly, I can also contribute fixes for these inconsistencies in the manpages. I'd be more than happy to help where I can.
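In the meantime, the only workaround I found for the missing lstopo is to read the Linux sysfs topology files directly. That does not give me hwloc's logical numbering, but it at least shows which OS CPU ids (the "physical" PU numbers that numactl reports) share a physical core. A rough sketch of what I mean (untested on the cluster, and assuming /sys is readable on the compute nodes):

import glob
import re
from collections import defaultdict

# Group OS CPU ids by (package, core) using sysfs; offline CPUs may lack topology entries.
cores = defaultdict(list)
for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
    cpu = int(re.search(r"cpu(\d+)$", cpu_dir).group(1))
    try:
        with open(cpu_dir + "/topology/physical_package_id") as f:
            pkg = int(f.read())
        with open(cpu_dir + "/topology/core_id") as f:
            core = int(f.read())
    except OSError:
        continue
    cores[(pkg, core)].append(cpu)

for (pkg, core), cpus in sorted(cores.items()):
    print("package %d core %d: PUs %s" % (pkg, core, sorted(cpus)))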
On 03.02.22 09:50, Ralph Castain via users wrote:

Hmmm...okay, I found the code path that fails without an error - not one of the ones I was citing. Thanks for that detailed explanation of what you were doing! I'll add some code to the master branch to plug that hole along with the other I identified.

Just an FYI: we stopped supporting "physical" cpus a long time ago, so the "rmaps_rank_file_physical" MCA param is just being ignored (we don't have a way to detect that a param you cite doesn't exist). We only take the input as being "logical" cpu designations. You might check, but I suspect the two (logical and physical IDs) are the same here.

It also appears from your output that you are using hwthreads as cpus, so the slot descriptions are being applied to threads and not cores. At least, it appears that way to me - was that expected?

On Feb 3, 2022, at 12:27 AM, David Perozzi <peroz...@ethz.ch> wrote:

Thanks for looking into that, and sorry that I only included the version in use in the pastebin. I'll ask the cluster support if they could install OMPI master. I really am unfamiliar with OpenMPI's codebase, so I haven't looked into it, and I am very thankful that you could already identify possible places that I could have "visited".

One thing that I can add, however, is that I tried a dummy test both on the cluster (OMPI 4.0.2) and on my local machine (OMPI 4.0.3), which basically consists of launching the following:

$ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""

I report here the results coming from the cluster, where I allocated 6 cores, all on the same node:

$ numactl --show
policy: default
preferred node: current
physcpubind: 3 11 12 13 21 29
cpubind: 0 1
nodebind: 0 1
membind: 0 1 2 3 4 5 6 7
$ hostname
eu-g1-018-1
$ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""
[eu-g1-018-1:37621] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][]
[eu-g1-018-1:37621] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 4[hwt 0]]: [././B/./B/.][]
$ cat rankfile
rank 0=eu-g1-018-1 slot=3,11
rank 1=eu-g1-018-1 slot=12,21

However, if I change the rankfile to use an unavailable core location, e.g.

$ cat rankfile
rank 0=eu-g1-018-1 slot=3,11
rank 1=eu-g1-018-1 slot=12,28

I get no error message in return:

$ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""
$

So, at least in this version, it is quite easy to end up with no error message at all (but this is maybe one of the errors you said should never happen). I'll double (triple) check my Python script that generates the rankfile again, but as of now I'm pretty sure nothing nasty happens at that level, especially because in the case reported in my initial message one can manually check that all locations are indeed allocated to my job (by comparing the rankfile and the allocation.txt file). I was wondering whether mpirun somehow cannot find all the hosts sometimes (but sometimes it can, so it's a mystery to me)? Just wanted to point that out. Now I'll get in touch with the cluster support to see if it's possible to test on master.

Cheers,
David
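P.S.: Given the "unavailable core" behaviour above, I'm thinking of adding a check like the following to my script before calling mpirun, so that a bad slot at least fails loudly on my side (hypothetical and untested; it only covers the single-node case and assumes the slots are plain comma-separated physical cpu ids, as in the rankfiles above):

import re
import subprocess

# OS CPU ids that numactl reports as available to this job (on this node only).
out = subprocess.run(["numactl", "--show"], capture_output=True, text=True).stdout
allowed = set(map(int, re.search(r"physcpubind:\s*([\d ]+)", out).group(1).split()))

with open("rankfile") as f:
    for line in f:
        m = re.match(r"rank\s+(\d+)=(\S+)\s+slot=(\S+)", line.strip())
        if not m:
            continue
        rank, host, slots = m.groups()
        missing = {int(s) for s in slots.split(",")} - allowed
        if missing:
            print("rank %s on %s: cpus %s are not in this allocation!" % (rank, host, sorted(missing)))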
On 03.02.22 01:59, Ralph Castain via users wrote:

Are you willing to try this with OMPI master? Asking because it would be hard to push changes all the way back to 4.0.x every time we want to see if we fixed something. Also, few of us have any access to LSF, though I doubt that has much impact here, as it sounds like the issue is in the rank_file mapper.

Glancing over the rank_file mapper in the master branch, I only see a couple of places (both errors that should never happen) that wouldn't result in a gaudy "show help" message. It would be interesting to know if you are hitting those.

One way you could get more debug info is to ensure that OMPI is configured with --enable-debug and then add "--mca rmaps_base_verbose 5" to your cmd line.

On Feb 2, 2022, at 3:46 PM, Christoph Niethammer <nietham...@hlrs.de> wrote:

The linked pastebin includes the following version information:

[1,0]<stdout>:package:Open MPI spackapps@eu-c7-042-03 Distribution
[1,0]<stdout>:ompi:version:full:4.0.2
[1,0]<stdout>:ompi:version:repo:v4.0.2
[1,0]<stdout>:ompi:version:release_date:Oct 07, 2019
[1,0]<stdout>:orte:version:full:4.0.2
[1,0]<stdout>:orte:version:repo:v4.0.2
[1,0]<stdout>:orte:version:release_date:Oct 07, 2019
[1,0]<stdout>:opal:version:full:4.0.2
[1,0]<stdout>:opal:version:repo:v4.0.2
[1,0]<stdout>:opal:version:release_date:Oct 07, 2019
[1,0]<stdout>:mpi-api:version:full:3.1.0
[1,0]<stdout>:ident:4.0.2

Best
Christoph

----- Original Message -----
From: "Open MPI Users" <users@lists.open-mpi.org>
To: "Open MPI Users" <users@lists.open-mpi.org>
Cc: "Ralph Castain" <r...@open-mpi.org>
Sent: Thursday, 3 February, 2022 00:22:30
Subject: Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

Errr...what version OMPI are you using?

On Feb 2, 2022, at 3:03 PM, David Perozzi via users <users@lists.open-mpi.org> wrote:

Hello,

I'm trying to run a code implemented with OpenMPI and OpenMP (for threading) on a large cluster that uses LSF for job scheduling and dispatch. The problem with LSF is that it is not very straightforward to allocate and bind the right number of threads to an MPI rank inside a single node. Therefore, I have to create a rankfile myself as soon as the (a priori unknown) resources are allocated.

So, after my job gets dispatched, I run:

mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 --bind-to core mpi_allocation/show_numactl.sh > mpi_allocation/allocation_files/allocation.txt

where show_numactl.sh consists of just one line:

{ hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'

If I ask for 16 slots in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"), I get something like:

======================   ALLOCATED NODES   ======================
eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
eu-g1-006-1 policy: default preferred node: current physcpubind: 16 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 24 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 32 cpubind: 2 nodebind: 2 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 21 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 22 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 0 cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 1 cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 2 cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 19 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-002-3 policy: default preferred node: current physcpubind: 23 cpubind: 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
eu-g1-006-1 policy: default preferred node: current physcpubind: 52 cpubind: 3 nodebind: 3 membind: 0 1 2 3 4 5 6 7
eu-g1-009-2 policy: default preferred node: current physcpubind: 3 cpubind: 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 90 cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 91 cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 94 cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
eu-g1-005-1 policy: default preferred node: current physcpubind: 95 cpubind: 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7

After that, I parse this allocation file in Python and create a hostfile and a rankfile.

The hostfile reads:

eu-g1-006-1
eu-g1-009-2
eu-g1-002-3
eu-g1-005-1

The rankfile:

rank 0=eu-g1-006-1 slot=16,24,32,52
rank 1=eu-g1-009-2 slot=0,1,2,3
rank 2=eu-g1-002-3 slot=21,22,19,23
rank 3=eu-g1-005-1 slot=90,91,94,95

Following OpenMPI's manpages and FAQs, I then run my application using

mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"

where the bash variables are passed in directly in the bsub command (I basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots num_thread_per_rank executable_name input_file").

Now, this procedure sometimes works just fine, sometimes not. When it doesn't, the problem is that I don't get any error message (I noticed that if an error is made inside the rankfile, one does not get any error). Strangely, it seems that for 16 slots and four threads per rank (so 4 MPI ranks), it works better if I have 8 slots allocated on two nodes than if I have 4 slots on 4 different nodes. My goal is to run the application with 256 slots and 32 threads per rank (the cluster has mainly AMD EPYC based nodes).

The OMPI information for the nodes running a failed job and the rankfile for that failed job can be found at https://pastebin.com/40f6FigH, and the allocation file at https://pastebin.com/jeWnkU40.

Do you see any problem with my procedure? Why is it failing seemingly randomly? Can I somehow get more information about what's failing from mpirun?

I hope I haven't omitted too much information but, in case, just ask and I'll provide more details.

Cheers,
David
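P.S.: In case it helps, the parsing step is roughly equivalent to the following (a simplified sketch of my actual script; the hostfile path is just an example here, ranks are ordered by first appearance in allocation.txt, and one rank is written per node, which matches the 16-slot example above):

import re
from collections import OrderedDict

# host -> list of OS cpu ids taken from the "physcpubind:" fields of allocation.txt
cpus_per_host = OrderedDict()
with open("mpi_allocation/allocation_files/allocation.txt") as f:
    for line in f:
        m = re.match(r"(\S+)\s+policy:.*?physcpubind:\s*([\d ]+?)\s*cpubind:", line)
        if not m:
            continue  # skips the "ALLOCATED NODES" header block
        host, cpu_list = m.group(1), [int(c) for c in m.group(2).split()]
        cpus_per_host.setdefault(host, []).extend(cpu_list)

# One MPI rank per node, bound to all the cpus LSF gave us on that node.
with open("mpi_allocation/hostfiles/hostfile", "w") as hf, \
     open("mpi_allocation/hostfiles/rankfile", "w") as rf:
    for rank, (host, cpus) in enumerate(cpus_per_host.items()):
        hf.write(host + "\n")
        rf.write("rank %d=%s slot=%s\n" % (rank, host, ",".join(map(str, cpus))))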