Hi Greg,

Sorry for the late response; I have strict working hours and a tight project 
schedule.
I do not have much to offer on that problem, but based on your description 
there are a few points worth checking.
You stated that you don't have problems with Intel MPI. As far as I know, with 
the --bind-to none option Intel MPI handles task affinity internally, so this 
could be the failure point for Open MPI. Do you have task affinity or cgroup 
settings anywhere in your CI pipeline? Such settings could prevent Open MPI 
from setting task affinity for the tasks.
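
For example, running something like this from inside the CI job should show 
whether an external CPU mask or cpuset is already applied (the cgroup paths 
below are assumptions about your setup; they differ between cgroup v1 and v2):

    grep Cpus_allowed_list /proc/self/status    # effective CPU affinity mask
    taskset -cp $$                              # same mask, via util-linux
    cat /sys/fs/cgroup/cpuset.cpus.effective    # cgroup v2 cpuset, if present
    cat /sys/fs/cgroup/cpuset/cpuset.cpus       # cgroup v1 cpuset, if present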

Best regards,
Mehmet

________________________________
Mehmet OREN

________________________________
From: Greg Samonds <greg.samo...@esi-group.com>
Sent: Thursday, April 18, 2024 2:55 AM
To: Mehmet Oren <mehmet...@hotmail.com>; Open MPI Users 
<users@lists.open-mpi.org>; Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Cc: Adnane Khattabi <adnane.khatt...@esi-group.com>; Philippe Rouchon 
<philippe.rouc...@esi-group.com>
Subject: RE: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently


Hi Mehmet, Gilles,



Thanks for your support on this topic.



  *   I gave "--mca pml ^ucx" a try, but unfortunately the jobs failed with 
"MPI_INIT has failed because at least one MPI process is unreachable from 
another".
  *   We use a Python-based launcher which launches an mpiexec command through 
a subprocess, and the SIGILL error occurs immediately after this - before our 
Fortran application prints out any information or begins the simulation.
  *   The “[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all 
available processors)” messages can also occur in cases which are successful, 
and do not crash with a SIGILL (I only just realized this, so the subject of 
this email is not correct, sorry about that).
  *   We do actually apply the full HPC-X environment.  Our Python launcher 
launches a shell, runs "source hpcx-init.sh" and "hpcx_load", and then copies 
that environment back so it can be passed to the mpiexec command (see the 
sketch after this list).
  *   We don’t encounter this issue when running in the same context using the 
Intel MPI/x86_64 version of our software, which uses the same source code 
branch and only differs in the libraries it’s linked with.
  *   The full execution context is: Jenkins (Groovy-based) -> Python script -> 
Podman container -> Shell script -> Make test -> Python (launcher application) 
-> MPI Fortran application
  *   We can’t reproduce the issue when omitting the first two steps by running 
on a similar machine outside of the CI (starting from the “Podman container” 
step)
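
For reference, a minimal sketch of that environment-capture step (the HPC-X 
install path is an assumption; ours is configured elsewhere):

    # run a shell that sources the HPC-X init script, then dump its environment
    bash -c 'source /opt/hpcx/hpcx-init.sh && hpcx_load && env -0' > captured.env
    # captured.env is split on NUL bytes and passed as the environment
    # of the mpiexec subprocess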



It’s very bizarre that we can only reproduce this within our Jenkins CI system, 
but not while running directly, even with the same container, hardware, OS, 
kernel, etc.  For lack of a better idea, I wonder if there could possibly be 
some strange interaction between the JVM (for Jenkins) running on the machine 
and MPI, but I don’t see how the operating system would allow something like 
that to happen.



Perhaps we can try increasing the verbosity of MPI’s output and comparing what 
we get from within the CI to what we get locally.  Would “--mca 
btl_base_verbose 100” and “--mca pml_base_verbose 100” be the best way to do 
this, or would you recommend something more specific for this situation?
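
For example, something like this (the application name is just a placeholder):

    mpiexec -np 4 --mca btl_base_verbose 100 --mca pml_base_verbose 100 ./our_app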



Regards,

Greg



From: Mehmet Oren <mehmet...@hotmail.com>
Sent: Wednesday, April 17, 2024 5:11 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Greg Samonds <greg.samo...@esi-group.com>; Adnane Khattabi 
<adnane.khatt...@esi-group.com>; Philippe Rouchon 
<philippe.rouc...@esi-group.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently



Hi Greg,



I am not an Open MPI expert, but I just wanted to share my experience with HPC-X.

  1.  Default HPC-X builds that ship with the MOFED drivers are built with UCX, 
and as Gilles stated, specifying ob1 will not change the layer for Open MPI. 
You can try discarding UCX and letting Open MPI choose the layer by adding 
"--mca pml ^ucx" to your command line (see the example after this list).
  2.  HPC-X ships with two scripts, mpivars.sh and mpivars.csh, under its bin 
folder. Sourcing mpivars.sh before running your job may be a better option than 
setting LD_LIBRARY_PATH by hand; it sets up all required paths and environment 
variables and fixes most runtime problems.
  3.  Also, please check hwloc and its dependencies, which are usually not 
present in default OS installations and container images.
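
For example (the HPC-X install prefix below is an assumption; adjust it to 
your layout):

    source /opt/hpcx/ompi/bin/mpivars.sh    # set up HPC-X paths and environment
    mpirun --mca pml ^ucx -np 4 ./your_app  # discard UCX, let Open MPI pick the layer
    ompi_info | grep hwloc                  # quick check that hwloc support is present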



Regards,

Mehmet

________________________________

From: users <users-boun...@lists.open-mpi.org> on behalf of Greg Samonds via 
users <users@lists.open-mpi.org>
Sent: Tuesday, April 16, 2024 5:50 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Greg Samonds <greg.samo...@esi-group.com>; Adnane Khattabi 
<adnane.khatt...@esi-group.com>; Philippe Rouchon 
<philippe.rouc...@esi-group.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently



Hi Gilles,



Thanks for your assistance.



I tried the recommended settings but got an error saying “sm” is no longer 
available in Open MPI 3.0+, and to use “vader” instead.  I then tried with 
“--mca pml ob1 --mca btl self,vader” but ended up with the original error:



[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available 
processors)

[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available 
processors)

[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available 
processors)

[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available 
processors)



Program received signal SIGILL: Illegal instruction.



Backtrace for this error:

#0  0xffffa202a917 in ???

#1  0xffffa20299a7 in ???

#2  0xffffa520079f in ???

#3  0xffffa1d0380c in ???

#4  0xffffa1d56fe7 in ???

#5  0xffffa1d57be7 in ???

#6  0xffffa1d5a5f7 in ???

#7  0xffffa1d5b35b in ???

#8  0xffffa17b8db7 in get_print_name_buffer

                at util/name_fns.c:106

#9  0xffffa17b8e1b in orte_util_print_jobids

                at util/name_fns.c:171

#10  0xffffa17b91eb in orte_util_print_name_args

                at util/name_fns.c:143

#11  0xffffa1822e93 in _process_name_print_for_opal

                at runtime/orte_init.c:68

#12  0xffff9ebe5e6f in process_event

                at 
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255

#13  0xffffa16ec3cf in event_process_active_single_queue

                at 
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370

#14  0xffffa16ec3cf in event_process_active

                at 
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440

#15  0xffffa16ec3cf in opal_libevent2022_event_base_loop

                at 
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644

#16  0xffffa16a9d93 in progress_engine

                at runtime/opal_progress_threads.c:105

#17  0xffffa1e678b7 in ???

#18  0xffffa1d03afb in ???

#19  0xffffffffffffffff in ???



The typical mpiexec options for each job include “-np 4 --allow-run-as-root 
--bind-to none --report-bindings” and a “-x LD_LIBRARY_PATH=…” which passes the 
HPC-X and application environment.



I will get back to you with a core dump once I figure out the best way to 
generate and retrieve it from within our CI infrastructure.
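
Presumably something along these lines would work inside the container (a 
sketch; note that core_pattern is a host-wide setting and may not be writable 
from the container):

    ulimit -c unlimited                                     # allow core files in this shell
    echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern  # name cores by executable and pid
    # after a crash: gdb ./our_app /tmp/core.<exe>.<pid>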



Thanks again!



Regards,

Greg



From: users <users-boun...@lists.open-mpi.org> On 
Behalf Of Gilles Gouaillardet via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available 
processors)" when running multiple jobs concurrently



Greg,



If Open MPI was built with UCX, your jobs will likely use UCX (and the shared 
memory provider) even if running on a single node.

You can

mpirun --mca pml ob1 --mca btl self,sm ...

if you want to avoid using UCX.



What is a typical mpirun command line used under the hood by your "make test"?

Though the warning might be ignored, SIGILL is definitely an issue.

I encourage you to have your app dump a core in order to figure out where this 
is coming from.





Cheers,



Gilles



On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users 
<users@lists.open-mpi.org> wrote:

Hello,



We’re running into issues with jobs failing in a non-deterministic way when 
running multiple jobs concurrently within a “make test” framework.



Make test is launched from within a shell script running inside a Podman 
container, and we’re typically running with “-j 20” and “-np 4” (20 jobs 
concurrently with 4 procs each).  We’ve also tried reducing the number of jobs 
to no avail.  Each time the battery of test cases is run, about 2 to 4 
different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available 
processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available 
processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

  *   Ampere Altra 80 core ARM machine
  *   Open MPI 4.1.7a1 from HPC-X v2.18
  *   Rocky Linux 8.6 host, Rocky Linux 8.8 container
  *   Podman 4.4.1
  *   This machine has a Mellanox ConnectX-6 Lx NIC, however we’re avoiding 
the Mellanox software stack by running in a container, and these are 
single-node jobs only



We tried passing "--bind-to none" to the running jobs, and while this seemed to 
reduce the number of failing jobs on average, it didn’t eliminate the issue.



We also encounter the following warning:



[1712927028.412063] [podman-ci-rocky-8:3519 :0]            sock.c:514  UCX  
WARN  unable to read somaxconn value from /proc/sys/net/core/somaxconn file



…however as far as I can tell this is probably unrelated and occurs because the 
associated file isn’t accessible inside the container; after checking the UCX 
source, I can see that SOMAXCONN is picked up from the system headers anyway.
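
A quick way to confirm this (the image tag below is an assumption):

    cat /proc/sys/net/core/somaxconn                                  # on the host
    podman run --rm rockylinux:8.8 cat /proc/sys/net/core/somaxconn   # inside a container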



If anyone has hints about how to work around this issue, we’d greatly 
appreciate it!



Thanks,

Greg
