Hi Gilles,
Thanks for your assistance.
I tried the recommended settings but got an error saying that “sm” is no longer
available in Open MPI 3.0+ and that “vader” should be used instead. I then tried
“--mca pml ob1 --mca btl self,vader” but ended up with the original error:
[podman-ci-rocky-8.8:09900] MCW rank 3 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:09899] MCW rank 2 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:09898] MCW rank 1 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:09897] MCW rank 0 is not bound (or bound to all available
processors)
Program received signal SIGILL: Illegal instruction.
Backtrace for this error:
#0 0xffffa202a917 in ???
#1 0xffffa20299a7 in ???
#2 0xffffa520079f in ???
#3 0xffffa1d0380c in ???
#4 0xffffa1d56fe7 in ???
#5 0xffffa1d57be7 in ???
#6 0xffffa1d5a5f7 in ???
#7 0xffffa1d5b35b in ???
#8 0xffffa17b8db7 in get_print_name_buffer
at util/name_fns.c:106
#9 0xffffa17b8e1b in orte_util_print_jobids
at util/name_fns.c:171
#10 0xffffa17b91eb in orte_util_print_name_args
at util/name_fns.c:143
#11 0xffffa1822e93 in _process_name_print_for_opal
at runtime/orte_init.c:68
#12 0xffff9ebe5e6f in process_event
at
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/pmix/pmix3x/pmix3x.c:255
#13 0xffffa16ec3cf in event_process_active_single_queue
at
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1370
#14 0xffffa16ec3cf in event_process_active
at
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1440
#15 0xffffa16ec3cf in opal_libevent2022_event_base_loop
at
/build-result/src/hpcx-v2.17.1-gcc-mlnx_ofed-redhat8-cuda12-aarch64/ompi-821f7a18fb5f87c7840032d0251fb36675505a64/opal/mca/event/libevent2022/libevent/event.c:1644
#16 0xffffa16a9d93 in progress_engine
at runtime/opal_progress_threads.c:105
#17 0xffffa1e678b7 in ???
#18 0xffffa1d03afb in ???
#19 0xffffffffffffffff in ???
The typical mpiexec options for each job are “-np 4 --allow-run-as-root
--bind-to none --report-bindings”, plus a “-x LD_LIBRARY_PATH=…” option that
passes along the HPC-X and application environment.
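For reference, a full launch line looks roughly like the sketch below (the
solver binary name and test arguments are placeholders, and the actual
LD_LIBRARY_PATH value is elided here):

  mpiexec -np 4 --allow-run-as-root --bind-to none --report-bindings \
          -x LD_LIBRARY_PATH=… \
          ./solver_test <test arguments>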
I will get back to you with a core dump once I figure out the best way to
generate and retrieve it from within our CI infrastructure.
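In case it's useful, my rough plan is along the lines of the sketch below (the
core file location is just an example; as I understand it, kernel.core_pattern
is system-wide rather than per-container, so it needs to be set on the host):

  # on the host: write core files to a predictable location
  sysctl -w kernel.core_pattern=/tmp/core.%e.%p

  # inside the container, before launching "make test":
  ulimit -c unlimited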
Thanks again!
Regards,
Greg
From: users <[email protected]> On Behalf Of Gilles Gouaillardet
via users
Sent: Tuesday, April 16, 2024 12:59 AM
To: Open MPI Users <[email protected]>
Cc: Gilles Gouaillardet <[email protected]>
Subject: Re: [OMPI users] "MCW rank 0 is not bound (or bound to all available
processors)" when running multiple jobs concurrently
Greg,
If Open MPI was built with UCX, your jobs will likely use UCX (and the shared
memory provider) even if running on a single node.
You can
mpirun --mca pml ob1 --mca btl self,sm ...
if you want to avoid using UCX.
What is a typical mpirun command line used under the hood by your "make test"?
Though the warning might be ignored, SIGILL is definitely an issue.
I encourage you to have your app dump a core in order to figure out where this
is coming from.
Cheers,
Gilles
On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users
<[email protected]> wrote:
Hello,
We’re running into issues with jobs failing in a non-deterministic way when
running multiple jobs concurrently within a “make test” framework.
Make test is launched from within a shell script running inside a Podman
container, and we’re typically running with “-j 20” and “-np 4” (20 jobs
concurrently with 4 procs each). We’ve also tried reducing the number of jobs
to no avail. Each time the battery of test cases is run, about 2 to 4
different jobs out of around 200 fail with the following errors:
[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available
processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available
processors)
Program received signal SIGILL: Illegal instruction.
Some info about our setup:
* Ampere Altra 80 core ARM machine
* Open MPI 4.1.7a1 from HPC-X v2.18
* Rocky Linux 8.6 host, Rocky Linux 8.8 container
* Podman 4.4.1
* This machine has a Mellanox ConnectX-6 Lx NIC; however, we’re avoiding the
Mellanox software stack by running in a container, and these are single-node
jobs only
We tried passing “--bind-to none” to the running jobs, and while this seemed to
reduce the number of failing jobs on average, it didn’t eliminate the issue.
We also encounter the following warning:
[1712927028.412063] [podman-ci-rocky-8:3519 :0] sock.c:514 UCX
WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file
…however, as far as I can tell this is probably unrelated: it occurs because
the associated file isn’t accessible inside the container, and after checking
the UCX source I can see that SOMAXCONN is picked up from the system headers
anyway.
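For what it’s worth, this can be double-checked from the host with something
like the following (the container name is a placeholder):

  podman exec <container-name> cat /proc/sys/net/core/somaxconn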
If anyone has hints about how to work around this issue, we’d greatly
appreciate it!
Thanks,
Greg