Hello,

We're running into issues with jobs failing non-deterministically when running multiple jobs concurrently within a "make test" framework.
"make test" is launched from a shell script running inside a Podman container, and we typically run with "-j 20" and "-np 4" (20 concurrent jobs with 4 processes each). We've also tried reducing the number of jobs, to no avail. Each time the battery of test cases is run, about 2 to 4 different jobs out of around 200 fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)
Program received signal SIGILL: Illegal instruction.

Some info about our setup:

* Ampere Altra 80 core ARM machine
* Open MPI 4.1.7a1 from HPC-X v2.18
* Rocky Linux 8.6 host, Rocky Linux 8.8 container
* Podman 4.4.1
* This machine has a Mellanox ConnectX-6 Lx NIC; however, we're avoiding the Mellanox software stack by running in a container, and these are single-node jobs only

We tried passing "-bind-to none" to the running jobs. This seemed to reduce the number of failing jobs on average, but it didn't eliminate the issue.

We also encounter the following warning:

[1712927028.412063] [podman-ci-rocky-8:3519 :0] sock.c:514 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file

However, as far as I can tell this is probably unrelated: it occurs because the associated file isn't accessible inside the container, and after checking the UCX source I can see that SOMAXCONN is picked up from the system headers anyway.

If anyone has hints about how to work around this issue, we'd greatly appreciate it!

Thanks,
Greg
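P.S. For anyone who wants to try reproducing the pattern outside of make, here is a minimal sketch of a stand-alone harness. It is an illustration under assumptions, not the actual test framework: the function names (run_once, run_concurrently, summarize) are hypothetical, and it simply launches the same command many times with a bounded number of concurrent jobs, then counts how many runs failed and how many of those printed "SIGILL" in their output.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one command; report its exit code and whether its output mentions SIGILL."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the SIGILL message is captured
        text=True,
        errors="replace",
    )
    return {"returncode": proc.returncode, "sigill": "SIGILL" in proc.stdout}


def run_concurrently(cmd, runs, jobs):
    """Launch `cmd` `runs` times, with at most `jobs` copies running at once."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_once, [cmd] * runs))


def summarize(results):
    """Count total runs, failed runs, and failed runs that mentioned SIGILL."""
    failed = [r for r in results if r["returncode"] != 0]
    return {
        "runs": len(results),
        "failed": len(failed),
        "sigill": sum(r["sigill"] for r in failed),
    }
```

A call along the lines of summarize(run_concurrently(["mpirun", "-np", "4", "-bind-to", "none", "./my_test"], runs=200, jobs=20)) would mimic the "-j 20" / "-np 4" configuration described above, where ./my_test is a placeholder for one of the test binaries.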