Hello,

We're running into issues with jobs failing in a non-deterministic way when 
running multiple jobs concurrently within a "make test" framework.

"make test" is launched from a shell script running inside a Podman 
container, typically with "-j 20" and "-np 4" (20 concurrent jobs with 4 
processes each).  Reducing the number of concurrent jobs did not help.  On 
each run of the test suite, about 2 to 4 different jobs out of around 200 
fail with the following errors:

[podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
[podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)

Program received signal SIGILL: Illegal instruction.

Some info about our setup:

  *   Ampere Altra 80 core ARM machine
  *   Open MPI 4.1.7a1 from HPC-X v2.18
  *   Rocky Linux 8.6 host, Rocky Linux 8.8 container
  *   Podman 4.4.1
  *   This machine has a Mellanox ConnectX-6 Lx NIC, but we're avoiding 
the Mellanox software stack by running in a container, and these are 
single-node jobs only

We tried passing "-bind-to none" to the running jobs, and while this seemed to 
reduce the number of failing jobs on average, it didn't eliminate the issue.
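
A minimal sketch of such an invocation, for anyone trying to reproduce this. 
The test binary name (./some_test) and the RUN switch are hypothetical 
placeholders, not part of our actual test suite. It also adds 
"--report-bindings", on the assumption that this is the option producing the 
"MCW rank N is not bound" lines above.

```shell
#!/bin/sh
# Sketch only: TEST_BIN is a hypothetical placeholder, not a real test case.
TEST_BIN="${TEST_BIN:-./some_test}"
NP="${NP:-4}"

# --bind-to none disables process binding for the launched ranks.
# --report-bindings asks mpirun to print how each rank ended up bound.
MPI_CMD="mpirun -np $NP --bind-to none --report-bindings $TEST_BIN"

# Print the command by default; set RUN=1 (hypothetical switch) to execute it.
if [ "${RUN:-0}" = "1" ]; then
    $MPI_CMD
else
    echo "$MPI_CMD"
fi
```

Run without RUN=1, the script only prints the command line, so it can be 
checked before anything is launched.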

We also encounter the following warning:

[1712927028.412063] [podman-ci-rocky-8:3519 :0]            sock.c:514  UCX  WARN  unable to read somaxconn value from /proc/sys/net/core/somaxconn file

...however as far as I can tell this is probably unrelated and occurs because 
the associated file isn't accessible inside the container, and after checking 
the UCX source I can see that SOMAXCONN is picked up from the system headers 
anyway.

If anyone has hints about how to work around this issue, we'd greatly 
appreciate it!

Thanks,
Greg
