I apologize in advance for the size of the example source and probably the
length of the email, but this has been a pain to track down.

Our application uses System V style shared memory pretty extensively, and we
have recently found that in certain circumstances OpenMPI appears to
provide ranks with stale data.  The attached archive contains sample code
that demonstrates the issue.  There is a subroutine that uses a shared
memory array to broadcast from a single rank on one compute node to a
single rank on every other compute node.  The first call sends all 1s, the
second all 2s, and so on.  The receiving rank(s) get all 1s on the first
call, but on subsequent calls they receive a mix of 2s and 1s, then a mix
of 3s, 2s, and 1s.  The code contains both a C and a Fortran version of
this routine, but only the Fortran version appears to exhibit the problem.
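
For reference, the calling pattern looks roughly like the C sketch below.
This is not the attached code: the element count, the staging through a
single SysV array, and the use of MPI_COMM_WORLD in place of a
one-rank-per-node communicator are simplifying assumptions.

/* Minimal sketch (not the attached test) of the pattern described above:
 * the root rank fills a System V shared memory array, which is then
 * broadcast to one rank on every other node. */
#include <mpi.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define NELEMS 1024   /* illustrative array size */

/* Broadcast NELEMS ints from rank 0 of leader_comm to the other ranks,
 * using the SysV shared memory array as the send/receive buffer. */
static void shmem_bcast(int *shm_array, MPI_Comm leader_comm)
{
    MPI_Bcast(shm_array, NELEMS, MPI_INT, 0, leader_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Attach a SysV shared memory segment (error handling omitted). */
    int shmid = shmget(IPC_PRIVATE, NELEMS * sizeof(int), IPC_CREAT | 0600);
    int *shm_array = (int *)shmat(shmid, NULL, 0);

    /* The real test builds a communicator with one rank per node
     * (e.g. via MPI_Comm_split_type); this sketch just uses
     * MPI_COMM_WORLD for brevity. */
    for (int call = 1; call <= 3; ++call) {
        if (world_rank == 0)
            for (int i = 0; i < NELEMS; ++i)
                shm_array[i] = call;   /* all 1s, then all 2s, ... */

        shmem_bcast(shm_array, MPI_COMM_WORLD);
        /* Receivers should see only the value of "call" here; the bug
         * report above is that the Fortran version sees a mix of the
         * current and earlier values. */
    }

    shmdt(shm_array);
    shmctl(shmid, IPC_RMID, NULL);
    MPI_Finalize();
    return 0;
}

The sketch compresses the node-local shared memory staging into a single
Bcast on the shared array, which is enough to show the call sequence and
the expected output per call.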

I've tried this with OpenMPI 3.1.5, 4.0.2, and 4.0.4, on two systems with
very different configurations, and both show the problem.  On one of the
machines it only appears to happen when MPI is initialized through mpi4py,
so I've included that in the test as well.  Other than that, the behavior
is very consistent across machines: when run with the same number of ranks
and the same array size, the two machines even show the invalid values at
the same indices.

Please let me know if you need any additional information.

Thanks,
Patrick

Attachment: shmemTest.tgz