Since neither bcast nor reduce acts as a barrier, it is possible to run out of resources if either of these calls (or both) is used in a tight loop. The sync coll component exists for this scenario. You can enable it by adding the following to your mpirun command line (or by setting these variables through the environment or a file):

  --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10

This will effectively throttle the collective calls for you. You can also change the reduce to an allreduce.

-Nathan
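(These MCA parameters can also be exported as environment variables using Open MPI's OMPI_MCA_ prefix, e.g. OMPI_MCA_coll_sync_priority=100.)

To illustrate the allreduce suggestion, here is a minimal C sketch of what the change might look like; the function name, buffer names, counts, and the MPI_SUM operation are illustrative assumptions, not anything taken from the poster's application:

    /* Sketch: replace the plain reduce with an allreduce so every iteration
     * ends with an implicitly synchronizing collective.  All names, counts,
     * and the reduction op are assumptions for illustration. */
    #include <mpi.h>

    void angle_step(double *params, int nparams,
                    double *local, double *result, int nresults)
    {
        /* The root still distributes the per-iteration parameters. */
        MPI_Bcast(params, nparams, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... calculate_angle(params, local) would go here ... */

        /* MPI_Allreduce delivers the combined result to every rank, so no
         * rank can return (and race into the next Bcast) before all ranks
         * have entered the call.  MPI_Reduce makes no such promise to
         * non-root ranks, so ranks can drift apart and queue up many
         * outstanding collectives. */
        MPI_Allreduce(local, result, nresults, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }

The trade-off is that every rank now receives the full result rather than only the root, which is usually a negligible cost at these message sizes.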
> On Jan 18, 2019, at 6:31 PM, Jeff Wentworth via users <users@lists.open-mpi.org> wrote:
>
> Greetings everyone,
>
> I have a scientific code using Open MPI (v3.1.3) that seems to work fine when MPI_Bcast() and MPI_Reduce() calls are well spaced out in time. Yet if the time between these calls is short, eventually one of the nodes hangs at some random point, never returning from the broadcast or reduce call. Is there some minimum time between calls that needs to be obeyed in order for Open MPI to process these reliably?
>
> The reason this has come up is because I am trying to run, in a multi-node environment, some established acceptance tests in order to verify that the Open MPI-configured version of the code yields the same baseline result as the original single-node version. These acceptance tests must pass in order for the code to be considered validated and deliverable to the customer. One of the acceptance tests that hangs involves 90 broadcasts and 90 reduces in a short period of time (less than 0.01 CPU sec), as in:
>
> Broadcast #89 in
> Broadcast #89 out 8 bytes
> Calculate angle #89
> Reduce #89 in
> Reduce #89 out 208 bytes
> Write result #89 to file on service node
> Broadcast #90 in
> Broadcast #90 out 8 bytes
> Calculate angle #90
> Reduce #90 in
> Reduce #90 out 208 bytes
> Write result #90 to file on service node
>
> If I slow down the above acceptance test, for example by running it under valgrind, then it runs to completion and yields the correct result. This seems to suggest that something internal to Open MPI is getting swamped. I understand that these acceptance tests might be pushing the limit, given that they involve so many short calculations combined with frequent, yet tiny, transfers of data among nodes.
>
> Would it be worthwhile for me to enforce some minimum wait time between the MPI calls, say 0.01 or 0.001 sec via nanosleep()? The only time it would matter would be when the acceptance tests are run, as the situation doesn't arise when beefier runs are performed.
>
> Thanks.
>
> jw2002
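For reference, here is a C sketch of the kind of tight Bcast/Reduce loop described above, with the nanosleep() throttle the poster asks about. The 90 iterations, the 8-byte broadcast, the 208-byte reduce, and the 0.001 s pause come from the message; the function names, datatypes, and reduction operation are assumptions for illustration only:

    /* Sketch of the acceptance-test loop with a crude nanosleep() throttle.
     * Iteration count and message sizes follow the log excerpt above; all
     * names and the MPI_SUM op are illustrative assumptions. */
    #define _POSIX_C_SOURCE 199309L
    #include <mpi.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        int rank;
        double angle_in = 0.0;                /* 8 bytes broadcast per step */
        double local[26] = {0}, total[26];    /* 208 bytes reduced per step */
        struct timespec throttle = { .tv_sec = 0, .tv_nsec = 1000000L };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 1; i <= 90; i++) {
            MPI_Bcast(&angle_in, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            /* ... calculate_angle(angle_in, local) would go here ... */

            MPI_Reduce(local, total, 26, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);

            if (rank == 0) {
                /* ... write result #i to the file on the service node ... */
            }

            /* Crude 0.001 s throttle, as asked about above.  With the
             * coll_sync parameters from the top of the thread enabled,
             * this pause should not be needed at all. */
            nanosleep(&throttle, NULL);
        }

        MPI_Finalize();
        return 0;
    }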