Ralph,

Thanks for the advice. I had to set 'coll_sync_barrier_before=5' to do the job. This is a big change from the default value (1000), so our application seems to be a pretty extreme case.
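In case it helps anyone else, the parameter can be passed on the mpirun command line with '--mca' or exported as OMPI_MCA_coll_sync_barrier_before in the environment; the line below is only an illustration patterned on our usual invocation:

    mpirun --mca coll_sync_barrier_before 5 -np 36 -bycore -bind-to-core program.exe

With that set, the sync component inserts its own barrier before every 5th collective, so no source changes are needed.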
T. Rosmond

On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote:
> Yes, this is well documented - it may be on the FAQ, but it has certainly
> come up on the user list multiple times.
>
> The problem is that one process falls behind, which causes it to begin
> accumulating "unexpected messages" in its queue. This causes the matching
> logic to run a little slower, thus making the process fall further and
> further behind. Eventually, things hang because everyone is sitting in bcast
> waiting for the slow proc to catch up, but its queue is saturated and it
> can't.
>
> The solution is to do exactly what you describe - add some barriers to force
> the slow process to catch up. This happened often enough that we even added
> support for it in OMPI itself so you don't have to modify your code. Look at
> the following from "ompi_info --param coll sync":
>
>     MCA coll: parameter "coll_base_verbose" (current value: <0>, data source: default value)
>               Verbosity level for the coll framework (0 = no verbosity)
>     MCA coll: parameter "coll_sync_priority" (current value: <50>, data source: default value)
>               Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
>     MCA coll: parameter "coll_sync_barrier_before" (current value: <1000>, data source: default value)
>               Do a synchronization before each Nth collective
>     MCA coll: parameter "coll_sync_barrier_after" (current value: <0>, data source: default value)
>               Do a synchronization after each Nth collective
>
> Take your pick - inserting a barrier before or after doesn't seem to make a
> lot of difference, but most people use "before". Try different values until
> you get something that works for you.
>
>
> On Nov 14, 2011, at 3:10 PM, Tom Rosmond wrote:
>
> > Hello:
> >
> > A colleague and I have been running a large F90 application that does an
> > enormous number of mpi_bcast calls during execution. I deny any
> > responsibility for the design of the code and why it needs these calls,
> > but it is what we have inherited and have to work with.
> >
> > Recently we ported the code to an 8 node, 6 processor/node NUMA system
> > (lstopo output attached) running Debian Linux 6.0.3 with Open MPI 1.5.3,
> > and began having trouble with mysterious 'hangs' in the program inside
> > the mpi_bcast calls. The hangs were always in the same calls, but not
> > necessarily at the same time during integration. We originally didn't
> > have NUMA support, so we reinstalled with libnuma support added, but the
> > problem persisted. Finally, just as a wild guess, we inserted
> > 'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program
> > now runs without problems.
> >
> > I believe conventional wisdom is that properly formulated MPI programs
> > should run correctly without barriers, so do you have any thoughts on
> > why we found it necessary to add them? The code has run correctly on
> > other architectures, e.g. the Cray XE6, so I don't think there is a bug
> > anywhere. My only explanation is that some internal resource gets
> > exhausted because of the large number of 'mpi_bcast' calls in rapid
> > succession, and the barrier calls force synchronization, which allows the
> > resource to be restored. Does this make sense? I'd appreciate any
> > comments and advice you can provide.
> >
> > I have attached compressed copies of config.log and ompi_info for the
> > system.
> > The program is built with ifort 12.0 and typically runs with
> >
> >     mpirun -np 36 -bycore -bind-to-core program.exe
> >
> > We have run both interactively and with PBS, but that doesn't seem to
> > make any difference in program behavior.
> >
> > T. Rosmond
> >
> > <lstopo_out.txt> <config.log.bz2> <ompi_info.bz2>
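For completeness, the hand-inserted workaround described in the quoted message above amounts to something like the following minimal F90 sketch; the buffer size, datatype, and root rank are placeholders rather than values from our actual code:

    program bcast_with_barrier
      use mpi
      implicit none
      integer, parameter :: n = 1000        ! placeholder message size
      integer, parameter :: root = 0        ! placeholder root rank
      real               :: buf(n)
      integer            :: ierr, rank

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank == root) buf = 1.0

      ! The wild guess that worked: synchronize everyone immediately before
      ! the broadcast so that no slow rank keeps accumulating unexpected
      ! messages across a long run of back-to-back broadcasts.
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      call MPI_Bcast(buf, n, MPI_REAL, root, MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)
    end program bcast_with_barrier

The coll_sync parameter does essentially the same thing inside the library, only before every Nth collective instead of before each call.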