Ralph,

If I may say so, this is exactly the type of problem the tool I have been working on recently aims to help with, and I'd be happy to help you through it.
Firstly, of the three collectives you mention, MPI_Allgather, MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one and the last a one-to-many communication pattern. The scenario of a root process falling behind and getting swamped in comms is plausible for MPI_Reduce only; it doesn't hold water for the other two. You also don't mention whether the loop is over a single collective or whether you have a loop calling a number of different collectives each iteration.

padb, the tool I've been working on, can look at parallel jobs and report on the state of collective comms, and it should help you narrow down which processes are erroneous and which are simply blocked waiting for comms. I'd recommend using it to look at maybe four or five instances where the application has hung and looking for any common features between them.

Let me know if you are willing to try this route and I'll talk you through it. The code is downloadable from http://padb.pittman.org.uk and if you want the full collective functionality you'll need to patch Open MPI with the patch from http://padb.pittman.org.uk/extensions.html

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
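
P.S. For concreteness, here is a minimal sketch (not your code, just made-up buffers and an arbitrary iteration count) of the three collectives inside a loop, with the communication pattern of each noted. In the MPI_Reduce case only rank 0 receives from everyone, which is the one place the "slow root gets swamped" picture applies; in the other two every rank (or the root's sends) must match up before anyone proceeds.

  /* sketch only: each iteration calls all three collectives */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      double local = (double)rank, sum = 0.0;
      double *all = malloc(size * sizeof(double));

      for (i = 0; i < 1000; i++) {
          /* many-to-one: every rank sends to root 0 */
          MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          /* one-to-many: root 0 sends to every rank */
          MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
          /* many-to-many: every rank exchanges with every other rank */
          MPI_Allgather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
                        MPI_COMM_WORLD);
      }

      free(all);
      MPI_Finalize();
      return 0;
  }

If your loop looks like the second or third call rather than the first, a single lagging root isn't a sufficient explanation for the hang, which is why looking at several hung instances with padb and comparing them is worthwhile.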