First of all, the reason I created a CPU-friendly version of MPI_Barrier is that my program is asymmetric (so some of the nodes can easily end up waiting for several hours) and that it is I/O bound. My program uses MPI mainly to synchronize I/O and to share some counters between the nodes, followed by a gather/scatter of the files. MPI_Barrier (or any of the other MPI calls) caused the four CPUs of my Quad Core to run continuously at 100% because of the aggressive polling, making the server almost unusable and also slowing my program down because less CPU time was available for I/O and file synchronization. With this version of MPI_Barrier, CPU usage averages out at about 25%. I only recently learned about the OMPI_MCA_mpi_yield_when_idle variable; I still have to test whether that is an alternative to my workaround.

Meanwhile I seem to have found the cause of the problem, thanks to Ashley's excellent padb tool.

Following Eugene's recommendation, I added the MPI_Wait call: the same problem. Next I created a separate program that just calls my_barrier repeatedly at randomized 1-2 second intervals. Again the same problem (with 4 nodes), sometimes after a couple of iterations, sometimes after 500, 1000 or 2000 iterations.

Next I followed Ashley's suggestion to use padb. I ran padb --all --mpi-queue and padb --all --message-queue while the program was running fine and after the problem occurred. When the problem occurred padb said:
Warning, remote process state differs across ranks
state : ranks
R : [2-3]
S : [0-1]

and

$ padb --all --stack-trace --tree
Warning, remote process state differs across ranks
state : ranks
R : [2-3]
S : [0-1]
-----------------
[0-1] (2 processes)
-----------------
main() at ?:?
  barrier_util() at ?:?
    my_sleep() at ?:?
      __nanosleep_nocancel() at ?:?
-----------------
[2-3] (2 processes)
-----------------
??() at ?:?
  ??() at ?:?
    ??() at ?:?
      ??() at ?:?
        ??() at ?:?
          ompi_mpi_signed_char() at ?:?
            ompi_request_default_wait_all() at ?:?
              opal_progress() at ?:?
-----------------
2 (1 processes)
-----------------
mca_pml_ob1_progress() at ?:?

This suggests that rather than OpenMPI being the problem, nanosleep is the culprit, because the call to it seems to hang.

Thanks for all the help.

Gijsbert

On Mon, Dec 14, 2009 at 8:22 PM, Ashley Pittman <ash...@pittman.co.uk> wrote:
> On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> > The following routine gives a problem after some (not reproducible)
> > time on Fedora Core 12. The routine is a CPU usage friendly version of
> > MPI_Barrier.
>
> There are some proposals for non-blocking collectives before the MPI
> forum currently, and I believe a working implementation which can be used
> as a plug-in for OpenMPI; I would urge you to look at these rather than
> try to implement your own.
>
> > My question is: is there a problem with this routine that I overlooked
> > that somehow did not show up until now?
>
> Your code both does all-to-all communication and also uses probe; both
> of these can easily be avoided when implementing Barrier.
>
> > Is there a way to see which messages have been sent/received/are
> > pending?
>
> Yes, there is a message queue interface allowing tools to peek inside
> the MPI library and see these queues. That I know of, there are three
> tools which use this: TotalView, DDT, or my own tool, padb.
> TotalView and DDT are both full-featured graphical debuggers and
> commercial products; padb is an open-source, text-based tool.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users