First of all, the reason I created a CPU-friendly version of MPI_Barrier is
that my program is asymmetric (so some of the nodes can easily have to wait
for several hours) and that it is I/O bound. My program uses MPI mainly to
synchronize I/O and to share some counters between the nodes, followed by a
gather/scatter of the files. MPI_Barrier (or any of the other MPI calls)
caused all four cores of my quad-core CPU to run continuously at 100% because
of the aggressive polling, making the server almost unusable and also slowing
my program down because less CPU time was available for I/O and file
synchronization. With this version of MPI_Barrier, CPU usage averages out at
about 25%. I only recently learned about the OMPI_MCA_mpi_yield_when_idle
variable; I still have to test whether it is an alternative to my workaround.
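
The approach is roughly the following (a simplified sketch, not my exact
my_barrier code; the tag and the 100 ms poll interval are just for
illustration): every rank exchanges an empty message with every other rank
and then polls the requests with MPI_Testall, sleeping between polls instead
of blocking in MPI_Waitall, which would spin.

/* Simplified sketch of a sleep-based barrier (illustration only): every rank
 * exchanges a small message with every other rank, then polls the requests
 * with MPI_Testall, sleeping between polls so the cores are not pinned at
 * 100% by the library's busy-waiting. */
#define _POSIX_C_SOURCE 199309L     /* for nanosleep */
#include <mpi.h>
#include <time.h>

#define BARRIER_TAG 4711            /* arbitrary tag, for illustration */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

void friendly_barrier(MPI_Comm comm)
{
    int rank, size, flag = 0;
    char token = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;

    MPI_Request reqs[2 * (size - 1)];   /* C99 variable-length arrays */
    char rbuf[size - 1];                /* one receive slot per peer   */
    int n = 0;

    for (int i = 0; i < size; i++) {
        if (i == rank) continue;
        MPI_Irecv(&rbuf[n / 2], 1, MPI_CHAR, i, BARRIER_TAG, comm, &reqs[n]);
        n++;
        MPI_Isend(&token, 1, MPI_CHAR, i, BARRIER_TAG, comm, &reqs[n]);
        n++;
    }

    /* Poll slowly instead of blocking in MPI_Waitall, which would spin. */
    while (!flag) {
        MPI_Testall(n, reqs, &flag, MPI_STATUSES_IGNORE);
        if (!flag)
            sleep_ms(100);          /* 100 ms between polls */
    }
}

(If setting OMPI_MCA_mpi_yield_when_idle=1, or equivalently running with
mpirun --mca mpi_yield_when_idle 1, makes Open MPI yield the processor while
it polls, a workaround like this may be unnecessary; that is what I still
have to test.)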
Meanwhile I seem to have found the cause of the problem, thanks to Ashley's
excellent padb tool. Following Eugene's recommendation, I added the MPI_Wait
call: the same problem. Next I created a separate program that just calls
my_barrier repeatedly at randomized 1-2 second intervals. Again the same
problem (with 4 nodes), sometimes after a couple of iterations, sometimes
after 500, 1000 or 2000 iterations. Next I followed Ashley's suggestion to
use padb. I ran padb --all --mpi-queue and padb --all --message-queue while
the program was running fine and again after the problem occurred. When the
problem occurred, padb said:

Warning, remote process state differs across ranks
state : ranks
    R : [2-3]
    S : [0-1]

and

$ padb --all --stack-trace --tree
Warning, remote process state differs across ranks
state : ranks
    R : [2-3]
    S : [0-1]
-----------------
[0-1] (2 processes)
-----------------
main() at ?:?
  barrier_util() at ?:?
    my_sleep() at ?:?
      __nanosleep_nocancel() at ?:?
-----------------
[2-3] (2 processes)
-----------------
??() at ?:?
  ??() at ?:?
    ??() at ?:?
      ??() at ?:?
        ??() at ?:?
          ompi_mpi_signed_char() at ?:?
            ompi_request_default_wait_all() at ?:?
              opal_progress() at ?:?
                -----------------
                2 (1 processes)
                -----------------
                mca_pml_ob1_progress() at ?:?

This suggests that rather than Open MPI being the problem, nanosleep is the
culprit, because the call to it seems to hang.
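
To double-check that the time really is spent inside nanosleep (and not, for
example, in an interrupted-and-restarted sleep), I could wrap the call and log
how long each sleep actually takes. A minimal diagnostic sketch, assuming
my_sleep simply wraps nanosleep (my_sleep_checked and the one-second threshold
are just for illustration):

/* Diagnostic sketch (not my actual my_sleep): restart nanosleep on EINTR and
 * log any sleep that takes noticeably longer than requested, to confirm
 * whether the hang really is inside nanosleep. */
#define _POSIX_C_SOURCE 200809L     /* for nanosleep and clock_gettime */
#include <errno.h>
#include <stdio.h>
#include <time.h>

static void my_sleep_checked(long ms)
{
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    struct timespec rem, t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;                      /* restart with the remaining time */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (elapsed > ms / 1000.0 + 1.0)    /* more than a second too long */
        fprintf(stderr, "nanosleep(%ld ms) took %.1f s\n", ms, elapsed);
}

If the logged times stay close to the requested interval while the barrier
still hangs, the problem is more likely in the message exchange than in
nanosleep itself.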

Thanks for all the help.

Gijsbert

On Mon, Dec 14, 2009 at 8:22 PM, Ashley Pittman <ash...@pittman.co.uk> wrote:

> On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> > The following routine gives a problem after some (not reproducible)
> > time on Fedora Core 12. The routine is a CPU usage friendly version of
> > MPI_Barrier.
>
> There are currently some proposals for non-blocking collectives before the
> MPI Forum, and I believe there is a working implementation which can be
> used as a plug-in for Open MPI; I would urge you to look at these rather
> than try to implement your own.
>
> > My question is: is there a problem with this routine that I overlooked
> > that somehow did not show up until now
>
> Your code both does all-to-all communication and also uses probe; both
> of these can easily be avoided when implementing a barrier.
>
> > Is there a way to see which messages have been sent/received/are
> > pending?
>
> Yes, there is a message queue interface allowing tools to peek inside
> the MPI library and see these queues.  As far as I know there are three
> tools which use this: TotalView, DDT, or my own tool, padb.
> TotalView and DDT are both full-featured graphical debuggers and
> commercial products; padb is an open-source text-based tool.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
