I was having problems with 1.2.5 hanging during collective operations
(MPI_Gather and MPI_Barrier):

2008/3/27 Matt Hughes <matt.c.hug...@gmail.com>:
>  A similar problem was reported in this message, and a 1.3 nightly was
>  reported to work:
>  http://www.open-mpi.org/community/lists/users/2008/01/4891.php
>
>  I tested the code in that message, and it hangs (actually, runs very
>  slowly after a few iterations) with 1.2.5, but works fine with 1.3.

I was able to eliminate the hang I was seeing with 1.2.5 during the
gather operation by using these btl parameters (found at
http://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/openib/btl-openib-benchmark):

 btl_openib_max_btls=20
 btl_openib_rd_num=128
 btl_openib_rd_low=75
 btl_openib_rd_win=50
 btl_openib_max_eager_rdma=32
 mpool_base_use_mem_hooks=1
 mpi_leave_pinned=1

Only the btl_openib_rd_low=75 and btl_openib_rd_num=128 parameters are
necessary to avoid the hang.
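
In case it helps anyone reproduce this: I passed the parameters on the
mpirun command line, but they could equally go in
$HOME/.openmpi/mca-params.conf.  The application name and process count
below are just placeholders:

 mpirun --mca btl_openib_rd_num 128 --mca btl_openib_rd_low 75 -np 8 ./my_app

or, persistently, in $HOME/.openmpi/mca-params.conf:

 btl_openib_rd_num = 128
 btl_openib_rd_low = 75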

The information given for the parameters in ompi_info is not very
helpful.  Can anyone explain (or point me to a reference) what these
parameters do and how they affect collective operations?

Thanks,
mch


>
>  My own code starts worker processes with MPI::Comm::Spawn, and does a
>  series of Bcast's and Gather's from the parent process.  Large
>  messages are passed between the spawned processes using ISend / IRecv
>  / Wait, and that works fine.  The crash or hang is always observed in
>  the parent process during the Gather operation.
>
>  I suspect this may have something to do with eager rdma, so I ran some
>  tests with different values of btl_openib_use_eager_rdma.  On 1.2.5,
>  no difference was observed.  It always hung after about 20 Gathers.
>  On 1.3:
>
>   * Not set: parent process crashes with a null pointer dereference on
>  the 10th Gather operation.
>   * Set to 0: parent process crashes with a null pointer dereference on
>  the 33rd Gather operation.
>   * Set to 1: parent process hangs on the 7th Gather operation.
>
>  I built 1.3 in debug mode and attempted to narrow down where the crash
>  (a segfault due to a null pointer dereference) occurs.
>
>  Before the crash, the stack trace looks like this:
>
>  #0  PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
>     recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
>     comm=0xd5bbd0) at pgather.c:138
>  #1  0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
>     sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
>     recvcount=1, recvtype=@0xa33950, root=0)
>     at /home/matt/opt/openmpi/1.3/include/openmpi/ompi/mpi/cxx/comm_inln.h:325
>
>  Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
>  immediate crash, but comm->c_coll.coll_gather itself is not null (it
>  is the same as for successful Gathers).
>
>  Can anyone suggest where I can go from here?
>
>  Thanks,
>  Matt Hughes
>
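
For reference, here is a minimal sketch of the parent-side pattern Matt
describes above (spawn workers, then Bcast/Gather over the resulting
intercommunicator), using the C++ bindings as in his stack trace.  The
worker executable name, worker count, and iteration count are all
hypothetical, and error handling and the worker-to-worker ISend/IRecv
traffic are omitted:

 #include <mpi.h>
 #include <vector>

 int main(int argc, char* argv[])
 {
     MPI::Init(argc, argv);

     // Hypothetical worker count and executable name.
     const int nworkers = 4;
     MPI::Intercomm workers =
         MPI::COMM_SELF.Spawn("./worker", MPI::ARGV_NULL, nworkers,
                              MPI::INFO_NULL, 0);

     std::vector<int> results(nworkers);
     for (int iter = 0; iter < 100; ++iter) {
         int cmd = iter;
         // The parent is the root of the intercommunicator collectives;
         // the workers call Bcast/Gather on the intercomm returned by
         // MPI::Comm::Get_parent(), passing root rank 0.
         workers.Bcast(&cmd, 1, MPI::INT, MPI::ROOT);
         workers.Gather(NULL, 0, MPI::INT,
                        &results[0], 1, MPI::INT, MPI::ROOT);
     }

     workers.Disconnect();
     MPI::Finalize();
     return 0;
 }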
