I was having problems with 1.2.5 hanging during collective operations (MPI_Gather and MPI_Barrier):
2008/3/27 Matt Hughes <matt.c.hug...@gmail.com>:
> A similar problem was reported in this message, and a 1.3 nightly was
> reported to work:
> http://www.open-mpi.org/community/lists/users/2008/01/4891.php
>
> I tested the code in that message, and it hangs (actually, runs very
> slowly after a few iterations) with 1.2.5, but works fine with 1.3.

I was able to eliminate the hang I was seeing with 1.2.5 during the
gather operation by using these btl parameters (found at
http://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/openib/btl-openib-benchmark):

    btl_openib_max_btls=20
    btl_openib_rd_num=128
    btl_openib_rd_low=75
    btl_openib_rd_win=50
    btl_openib_max_eager_rdma=32
    mpool_base_use_mem_hooks=1
    mpi_leave_pinned=1

Only the btl_openib_rd_low=75 and btl_openib_rd_num=128 parameters are
necessary to avoid the hang.  The information given for these parameters
in ompi_info is not very helpful.  Can anyone explain (or point me to a
reference) what these parameters do and how they affect collective
operations?

Thanks,
mch

> My own code starts worker processes with MPI::Comm::Spawn, and does a
> series of Bcast's and Gather's from the parent process.  Large
> messages are passed between the spawned processes using ISend / IRecv
> / Wait, and that works fine.  The crash or hang is always observed in
> the parent process during the Gather operation.
>
> I suspect this may have something to do with eager RDMA, so I ran some
> tests with different values of btl_openib_use_eager_rdma.  On 1.2.5,
> no difference was observed: it always hung after about 20 Gathers.
> On 1.3:
>
> * Not set: parent process crashes with a null pointer dereference on
>   the 10th Gather operation.
> * Set to 0: parent process crashes with a null pointer dereference on
>   the 33rd Gather operation.
> * Set to 1: parent process hangs on the 7th Gather operation.
>
> I built 1.3 in debug mode and attempted to narrow down where the crash
> (a segfault due to a null pointer) occurs.
>
> Before the crash, the stack trace looks like this:
>
> #0  PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
>     recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
>     comm=0xd5bbd0) at pgather.c:138
> #1  0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
>     sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
>     recvcount=1, recvtype=@0xa33950, root=0)
>     at /home/matt/opt/openmpi/1.3/include/openmpi/ompi/mpi/cxx/comm_inln.h:325
>
> Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
> immediate crash, but comm->c_coll.coll_gather itself is not null (it
> is the same as for successful Gathers).
>
> Can anyone suggest where I can go from here?
>
> Thanks,
> Matt Hughes
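
For reference, a minimal sketch of how the two essential parameters can be
set; the application name and process count are placeholders.  Either on the
mpirun command line:

    mpirun --mca btl_openib_rd_num 128 --mca btl_openib_rd_low 75 \
        -np 4 ./my_app

or persistently in $HOME/.openmpi/mca-params.conf:

    btl_openib_rd_num = 128
    btl_openib_rd_low = 75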
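
For context, a minimal sketch (not the actual application) of the
spawn-plus-collectives pattern described above, written against the MPI C++
bindings.  The "worker" executable name, process count, and iteration count
are placeholders, and it assumes the workers call Merge(true) so the parent
ends up as rank 0 of the merged communicator:

    // Sketch only: spawn workers, merge, then loop over Bcast/Gather
    // from the parent.  Names and counts here are placeholders.
    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI::Init(argc, argv);

        // Spawn the worker processes, then merge the resulting
        // intercommunicator into one intracommunicator that includes
        // the parent.
        MPI::Intercomm inter = MPI::COMM_SELF.Spawn("worker", MPI::ARGV_NULL,
                                                    4, MPI::INFO_NULL, 0);
        MPI::Intracomm all = inter.Merge(false);  // parent group ordered low

        int results[5];                           // 1 parent + 4 workers
        for (int i = 0; i < 20; ++i) {
            int cmd = i;
            all.Bcast(&cmd, 1, MPI::INT, 0);      // broadcast from the parent

            int mine = 0;                         // parent's dummy contribution
            all.Gather(&mine, 1, MPI::INT, results, 1, MPI::INT, 0);
        }

        all.Free();
        inter.Free();
        MPI::Finalize();
        return 0;
    }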