I have a cluster using Mellanox (dual port) ConnectX hardware, and
 I'm having some problems with MPI_Gather operations.  The vendor id is
 0x2c9, and the part id is 26418.  I had to add the vendor part id to
 mca-btl-openib-hca-params.ini, but the problems are the same for both
 1.2.5 and 1.3, whether the part ID is in the ini file or not.
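
 For reference, an entry of this kind follows the general shape
 sketched below.  The vendor and part IDs are the ones given above; the
 section name and the use_eager_rdma and mtu values are illustrative
 assumptions, not copied from the actual file.

```ini
# Sketch of an entry in mca-btl-openib-hca-params.ini.
# The section name is hypothetical; vendor_id and vendor_part_id are the
# values reported above; use_eager_rdma and mtu are assumed placeholders.
[Mellanox ConnectX example]
vendor_id = 0x2c9
vendor_part_id = 26418
use_eager_rdma = 1
mtu = 2048
```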

 The details of my hardware, the OpenMPI 1.3 configuration, and the
 runtime environment are included in the attached tar.gz file.

 A similar problem was reported in this message, and a 1.3 nightly was
 reported to work:
 http://www.open-mpi.org/community/lists/users/2008/01/4891.php

 I tested the code in that message, and it hangs (actually, runs very
 slowly after a few iterations) with 1.2.5, but works fine with 1.3.

 My own code starts worker processes with MPI::Comm::Spawn, and does a
 series of Bcasts and Gathers from the parent process.  Large
 messages are passed between the spawned processes using Isend / Irecv
 / Wait, and that works fine.  The crash or hang is always observed in
 the parent process during the Gather operation.

 I suspect this may have something to do with eager RDMA, so I ran some
 tests with different values of btl_openib_use_eager_rdma.  On 1.2.5,
 no difference was observed.  It always hung after about 20 Gathers.
 On 1.3:

  * Not set: parent process crashes with a null pointer dereference on
 the 10th Gather operation.
  * Set to 0: parent process crashes with a null pointer dereference on
 the 33rd Gather operation.
  * Set to 1: parent process hangs on the 7th Gather operation.
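
 The btl_openib_use_eager_rdma parameter tested above can be set per
 run on the mpirun command line with
 --mca btl_openib_use_eager_rdma <value>, or in a per-user MCA
 parameter file.  A minimal sketch of the file form, assuming the
 default per-user location $HOME/.openmpi/mca-params.conf:

```ini
# $HOME/.openmpi/mca-params.conf (assumed default per-user location)
# 0 disables eager RDMA in the openib BTL, 1 enables it.
btl_openib_use_eager_rdma = 0
```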

 I built 1.3 in debug mode and attempted to narrow down where the crash
 (a segfault due to a null pointer dereference) occurs.

 Before the crash, the stack trace looks like this:

 #0  PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
   recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
   comm=0xd5bbd0) at pgather.c:138
 #1  0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
   sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
   recvcount=1, recvtype=@0xa33950, root=0)
   at /home/matt/opt/openmpi/1.3/include/openmpi/ompi/mpi/cxx/comm_inln.h:325

 Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
 immediate crash, but comm->c_coll.coll_gather itself is not null (it
 is the same as for successful Gathers).

 Can anyone suggest where I can go from here?

 Thanks,
 Matt Hughes

Attachment: ompi-info.tar.gz
Description: GNU Zip compressed data
