Hello, I am trying to play with NVIDIA's gpudirect. The test program given with the gpudirect tarball just does a basic MPI ping-pong between two processes that allocated their buffers with cudaMallocHost instead of malloc. It seems to work with Intel MPI, but Open MPI 1.5 hangs in the first MPI_Send. Replacing the CUDA buffer with a normally-malloc'ed buffer makes the program work again. I assume that something goes wrong when OMPI tries to register/pin the CUDA buffer in the IB stack (that's what gpudirect seems to be about), but I don't see why Intel MPI would succeed there.
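
In case it helps, here is a minimal sketch of what the test boils down to (this is not the actual program from the tarball; buffer size, tags and iteration pattern are my own):

/* Ping-pong between two ranks with a cudaMallocHost() pinned buffer
 * instead of malloc(). Rank 0's first MPI_Send is where OMPI 1.5
 * hangs for us; swapping cudaMallocHost for malloc makes it work. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const size_t len = 1 << 20;   /* 1 MiB payload, arbitrary */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMallocHost((void **)&buf, len);   /* pinned host memory */

    if (rank == 0) {
        MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);   /* hangs here */
        MPI_Recv(buf, len, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, len, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    cudaFreeHost(buf);
    MPI_Finalize();
    return 0;
}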
Has anybody ever looked at this? FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28 and SLES11 with and without the gpudirect patch.

Thanks,
Brice Goglin