I was able to reproduce your issue, and I think I understand the problem a
bit better now. It demonstrates exactly what I was pointing to:

It looks like things go bad when the test switches over from eager RDMA
(I'll explain in a second) to a rendezvous protocol that works entirely in
user buffer space.

If your input is smaller than some threshold, the eager RDMA limit, then
the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
buffers called "eager fragments". This pool of resources is preregistered
and pinned, and the rkeys have already been exchanged. So, in the eager
protocol, your data is copied into these "locked and loaded" RDMA frags and
the put/get is handled internally. When the data is received, it's copied
back out into your buffer. In your setup, this always works.
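
To make the two paths a bit more concrete, here is a toy sketch in C of the
copy-in/copy-out idea. This is not OMPI source: every name in it
(eager_frag_t, toy_send, the pretend_* calls) is invented for illustration,
and the RDMA/registration calls are stubs that just print what the real BTL
machinery would be doing.

/*
 * Toy sketch only -- NOT OMPI source. All names (eager_frag_t, toy_send,
 * pretend_*) are invented; the "RDMA" calls are stubs.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define EAGER_LIMIT 512                 /* plays the role of btl_openib_eager_limit */

typedef struct {
    uint8_t  data[EAGER_LIMIT];         /* preregistered, pinned scratch space      */
    uint32_t rkey;                      /* exchanged with the peer at startup       */
} eager_frag_t;

static void pretend_rdma_put(const void *buf, size_t len, uint32_t rkey)
{
    (void)buf;
    printf("RDMA put of %zu bytes using rkey 0x%x\n", len, (unsigned)rkey);
}

static uint32_t pretend_register_with_hca(const void *buf, size_t len)
{
    (void)buf;
    printf("registering %zu bytes with the HCA (slow: pins the pages)\n", len);
    return 0xbeef;                      /* pretend mkey */
}

static void toy_send(const void *user_buf, size_t len, eager_frag_t *frag)
{
    if (len <= EAGER_LIMIT) {
        /* eager path: copy into the "locked and loaded" frag, RDMA from there */
        memcpy(frag->data, user_buf, len);
        pretend_rdma_put(frag->data, len, frag->rkey);
    } else {
        /* rendezvous path: the user buffer itself has to be registered first  */
        uint32_t rkey = pretend_register_with_hca(user_buf, len);
        pretend_rdma_put(user_buf, len, rkey);
    }
}

int main(void)
{
    static char msg[4096];
    eager_frag_t frag = { .rkey = 0xcafe };

    toy_send(msg, 56,          &frag);  /* below the limit: eager path      */
    toy_send(msg, sizeof(msg), &frag);  /* above the limit: rendezvous path */
    return 0;
}

With the eager limit bumped to 512 bytes and a 448-byte per-node buffer, the
run below stays on the eager path: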

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
node 0 iteration 3, lead word received from peer is 0x00001001 [ok]

When you exceed the eager threshold, this always fails on the second
iteration. To understand this, you need to know that there is a protocol
switch: your user buffer is now used directly for the transfer, and hence
has to be registered with the HCA. Registration is an inherently
high-latency operation, and it is one of the primary motives for doing
copy-in/copy-out into preregistered buffers for small, latency-sensitive
ops. For bandwidth-bound transfers, the cost to register can be amortized
over the whole transfer, but it still affects the total bandwidth. In the
case of a rendezvous protocol where the user buffer is registered, there is
an optimization called a registration cache, used mostly to help improve
the numbers in bandwidth benchmarks. With registration caching, the user
buffer is registered once, the mkey is put into a cache, and the memory is
kept pinned until the system provides some notification, via either memory
hooks in ptmalloc2 or ummunotify, that the buffer has been freed; this
signals that the mkey can be evicted from the cache. On subsequent
send/recv operations from the same user buffer address, the OpenIB BTL
finds the address in the registration cache, takes the cached mkey, avoids
paying the cost of the memory registration, and starts the data transfer.
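
To picture the caching part, here is another toy sketch (again, not the
real OMPI rcache/mpool code; all of the names are invented, and
slow_register() just stands in for an ibv_reg_mr()-style call). The point
is only that the lookup is keyed on the virtual address of the user buffer,
and nothing is evicted until the free/unmap notification fires:

/*
 * Toy registration cache -- NOT the real OMPI implementation.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_SLOTS 16

typedef struct {
    void    *addr;                      /* user buffer virtual address        */
    size_t   len;
    uint32_t mkey;                      /* key obtained from the registration */
    int      valid;
} reg_entry_t;

static reg_entry_t cache[CACHE_SLOTS];

static uint32_t slow_register(void *addr, size_t len)
{
    (void)addr;
    printf("registering %zu bytes (expensive, pins the pages)\n", len);
    return 0x1234;                      /* pretend mkey */
}

/* Called for every large send/recv from a user buffer. */
static uint32_t get_mkey(void *addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mkey;       /* hit: registration cost is skipped  */

    uint32_t mkey = slow_register(addr, len);
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].valid) {
            cache[i] = (reg_entry_t){ addr, len, mkey, 1 };
            break;
        }
    return mkey;
}

/* Only called when memory hooks / ummunotify report that addr went away;
 * until then the entry stays valid and the memory stays pinned.           */
static void evict(void *addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr)
            cache[i].valid = 0;
}

int main(void)
{
    static char buf[8192];
    get_mkey(buf, sizeof(buf));         /* miss: registers and caches        */
    get_mkey(buf, sizeof(buf));         /* hit: reuses the cached mkey       */
    evict(buf);                         /* simulated free/unmap notification */
    return 0;
}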

What I noticed is that when the rendezvous protocol kicks in, it always
fails on the second iteration:

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
--------------------------------------------------------------------------

So, I suspected it had something to do with the way the virtual address is
being handled in this case. To test that theory, I completely disabled the
registration cache by setting -mca mpi_leave_pinned 0, and things start to
work:

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
node 0 iteration 3, lead word received from peer is 0x00001001 [ok]

I don't know enough about the memory hooks or the registration cache
implementation to speak with any authority, but it looks like this is where
the issue resides. As a workaround, can you try your original experiment
with -mca mpi_leave_pinned 0 and see if you get consistent results?
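
In case it's easier to plumb through your scripts: any MCA parameter can
also be set through the environment using the OMPI_MCA_ prefix, so (with
the rest of your command line unchanged) the following should be equivalent
to passing -mca mpi_leave_pinned 0 to mpirun:

$export OMPI_MCA_mpi_leave_pinned=0
$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl
openib,self ./ibtest -s 56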


Josh





On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
wrote:

> Hi again,
>
> I've been able to simplify my test case significantly. It now runs
> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>
> The pattern is as follows.
>
>  *  - ranks 0 and 1 both own a local buffer.
>  *  - each fills it with (deterministically known) data.
>  *  - rank 0 collects the data from rank 1's local buffer
>  *    (whose contents should be no mystery), and writes this to a
>  *    file-backed mmaped area.
>  *  - rank 0 compares what it receives with what it knows it *should
>  *  have* received.
>
> The test fails if:
>
>  *  - the openib btl is used among the 2 nodes
>  *  - a file-backed mmaped area is used for receiving the data.
>  *  - the write is done to a newly created file.
>  *  - per-node buffer is large enough.
>
> For a per-node buffer size above 12kb (12240 bytes to be exact), my
> program fails, since the MPI_Recv does not receive the correct data
> chunk (it just gets zeroes).
>
> I attach the simplified test case. I hope someone will be able to
> reproduce the problem.
>
> Best regards,
>
> E.
>
>
> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
> <emmanuel.th...@gmail.com> wrote:
> > Thanks for your answer.
> >
> > On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com>
> wrote:
> >> Just really quick off the top of my head, mmaping relies on the virtual
> >> memory subsystem, whereas IB RDMA operations rely on physical memory
> being
> >> pinned (unswappable.)
> >
> > Yes. Does that mean that the result of computations should be
> > undefined if I happen to give a user buffer which corresponds to a
> > file ? That would be surprising.
> >
> >> For a large message transfer, the OpenIB BTL will
> >> register the user buffer, which will pin the pages and make them
> >> unswappable.
> >
> > Yes. But what are the semantics of pinning the VM area pointed to by
> > ptr if ptr happens to be mmaped from a file ?
> >
> >> If the data being transfered is small, you'll copy-in/out to
> >> internal bounce buffers and you shouldn't have issues.
> >
> > Are you saying that the openib layer does have provision in this case
> > for letting the RDMA happen with a pinned physical memory range, and
> > later perform the copy to the file-backed mmaped range ? That would
> > make perfect sense indeed, although I don't have enough familiarity
> > with the OMPI code to see where it happens, and more importantly
> > whether the completion properly waits for this post-RDMA copy to
> > complete.
> >
> >
> >> 1.If you try to just bcast a few kilobytes of data using this
> technique, do
> >> you run into issues?
> >
> > No. All "simpler" attempts were successful, unfortunately. Can you be
> > a little bit more precise about what scenario you imagine ? The
> > setting "all ranks mmap a local file, and rank 0 broadcasts there" is
> > successful.
> >
> >> 2. How large is the data in the collective (input and output), is
> in_place
> >> used? I'm guess it's large enough that the BTL tries to work with the
> user
> >> buffer.
> >
> > MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
> > Collectives are with communicators of 2 nodes, and we're talking (for
> > the smallest failing run) 8kb per node (i.e. 16kb total for an
> > allgather).
> >
> > E.
> >
> >> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <
> emmanuel.th...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm stumbling on a problem related to the openib btl in
> >>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
> >>> mmaped areas for receiving data through MPI collective calls.
> >>>
> >>> A test case is attached. I've tried to make it reasonably small,
> >>> although I recognize that it's not extra thin. The test case is a
> >>> trimmed down version of what I witness in the context of a rather
> >>> large program, so there is no claim of relevance of the test case
> >>> itself. It's here just to trigger the desired misbehaviour. The test
> >>> case contains some detailed information on what is done, and the
> >>> experiments I did.
> >>>
> >>> In a nutshell, the problem is as follows.
> >>>
> >>>  - I do a computation, which involves MPI_Reduce_scatter and
> >>> MPI_Allgather.
> >>>  - I save the result to a file (collective operation).
> >>>
> >>> *If* I save the file using something such as:
> >>>  fd = open("blah", ...
> >>>  area = mmap(..., fd, )
> >>>  MPI_Gather(..., area, ...)
> >>> *AND* the MPI_Reduce_scatter is done with an alternative
> >>> implementation (which I believe is correct)
> >>> *AND* communication is done through the openib btl,
> >>>
> >>> then the file which gets saved is inconsistent with what is obtained
> >>> with the normal MPI_Reduce_scatter (although memory areas do coincide
> >>> before the save).
> >>>
> >>> I tried to dig a bit in the openib internals, but all I've been able
> >>> to witness was beyond my expertise (an RDMA read not transferring the
> >>> expected data, but I'm too uncomfortable with this layer to say
> >>> anything I'm sure about).
> >>>
> >>> Tests have been done with several openmpi versions including 1.8.3, on
> >>> a debian wheezy (7.5) + OFED 2.3 cluster.
> >>>
> >>> It would be great if someone could tell me if he is able to reproduce
> >>> the bug, or tell me whether something which is done in this test case
> >>> is illegal in any respect. I'd be glad to provide further information
> >>> which could be of any help.
> >>>
> >>> Best regards,
> >>>
> >>> E. Thomé.
> >>>
