Thanks for your answer.

On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Just really quick off the top of my head, mmapping relies on the virtual
> memory subsystem, whereas IB RDMA operations rely on physical memory being
> pinned (unswappable).

Yes. Does that mean that the results of computations are undefined if
I happen to pass a user buffer which corresponds to a file? That would
be surprising.
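
To be concrete, here is the kind of thing I mean by "a user buffer
which corresponds to a file" (a minimal sketch; the file name, the
sizes and the helper name are made up, and error handling is trimmed):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <mpi.h>

  /* gather everybody's chunk into a file-backed buffer */
  void save_result(const void *chunk, int sz, int nranks, MPI_Comm comm)
  {
      int fd = open("blah", O_RDWR | O_CREAT | O_TRUNC, 0666);
      ftruncate(fd, (off_t)sz * nranks);
      void *area = mmap(NULL, (size_t)sz * nranks, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
      /* area is the "user buffer" handed to MPI; it is backed by "blah" */
      MPI_Gather(chunk, sz, MPI_BYTE, area, sz, MPI_BYTE, 0, comm);
      msync(area, (size_t)sz * nranks, MS_SYNC);
      munmap(area, (size_t)sz * nranks);
      close(fd);
  }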

> For a large message transfer, the OpenIB BTL will
> register the user buffer, which will pin the pages and make them
> unswappable.

Yes. But what are the semantics of pinning the VM area pointed to by
ptr, if ptr happens to be mmapped from a file?
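
To make the question concrete, here is the registration step in
isolation, reduced to the verbs calls (just a sketch of what I
understand the BTL does at that layer, not actual OMPI code; first
device taken blindly, error handling trimmed):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      size_t sz = 8192;
      int fd = open("scratch.bin", O_RDWR | O_CREAT | O_TRUNC, 0666);
      ftruncate(fd, sz);
      void *ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

      int n;
      struct ibv_device **devs = ibv_get_device_list(&n);
      struct ibv_context *ctx = ibv_open_device(devs[0]);
      struct ibv_pd *pd = ibv_alloc_pd(ctx);

      /* the pinning step: what exactly does this mean for a
         MAP_SHARED, file-backed range? */
      struct ibv_mr *mr = ibv_reg_mr(pd, ptr, sz,
                                     IBV_ACCESS_LOCAL_WRITE |
                                     IBV_ACCESS_REMOTE_READ |
                                     IBV_ACCESS_REMOTE_WRITE);
      printf("ibv_reg_mr on file-backed mapping: %s\n", mr ? "ok" : "failed");

      if (mr) ibv_dereg_mr(mr);
      ibv_dealloc_pd(pd);
      ibv_close_device(ctx);
      ibv_free_device_list(devs);
      munmap(ptr, sz);
      close(fd);
      return 0;
  }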

> If the data being transferred is small, you'll copy-in/out to
> internal bounce buffers and you shouldn't have issues.

Are you saying that, in this case, the openib layer does have a
provision for letting the RDMA happen within a pinned physical memory
range, and for later performing the copy to the file-backed mmapped
range? That would indeed make perfect sense, although I don't have
enough familiarity with the OMPI code to see where this happens, and,
more importantly, whether the completion properly waits for this
post-RDMA copy to finish.
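
Put differently, I would expect something with the semantics of the
following stand-in. This is emphatically not the OMPI code path, just
the behaviour I imagine, with mlock() standing in for
registration-time pinning and a hypothetical helper name:

  #include <assert.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <mpi.h>

  #define BOUNCE_SZ 16384

  /* receive into a pinned internal buffer, then copy out to the
     (possibly file-backed) user area */
  void recv_into_file_backed(void *area, int sz, int src, MPI_Comm comm)
  {
      static char bounce[BOUNCE_SZ];       /* internal, never file-backed */
      assert(sz <= BOUNCE_SZ);
      mlock(bounce, sizeof bounce);        /* stand-in for pinning */
      MPI_Recv(bounce, sz, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
      memcpy(area, bounce, sz);            /* the post-RDMA copy-out; */
      /* completion must not be signalled before this copy is done */
      munlock(bounce, sizeof bounce);
  }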


> 1. If you try to just bcast a few kilobytes of data using this technique, do
> you run into issues?

No. All "simpler" attempts were successful, unfortunately. Can you be
a little more precise about which scenario you have in mind? The
setting "all ranks mmap a local file, and rank 0 broadcasts there"
succeeds, as in the sketch below.
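
Roughly (file name and size made up, error handling trimmed):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int sz = 8192;
      /* each rank maps its own local file */
      int fd = open("local.bin", O_RDWR | O_CREAT | O_TRUNC, 0666);
      ftruncate(fd, sz);
      char *area = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

      if (rank == 0)
          memset(area, 42, sz);            /* rank 0 has the payload */
      /* rank 0 broadcasts straight into everyone's file-backed mapping */
      MPI_Bcast(area, sz, MPI_BYTE, 0, MPI_COMM_WORLD);
      msync(area, sz, MS_SYNC);

      munmap(area, sz);
      close(fd);
      MPI_Finalize();
      return 0;
  }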

> 2. How large is the data in the collective (input and output)? Is in_place
> used? I'm guessing it's large enough that the BTL tries to work with the
> user buffer.

MPI_IN_PLACE is used in reduce_scatter and allgather in the code. The
collectives run over communicators of 2 nodes, and we're talking (for
the smallest failing run) 8 kB per node (i.e. 16 kB total for an
allgather). The pattern boils down to the sketch below.
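
Namely (a skeleton only; the glue in the real code is more involved,
and the memmove reflects my reading that an in-place reduce_scatter
leaves the reduced slice at the start of the buffer):

  #include <string.h>
  #include <mpi.h>

  /* assumes exactly 2 ranks, as in the smallest failing run:
     8 kB per rank, 16 kB total, everything in place */
  void reduce_then_regather(double *buf, MPI_Comm comm)
  {
      int rank;
      MPI_Comm_rank(comm, &rank);
      int per_rank = 8192 / sizeof(double);
      int counts[2] = { per_rank, per_rank };

      /* buf holds the full 16 kB input; each rank keeps its reduced slice */
      MPI_Reduce_scatter(MPI_IN_PLACE, buf, counts, MPI_DOUBLE,
                         MPI_SUM, comm);
      /* move the slice to this rank's block before re-assembly */
      memmove(buf + rank * per_rank, buf, per_rank * sizeof(double));
      /* everyone re-assembles the full 16 kB result, again in place */
      MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                    buf, per_rank, MPI_DOUBLE, comm);
  }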

E.

> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm stumbling on a problem related to the openib btl in
>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>> mmapped areas for receiving data through MPI collective calls.
>>
>> A test case is attached. I've tried to make it reasonably small,
>> although I recognize that it's not extra thin. The test case is a
>> trimmed down version of what I witness in the context of a rather
>> large program, so there is no claim of relevance of the test case
>> itself. It's here just to trigger the desired misbehaviour. The test
>> case contains some detailed information on what is done, and the
>> experiments I did.
>>
>> In a nutshell, the problem is as follows.
>>
>>  - I do a computation, which involves MPI_Reduce_scatter and
>> MPI_Allgather.
>>  - I save the result to a file (collective operation).
>>
>> *If* I save the file using something such as:
>>  fd = open("blah", ...)
>>  area = mmap(..., fd, ...)
>>  MPI_Gather(..., area, ...)
>> *AND* the MPI_Reduce_scatter is done with an alternative
>> implementation (which I believe is correct)
>> *AND* communication is done through the openib btl,
>>
>> then the file which gets saved is inconsistent with what is obtained
>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>> before the save).
>>
>> I tried to dig a bit in the openib internals, but all I've been able
>> to witness was beyond my expertise (an RDMA read not transferring the
>> expected data, but I'm too uncomfortable with this layer to say
>> anything I'm sure about).
>>
>> Tests have been done with several openmpi versions including 1.8.3, on
>> a debian wheezy (7.5) + OFED 2.3 cluster.
>>
>> It would be great if someone could tell me whether they are able to
>> reproduce the bug, or whether something done in this test case is
>> illegal in any respect. I'd be glad to provide any further
>> information that could help.
>>
>> Best regards,
>>
>> E. Thomé.
>>
