Thanks for your answer.

On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Just really quick off the top of my head, mmaping relies on the virtual
> memory subsystem, whereas IB RDMA operations rely on physical memory being
> pinned (unswappable.)

Yes. Does that mean that the result of computations should be undefined
if I happen to give a user buffer which corresponds to a file? That
would be surprising.

> For a large message transfer, the OpenIB BTL will
> register the user buffer, which will pin the pages and make them
> unswappable.

Yes. But what are the semantics of pinning the VM area pointed to by ptr
if ptr happens to be mmaped from a file?

> If the data being transferred is small, you'll copy-in/out to
> internal bounce buffers and you shouldn't have issues.

Are you saying that the openib layer does have provision in this case
for letting the RDMA happen with a pinned physical memory range, and
later perform the copy to the file-backed mmaped range? That would make
perfect sense indeed, although I don't have enough familiarity with the
OMPI code to see where it happens, and more importantly whether the
completion properly waits for this post-RDMA copy to complete.

> 1. If you try to just bcast a few kilobytes of data using this technique, do
> you run into issues?

No. All "simpler" attempts were successful, unfortunately. Can you be a
little bit more precise about what scenario you imagine? The setting
"all ranks mmap a local file, and rank 0 broadcasts there" is
successful.

> 2. How large is the data in the collective (input and output), is in_place
> used? I'm guessing it's large enough that the BTL tries to work with the user
> buffer.

MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
Collectives are with communicators of 2 nodes, and we're talking (for
the smallest failing run) 8kb per node (i.e. 16kb total for an
allgather).

E.

> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm stumbling on a problem related to the openib btl in
>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>> mmaped areas for receiving data through MPI collective calls.
>>
>> A test case is attached.
>> I've tried to make it reasonably small,
>> although I recognize that it's not extra thin. The test case is a
>> trimmed down version of what I witness in the context of a rather
>> large program, so there is no claim of relevance of the test case
>> itself. It's here just to trigger the desired misbehaviour. The test
>> case contains some detailed information on what is done, and the
>> experiments I did.
>>
>> In a nutshell, the problem is as follows.
>>
>> - I do a computation, which involves MPI_Reduce_scatter and
>>   MPI_Allgather.
>> - I save the result to a file (collective operation).
>>
>> *If* I save the file using something such as:
>>     fd = open("blah", ...
>>     area = mmap(..., fd, )
>>     MPI_Gather(..., area, ...)
>> *AND* the MPI_Reduce_scatter is done with an alternative
>> implementation (which I believe is correct)
>> *AND* communication is done through the openib btl,
>>
>> then the file which gets saved is inconsistent with what is obtained
>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>> before the save).
>>
>> I tried to dig a bit in the openib internals, but all I've been able
>> to witness was beyond my expertise (an RDMA read not transferring the
>> expected data, but I'm too uncomfortable with this layer to say
>> anything I'm sure about).
>>
>> Tests have been done with several openmpi versions including 1.8.3, on
>> a debian wheezy (7.5) + OFED 2.3 cluster.
>>
>> It would be great if someone could tell me if he is able to reproduce
>> the bug, or tell me whether something which is done in this test case
>> is illegal in any respect. I'd be glad to provide further information
>> which could be of any help.
>>
>> Best regards,
>>
>> E. Thomé.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/11/25730.php
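
As a side note to the bounce-buffer question raised in the reply above, here is a minimal sketch of a user-level analogue: receive into ordinary (anonymous) memory first, and only copy into the file-backed mapping once the collective has completed. The helper name `stage_into_file` and its arguments are hypothetical; none of this comes from the attached test case or from the Open MPI sources, and whether it avoids the misbehaviour reported in this thread has not been tested. The MPI call itself is left out so that the sketch stays plain POSIX; the `staging` buffer stands in for whatever buffer was handed to MPI_Gather.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (an assumption, not part of the test case):
 * copy `len` bytes from an ordinary staging buffer into a file-backed
 * shared mapping of `path`, then flush the mapping to the file.
 * In the scenario of this thread, `staging` would be the receive
 * buffer given to the collective, so that the network layer never
 * sees the file-backed area. Returns 0 on success, -1 on failure. */
int stage_into_file(const char *path, const void *staging, size_t len)
{
    if (len == 0)
        return -1;                      /* mmap rejects zero-length maps */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* The file must be at least `len` bytes long before it is mapped. */
    if (ftruncate(fd, (off_t)len) != 0) {
        perror("ftruncate");
        close(fd);
        return -1;
    }

    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    /* The explicit copy plays the role of the "post-RDMA copy". */
    memcpy(area, staging, len);

    /* MS_SYNC blocks until the dirty pages have been written back. */
    int rc = msync(area, len, MS_SYNC);
    if (rc != 0)
        perror("msync");

    munmap(area, len);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

A caller would receive into a malloc'ed buffer, wait for the collective to return, and then call `stage_into_file("blah", buf, nbytes)`. The cost is one extra copy of the gathered data.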