Not exactly. The PML invokes the mpool, which invokes the registration function.
If registration fails, the mpool will deregister from its lru (if possible) and
try again. So it is not an error if ibv_reg_mr fails unless it fails because
the process is starved of registered memory (or has truly run out).
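To make the retry behaviour above concrete, here is a minimal sketch in C. It is not Open MPI code: every name prefixed with sketch_ is hypothetical, the "device" is modelled as a simple counter with a limit, and the real mpool is considerably more involved. The only point is that a failed registration becomes a real out-of-resource condition only after the lru has nothing left to evict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; these are not Open MPI constants. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERR_OUT_OF_RESOURCE (-1)

typedef struct {
    size_t pinned;   /* registrations currently held by this process */
    size_t limit;    /* how many registrations the "device" allows */
    size_t lru_len;  /* held registrations that are idle and evictable */
} sketch_mpool_t;

/* Stand-in for ibv_reg_mr / GNI_MemRegister: fails once the limit is hit. */
static bool sketch_reg_mr(sketch_mpool_t *mp)
{
    if (mp->pinned >= mp->limit) {
        return false;
    }
    mp->pinned++;
    return true;
}

/* Deregister one idle entry from the lru; false if the lru is empty. */
static bool sketch_evict_one(sketch_mpool_t *mp)
{
    if (mp->lru_len == 0) {
        return false;
    }
    mp->lru_len--;
    mp->pinned--;
    return true;
}

/* Register, evicting from the lru and retrying for as long as that helps. */
static int sketch_mpool_register(sketch_mpool_t *mp)
{
    assert(mp != NULL);
    assert(mp->lru_len <= mp->pinned);
    while (!sketch_reg_mr(mp)) {
        if (!sketch_evict_one(mp)) {
            /* Nothing left to evict: the process really is starved. */
            return SKETCH_ERR_OUT_OF_RESOURCE;
        }
    }
    return SKETCH_SUCCESS;
}
```

With pinned = 2, limit = 2 and lru_len = 1, the first call evicts one entry and succeeds; a second call finds the lru empty and returns the out-of-resource code.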
The hang occurs because there is nothing on the lru to deregister and
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the
request on its rdma pending list and continues. If any message comes in, the
rdma pending list is progressed. If not, the process hangs indefinitely!
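A rough model of that pending-list behaviour, again in C with purely hypothetical sketch_ names (this is not the ob1 code). It assumes, for illustration only, that an incoming message may free some registrations. Parked requests are retried only from the incoming-message path, so if nothing ever arrives they stay parked forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8

typedef struct {
    int pending[SKETCH_MAX_PENDING]; /* ids of requests awaiting registration */
    size_t npending;
    size_t free_regs;                /* registrations obtainable right now */
    size_t completed;                /* requests that got their registration */
} sketch_pml_t;

/* Start a request. On registration failure, park it instead of erroring. */
static bool sketch_start(sketch_pml_t *pml, int req_id)
{
    if (pml->free_regs > 0) {
        pml->free_regs--;
        pml->completed++;
        return true;
    }
    assert(pml->npending < SKETCH_MAX_PENDING);
    pml->pending[pml->npending++] = req_id;
    return false;
}

/* Runs only when a message arrives. regs_released models resources that
 * the incoming traffic happened to free (an assumption of this sketch). */
static void sketch_on_incoming(sketch_pml_t *pml, size_t regs_released)
{
    size_t done = 0;

    pml->free_regs += regs_released;
    /* Retry parked requests in FIFO order while registrations last. */
    while (done < pml->npending && pml->free_regs > 0) {
        pml->free_regs--;
        pml->completed++;
        done++;
    }
    /* Compact the list so the remaining requests keep their order. */
    for (size_t i = done; i < pml->npending; i++) {
        pml->pending[i - done] = pml->pending[i];
    }
    pml->npending -= done;
}
```

If sketch_on_incoming() is never called, npending never drops, which is the hang in miniature.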
In general I have found that the underlying cause of the hang is an imbalance
of registrations between processes on a node, i.e., the hung process has an
empty lru but other processes could deregister. I am working on a new mpool
(grdma) to handle the imbalance. The new mpool will allow a process to request
that one of its peers deregister from its lru if possible. I have a working
proof-of-concept implementation that uses a posix shmem segment and a progress
function to handle signaling and dereferencing. With it I no longer see hangs
with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the
number of registrations). I will test the mpool on InfiniBand later today.
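For readers who want a picture of the peer-request idea, here is a minimal single-process sketch in C. It is not the grdma implementation: all sketch_ names are hypothetical, and the "segment" is ordinary memory, whereas a real version would place it in a shared segment (for example one obtained with shm_open and mmap). A starved process sets a flag for a peer, and the peer's progress function notices the flag and releases one lru entry if it has one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NPROCS 2

/* One request flag per local process. In a real implementation this
 * struct would live in memory shared by all processes on the node. */
typedef struct {
    atomic_int evict_requested[SKETCH_NPROCS];
} sketch_segment_t;

typedef struct {
    int rank;        /* this process's slot in the segment */
    size_t lru_len;  /* idle registrations this process could release */
} sketch_proc_t;

static void sketch_segment_init(sketch_segment_t *seg)
{
    for (int i = 0; i < SKETCH_NPROCS; i++) {
        atomic_init(&seg->evict_requested[i], 0);
    }
}

/* A starved process asks one peer to release a registration. */
static void sketch_request_evict(sketch_segment_t *seg, int peer)
{
    assert(peer >= 0 && peer < SKETCH_NPROCS);
    atomic_store(&seg->evict_requested[peer], 1);
}

/* Progress function run by every process. Clears a pending request and
 * honours it if possible. Returns the number of registrations released. */
static int sketch_progress(sketch_segment_t *seg, sketch_proc_t *self)
{
    if (atomic_exchange(&seg->evict_requested[self->rank], 0) == 0) {
        return 0; /* nobody asked */
    }
    if (self->lru_len == 0) {
        return 0; /* asked, but nothing to give up */
    }
    self->lru_len--;
    return 1;
}
```

The atomic exchange clears the flag and reads it in one step, so a request is honoured at most once even if the progress function runs repeatedly.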
-Nathan
On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
George --
I believe that this is the subject of a few long-standing tickets (i.e., what
to do when running out of registered memory -- right now, we hang, for a few
reasons). I think that this is Mellanox's attempt to at least warn the user
that we have run out of registered memory, and will therefore hang.
Once the hangs have been fixed, I'm assuming this message can be removed.
Note, too, that this is in the BTL registration code (openib_reg_mr), not in
the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the
PML's fault.
On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an
error. If the registration returns out of resources, the BTL will return
OUT_OF_RESOURCE (for example, via mca_btl_openib_prepare_src). At the
upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts
it and inserts the request into a pending list. Later on, this pending list will
be examined and the request for the resource re-issued.
Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCE?
george.
On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:
Mike --
I would make this a more informative error. I.e., use orte_show_help(), so you
can explain the issue in more detail, and also suppress duplicates (e.g., if it
fails to register multiple times).
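As an illustration of the duplicate-suppression part only, here is a small C sketch. It does not call orte_show_help() and makes no claim about how that function is implemented; sketch_warn_once and its message text are hypothetical. The caller owns a flag, and the warning is printed the first time registration fails and skipped afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Print a registration-failure warning at most once per flag.
 * err is an errno value; returns true if the warning was printed. */
static bool sketch_warn_once(bool *already_warned, int err)
{
    assert(already_warned != NULL);
    if (*already_warned) {
        return false; /* duplicate: stay quiet */
    }
    *already_warned = true;
    fprintf(stderr,
            "warning: memory registration failed (%s); "
            "the request will be retried later\n",
            strerror(err));
    return true;
}
```

Keeping the flag in the caller (for example one per device) lets each device warn once without a single global switch silencing all of them.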
On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
Author: miked
Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
New Revision: 26106
URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
Log:
print error which is ignored on upper layer
Text files modified:
trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
==============================================================================
--- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
+++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
@@ -569,6 +569,8 @@
openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
if (NULL == openib_reg->mr) {
+ BTL_ERROR(("%s: error pinning openib memory errno says %s",
+ __func__, strerror(errno)));
return OMPI_ERR_OUT_OF_RESOURCE;
}
_______________________________________________
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel