On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

>> The hang occurs because there is nothing on the lru to deregister and
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts
>> the request on its rdma pending list and continues. If any message comes in
>> the rdma pending list is progressed. If not it hangs indefinitely!
>
> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood,
> then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of memory and hang. You should read the open tickets and prior emails we have sent about this -- Nathan's fix merely delays when we will run out of registered memory. It does not solve the underlying problem.

>> In general I have found the underlying cause of the hang is due to an
>> imbalance of registrations between processes on a node. i.e the hung process
>> has an empty lru but other processes could deregister. I am working on a new
>> mpool (grdma) to handle the imbalance. The new mpool will allow a process to
>> request that one of its peers deregisters from it lru if possible. I have a
>> working proof of concept implementation that uses a posix shmem segment and
>> a progress function to handle signaling and dereferencing. With it I no
>> longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an
>> artificial limit on the number of registrations). I will test the mpool on
>> infiniband later today.
>
> If a solution already exists I don't see why we have to have the message
> code. Based on its urgency, I'm confident your patch will make its way into
> the 1.5 quite easily.

Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression).

Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight).

The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly.

-- 
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
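
For readers following the thread, here is a rough toy model of the hang described in the quoted text. It is only a sketch under stated assumptions and is not Open MPI code: the names ToyNode, ToyProcess, evict_one, on_incoming_message and run_demo are all made up for illustration, the per-node registration limit is an invented number, and every registration is treated as idle and cached as soon as it is made. It shows one process stuck with an empty lru while a peer holds all the registrations, and the pending request only completing once the peer deregisters something.

```python
from collections import OrderedDict, deque


class ToyNode:
    """Registration budget shared by all processes on one node (hypothetical)."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0


class ToyProcess:
    """Toy stand-in for one MPI process with its own registration cache."""

    def __init__(self, node):
        self.node = node
        self.lru = OrderedDict()   # cached, idle registrations we may evict
        self.pending = deque()     # stand-in for the rdma pending list
        self.completed = []

    def evict_one(self):
        """Deregister our least recently used buffer; False if the lru is empty."""
        if not self.lru:
            return False
        self.lru.popitem(last=False)
        self.node.in_use -= 1
        return True

    def register(self, buf):
        """Stand-in for ibv_reg_mr / GNI_MemRegister: fails when out of budget."""
        while self.node.in_use >= self.node.limit:
            if not self.evict_one():
                return False       # nothing of ours left to deregister
        self.node.in_use += 1
        self.lru[buf] = True       # assumption: registration is idle right away
        return True

    def start_rdma(self, buf):
        if self.register(buf):
            self.completed.append(buf)
        else:
            self.pending.append(buf)

    def on_incoming_message(self):
        """The only point at which the pending list is retried in this model."""
        for _ in range(len(self.pending)):
            self.start_rdma(self.pending.popleft())


def run_demo():
    node = ToyNode(limit=2)
    a, b = ToyProcess(node), ToyProcess(node)
    a.start_rdma("a0")
    a.start_rdma("a1")             # A now holds the whole budget in its lru
    b.start_rdma("b0")             # B's lru is empty, so registration fails
    stuck = list(b.pending)
    b.on_incoming_message()        # retry with nothing freed: fails again
    still_stuck = list(b.pending)
    a.evict_one()                  # a peer deregisters one of its buffers
    b.on_incoming_message()        # retry now succeeds
    return stuck, still_stuck, list(b.pending), b.completed
```

In this model, process B can never make progress on its own: if no message arrives, nothing retries the pending list, and even a retry fails until some peer gives up a registration. If every process on the node has an empty lru, no retry can succeed, which is the case where falling back to a different protocol would be needed.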
