On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

>> The hang occurs because there is nothing on the lru to deregister and 
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts 
>> the request on its rdma pending list and continues. If any message comes in 
>> the rdma pending list is progressed. If not, it hangs indefinitely!
> 
> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
> then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of 
memory and hang.  You should read the open tickets and prior emails we have 
sent about this -- Nathan's fix merely delays when we will run out of 
registered memory.  It does not solve the underlying problem.

>> In general I have found the underlying cause of the hang is due to an 
>> imbalance of registrations between processes on a node, i.e., the hung process 
>> has an empty lru but other processes could deregister. I am working on a new 
>> mpool (grdma) to handle the imbalance. The new mpool will allow a process to 
>> request that one of its peers deregister from its lru if possible. I have a 
>> working proof of concept implementation that uses a posix shmem segment and 
>> a progress function to handle signaling and dereferencing. With it I no 
>> longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an 
>> artificial limit on the number of registrations). I will test the mpool on 
>> infiniband later today.
> 
> If a solution already exists I don't see why we have to have the message 
> code. Based on its urgency, I'm confident your patch will make its way into 
> the 1.5 series quite easily.


Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
and this is not a regression).  Keep in mind that the problem has been around 
for *a long, long time*, which is why I approved the diag message (i.e., 
because a real solution is still nowhere in sight).  The real issue is that we 
can still run out of registered memory *and there is nothing left to 
deregister*.  The real solution there is that the PML should fall back to a 
different protocol, but I'm told that doesn't happen today and would take a 
fair amount of work to get right.
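
To make the failure mode and the fallback idea concrete, here is a minimal 
toy sketch.  It is not Open MPI code: MAX_REGS, reg_cache_t, try_register(), 
choose_send_path(), and release_registration() are all hypothetical names 
invented for illustration, and the "registration" is just a counter.  It only 
shows the decision being discussed: when the limit is reached and the lru is 
empty, pick a copy-based path instead of parking the request on a pending 
list.

```c
#include <stdbool.h>

/* Toy model only: none of these names are Open MPI, verbs, or uGNI APIs. */
#define MAX_REGS 4  /* hypothetical cap on registered regions per process */

typedef enum { SEND_RDMA, SEND_COPY } send_path_t;

typedef struct {
    int in_use;     /* registrations held by active requests */
    int lru_count;  /* idle registrations that may be deregistered */
} reg_cache_t;

/* Try to register one region, evicting an idle entry from the lru if the
 * cap has been reached.  Returns false when there is nothing to evict. */
static bool try_register(reg_cache_t *c)
{
    if (c->in_use + c->lru_count >= MAX_REGS) {
        if (c->lru_count == 0) {
            return false;   /* out of registrations, nothing to deregister */
        }
        c->lru_count--;     /* deregister the least recently used entry */
    }
    c->in_use++;
    return true;
}

/* Fallback idea: if registration fails, use a copy-based protocol rather
 * than queueing the request and waiting for some other event. */
static send_path_t choose_send_path(reg_cache_t *c)
{
    return try_register(c) ? SEND_RDMA : SEND_COPY;
}

/* A finished request returns its registration to the lru for reuse. */
static void release_registration(reg_cache_t *c)
{
    if (c->in_use > 0) {
        c->in_use--;
        c->lru_count++;
    }
}
```

In this model the fifth concurrent send gets SEND_COPY instead of stalling, 
and RDMA becomes available again once any registration is released.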

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/