>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too 
>> long, and this is not a regression).  Keep in mind that the problem has been 
>> around for *a long, long time*, which is why I approved the diag message 
>> (i.e., because a real solution is still nowhere in sight).  The real issue 
>> is that we can still run out of registered memory *and there is nothing left 
>> to deregister*.  The real solution there is that the PML should fall back to 
>> a different protocol, but I'm told that doesn't happen and will require a 
>> bunch of work to make work properly.
> 
> An mpool that is aware of the local processes' LRUs will solve the problem 
> in most cases (all that I have seen), but yes, we need to rework the PML to 
> handle the remaining cases. There are two things that need to be changed 
> (from what I can tell):
> 
>  1) allow rget to fall back to send/put depending on the failure (I have 
> fallback to put implemented in my branch, and in my BTL).
>  2) need to devise new criteria for when we should progress the rdma_pending 
> list to avoid deadlock.
> 
> #1 is fairly simple, and I haven't given much thought to #2.


But #1 will be a good start in the right direction. Agreed about #2.
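
To make #1 concrete, here is a rough sketch of the fallback idea: try rget first, and move on to put and then send only when the failure is "out of registered memory". This is an illustration under assumptions, not the actual PML or BTL code. Every name in it (xfer_with_fallback, the XFER_* and PROTO_* constants, the stub functions) is hypothetical and made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; these are NOT Open MPI's real constants. */
enum xfer_status { XFER_OK = 0, XFER_OUT_OF_REG_MEM = -1, XFER_FATAL = -2 };

/* Hypothetical protocols, ordered from most to least preferred. */
enum xfer_proto { PROTO_RGET = 0, PROTO_PUT, PROTO_SEND, PROTO_COUNT };

/* One attempt at moving `len` bytes with a given protocol. */
typedef enum xfer_status (*xfer_fn)(size_t len, void *ctx);

/*
 * Try each protocol in preference order. Fall through to the next one
 * only when the failure is a registration shortage; any other error is
 * returned to the caller unchanged. On success, *used (if non-NULL)
 * records which protocol completed the transfer.
 */
enum xfer_status xfer_with_fallback(const xfer_fn protos[PROTO_COUNT],
                                    size_t len, void *ctx,
                                    enum xfer_proto *used)
{
    enum xfer_status rc = XFER_FATAL;

    assert(protos != NULL);

    for (int p = PROTO_RGET; p < PROTO_COUNT; ++p) {
        if (protos[p] == NULL) {
            continue;               /* protocol not available */
        }
        rc = protos[p](len, ctx);
        if (rc == XFER_OK) {
            if (used != NULL) {
                *used = (enum xfer_proto)p;
            }
            return rc;
        }
        if (rc != XFER_OUT_OF_REG_MEM) {
            return rc;              /* not a registration problem: give up */
        }
    }
    return rc;                      /* every available protocol failed */
}

/* Stand-in protocol implementations, for demonstration only. */
enum xfer_status stub_out_of_mem(size_t len, void *ctx)
{
    (void)len; (void)ctx;
    return XFER_OUT_OF_REG_MEM;
}

enum xfer_status stub_ok(size_t len, void *ctx)
{
    (void)len; (void)ctx;
    return XFER_OK;
}
```

With a table of { stub_out_of_mem, stub_ok, stub_ok }, the rget attempt fails for lack of registered memory and the call succeeds via PROTO_PUT. The sketch deliberately says nothing about #2: deciding when to progress the pending list is a separate question.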

> 
> -Nathan
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

