A strange race condition happening for undisclosed reasons, and only fixable by 
replication is jeopardizing our reference count system. That sounds 
definitively almost scary (!) 

I think that the proposed solution is just a band-aid. It somehow fixes this 
particular instance of the issue but leave all the others unpatched, asking for 
troubles later on. This problem has been lingering around for years, but we 
failed to address it correctly up to now.

Based on my understanding of the code the problem is not with the ref count but 
with the way opal_buffer_t is handled. We have no way to retrieve the pointer 
where the data in the opal_buffer_t is stored without a destructive operation. 
This means every time we need to have the pointer of the opal_buffer_t (like in 
the send operation to build the iovecs), we have to do a load followed by an 
unload, leaving the opal_buffer_t uninitialized for a short amount of time. As 
a result it is completely unsafe to use the same opal_buffer_t concurrently for 
multiple operations, as some callbacks can find the buffer uninitialized when 
they fire.


Now regarding the patch itself, I have to congratulate the Open MPI community 
for its unbelievable response time. A solution proposed, then tested on the 
faulty platforms, then the code carefully reviewed and finally pushed in a 
stable branch all in a mere 43 minutes (!). It shows that all the protection 
mechanism we put in place around our stable branches are entirely functional 
and their role is completely fulfilled. I doubt any other open source project 
can claim such a feat. Congratulations!

commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
commit in the 1.7   @ Timestamp: 08/28/12 14:00:10 (5 hours ago)

  george.

On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:

> Author: rhc (Ralph Castain)
> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
> New Revision: 27161
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
> 
> Log:
> Fix a strange race condition by creating a separate buffer for each send - 
> apparently, just a retain isn't enough protection on some systems


Reply via email to