On Aug 30, 2012, at 8:10 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:

> A strange race condition happening for undisclosed reasons, and only fixable 
> by replication is jeopardizing our reference count system. That sounds 
> definitively almost scary (!) 
> 
> I think that the proposed solution is just a band-aid. It somehow fixes this 
> particular instance of the issue but leave all the others unpatched, asking 
> for troubles later on. This problem has been lingering around for years, but 
> we failed to address it correctly up to now.
> 
> Based on my understanding of the code the problem is not with the ref count 
> but with the way opal_buffer_t is handled. We have no way to retrieve the 
> pointer where the data in the opal_buffer_t is stored without a destructive 
> operation. This means every time we need to have the pointer of the 
> opal_buffer_t (like in the send operation to build the iovecs), we have to do 
> a load followed by an unload, leaving the opal_buffer_t uninitialized for a 
> short amount of time. As a result it is completely unsafe to use the same 
> opal_buffer_t concurrently for multiple operations, as some callbacks can 
> find the buffer uninitialized when they fire.

That is correct - and yes, it is a bandaid. Fixing the opal_buffer_t situation 
is a much bigger issue that will require more time and effort than we had at 
the moment.

> 
> 
> Now regarding the patch itself, I have to congratulate the Open MPI community 
> for its unbelievable response time. A solution proposed, then tested on the 
> faulty platforms, then the code carefully reviewed and finally pushed in a 
> stable branch all in a mere 43 minutes (!). It shows that all the protection 
> mechanism we put in place around our stable branches are entirely functional 
> and their role is completely fulfilled. I doubt any other open source project 
> can claim such a feat. Congratulations!

As always, George - thanks for your positive, inspirational attitude. I'm sure 
we all truly appreciate your input.


> 
> commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
> commit in the 1.7   @ Timestamp: 08/28/12 14:00:10 (5 hours ago)
> 
>  george.
> 
> On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: rhc (Ralph Castain)
>> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
>> New Revision: 27161
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
>> 
>> Log:
>> Fix a strange race condition by creating a separate buffer for each send - 
>> apparently, just a retain isn't enough protection on some systems
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to