On Aug 30, 2012, at 8:10 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> A strange race condition happening for undisclosed reasons, and only fixable
> by replication is jeopardizing our reference count system. That sounds
> definitively almost scary (!)
>
> I think that the proposed solution is just a band-aid. It somehow fixes this
> particular instance of the issue but leaves all the others unpatched, asking
> for trouble later on. This problem has been lingering around for years, but
> we failed to address it correctly up to now.
>
> Based on my understanding of the code the problem is not with the ref count
> but with the way opal_buffer_t is handled. We have no way to retrieve the
> pointer where the data in the opal_buffer_t is stored without a destructive
> operation. This means every time we need to have the pointer of the
> opal_buffer_t (like in the send operation to build the iovecs), we have to do
> a load followed by an unload, leaving the opal_buffer_t uninitialized for a
> short amount of time. As a result it is completely unsafe to use the same
> opal_buffer_t concurrently for multiple operations, as some callbacks can
> find the buffer uninitialized when they fire.

That is correct - and yes, it is a band-aid. Fixing the opal_buffer_t situation
is a much bigger issue that will require more time and effort than we had at
the moment.

>
> Now regarding the patch itself, I have to congratulate the Open MPI community
> for its unbelievable response time. A solution proposed, then tested on the
> faulty platforms, then the code carefully reviewed and finally pushed in a
> stable branch all in a mere 43 minutes (!). It shows that all the protection
> mechanisms we put in place around our stable branches are entirely functional
> and their role is completely fulfilled. I doubt any other open source project
> can claim such a feat. Congratulations!

As always, George - thanks for your positive, inspirational attitude. I'm sure
we all truly appreciate your input.

>
> commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
> commit in the 1.7 @ Timestamp: 08/28/12 14:00:10 (5 hours ago)
>
> george.
>
> On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:
>
>> Author: rhc (Ralph Castain)
>> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
>> New Revision: 27161
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
>>
>> Log:
>> Fix a strange race condition by creating a separate buffer for each send -
>> apparently, just a retain isn't enough protection on some systems
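
To make the buffer hazard described in the thread concrete, here is a minimal C sketch of a buffer whose only way to expose its payload pointer is a destructive hand-off, next to a non-destructive accessor. It is not Open MPI code: toy_buffer_t, toy_load, toy_unload, toy_peek and toy_is_initialized are hypothetical names invented for illustration, and the real opal_buffer_t carries more state than this. The sketch is single-threaded and only shows the window in which another user of the same buffer (for example a callback) would find it empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packed buffer; NOT the real opal_buffer_t. */
typedef struct {
    void  *base;   /* payload pointer, NULL while the buffer is empty */
    size_t bytes;  /* payload size in bytes */
} toy_buffer_t;

/* Give a payload to the buffer. */
static void toy_load(toy_buffer_t *b, void *payload, size_t bytes)
{
    assert(b != NULL);
    b->base = payload;
    b->bytes = bytes;
}

/* Destructive accessor: hands the payload to the caller and empties the buffer. */
static void toy_unload(toy_buffer_t *b, void **payload, size_t *bytes)
{
    assert(b != NULL);
    *payload = b->base;
    *bytes = b->bytes;
    b->base = NULL;
    b->bytes = 0;
}

/* Non-destructive accessor: the buffer keeps its payload the whole time. */
static const void *toy_peek(const toy_buffer_t *b, size_t *bytes)
{
    assert(b != NULL);
    *bytes = b->bytes;
    return b->base;
}

/* What any other user of the buffer would observe at this instant. */
static int toy_is_initialized(const toy_buffer_t *b)
{
    return b->base != NULL;
}

/* Returns 1 if the buffer looks empty between the unload and the reload. */
static int toy_unload_opens_window(void)
{
    static char payload[] = "hello";
    toy_buffer_t buf = { NULL, 0 };
    void *p = NULL;
    size_t n = 0;
    int seen_uninitialized;

    toy_load(&buf, payload, sizeof payload);
    toy_unload(&buf, &p, &n);                 /* caller now owns the pointer */
    seen_uninitialized = !toy_is_initialized(&buf);
    toy_load(&buf, p, n);                     /* put the payload back */
    return seen_uninitialized && toy_is_initialized(&buf);
}

/* Returns 1 if peeking yields the payload without ever emptying the buffer. */
static int toy_peek_keeps_buffer_valid(void)
{
    static char payload[] = "hello";
    toy_buffer_t buf = { NULL, 0 };
    size_t n = 0;
    const void *p;

    toy_load(&buf, payload, sizeof payload);
    p = toy_peek(&buf, &n);
    return p == (const void *)payload && n == sizeof payload
        && toy_is_initialized(&buf);
}
```

Calling toy_unload_opens_window() and toy_peek_keeps_buffer_valid() from a small main shows the difference: with only the destructive accessor, anything that touches the buffer between the two calls sees it empty, which is why creating a separate buffer per send sidesteps the problem without fixing it. A peek-style accessor is one possible direction, not a description of what Open MPI later adopted.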