I will have to correct something here. From what I can see, it appears the MPI code may not be creating ompi_proc_t structures, but rather creating arrays of ompi_proc_t* pointers that are then filled with values equal to the pointers in the ompi_proc_list held inside of ompi/proc/proc.c.
It appears, though, that this may be done in a non-thread-safe manner. When the arrays are filled by calling ompi_proc_world or ompi_proc_all, the objects themselves are never OBJ_RETAIN'd. As a result, when the first thread in the code calls OBJ_RELEASE, the object is removed from the ompi_proc_list and free'd - but the other threads that called ompi_proc_world/all still retain pointers that now reference invalid memory. Perhaps the best path forward is to devise a thread-safe design for this code area and present it for people's review. I'll see what I can do. Again, comments are welcomed Ralph On 7/7/08 8:22 AM, "Ralph H Castain" <r...@lanl.gov> wrote: > I am seeking input before making a change to the ompi/proc/proc.c code to > resolve the referenced ticket. The change could potentially impact how the > ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't > impact you, please ignore the remainder of this note. > > > I was asked last week to take a look at ticket #1267, filed by Tim Prins > several months ago. This ticket references a report on the devel list about > thread locks when calling comm_spawn and using MPI_Init_thread. The thread > lock is caused by the constructor/destructor in the ompi_proc_t class > explicitly removing the referenced ompi_proc_t from the static local global > ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that > list operation. > > As far as I can see, Tim correctly resolved the constructor conflict by > simply removing the thread lock/unlock and list append operation from the > constructor. A scan of the code shows that OBJ_NEW is -only- called from > within the ompi/proc/proc.c code, so this won't be an issue. > > However, I noted several issues surrounding the creation and release of > ompi_proc_t objects that -may- cause problems in making a similar change to > the destructor to fix the rest of the threading problem. These may have been > created in response to the list modification code currently existing in the > ompi_proc_t object destructor - or they may be due to other factors. > > Specifically, the MPI code base outside of ompi/proc/proc.c: > > 1. -never- treats the ompi_proc_t as an opal object. Instead, the code > specifically calls calloc to create space for the structures, and then > manually constructs them. > > 2. consistently calls OBJ_RELEASE on the resulting structures, even though > they were never created as opal objects via OBJ_NEW. > > I confess to being puzzled here as the destructor will attempt to remove the > referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c, > but (since OBJ_NEW wasn't called) that object won't be on the list. Looking > at the code itself, it appears we rely on the list function to realize that > this object isn't on the list and ignore the request to remove it. We don't > check the return code from the opal_list_remove_item, and so ignore the > returned error. > > My point here is to seek comment about the proposed fix for the problem > referenced in the ticket. My proposal is to remove the thread lock/unlock > and list manipulation from the ompi_proc_t destructor. From what I can see > (as described above), this should not impact the rest of the code base. I > will then add thread lock/unlock protection explicitly to ompi_proc_finalize > to protect its list operations. > > It appears to me that this change would also open the way to allowing the > remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW > to correctly and consistently initialize those objects. I note that any > change to ompi_proc_t today (which has occurred in the not-too-distant > past!) can create a problem throughout the current code base due to the > manual construction of this object. This is why we have objects in the first > place - I suspect people didn't use OBJ_NEW because of the automatic change > it induced in the ompi_proc_list in ompi/proc/proc.c. > > Any comments would be welcome. > Ralph > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel