On Jan 13, 2011, at 14:08 , Jeff Squyres wrote:

> Great!
> 
> I see in your other mail that you pulled something from MPICH2 to make this 
> work.
> 
> Does that mean that there's an even-newer version of ROMIO that we should pull 
> in its entirety?  It's a little risky to pull most stuff from one released 
> version of ROMIO and then more stuff from another released version.  Meaning: 
> it's a little nicer/safer to say that we have ROMIO from a single released 
> version of MPICH2.

My understanding is that the MPICH guys provided a patch for the MPI attribute 
issue. As such the version here is the most up to date.

  george.

> 
> If possible.  :-)
> 
> Is it possible?
> 
> Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to jump 
> through so many hoops to get it ready.  :-(  But we should do it the best way 
> we can; we have history/precedent for taking ROMIO from a single 
> source/released version of MPICH[2], and I'd like to maintain that precedent 
> if at all possible.
> 
> 
> On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:
> 
>> This assertion problem is now solved by a patch in ROMIO just committed to 
>> http://bitbucket.org/devezep/new-romio-for-openmpi
>> 
>> I am not aware of any other problem in this port of ROMIO.
>> 
>> Pascal
>> 
>> Pascal Deveze wrote:
>>> Jeff Squyres wrote:
>>>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
>>>> 
>>>> 
>>>> 
>>>>> #include <mpi.h>
>>>>> 
>>>>> int main(int argc, char **argv) {
>>>>>  MPI_File fh;
>>>>>  MPI_Info info, info_used;
>>>>> 
>>>>>  MPI_Init(&argc, &argv);
>>>>> 
>>>>>  /* First open/close of the file */
>>>>>  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
>>>>>                MPI_INFO_NULL, &fh);
>>>>>  MPI_File_close(&fh);
>>>>> 
>>>>>  /* Second open/close of the same file */
>>>>>  MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
>>>>>                MPI_INFO_NULL, &fh);
>>>>>  MPI_File_close(&fh);
>>>>> 
>>>>>  MPI_Finalize();
>>>>>  return 0;
>>>>> }
>>>>> 
>>>>> I run this program on one process: salloc -p debug  -n1 mpirun -np 1 
>>>>> ./a.out
>>>>> And I get the assertion error:
>>>>> 
>>>>> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion 
>>>>> `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
>>>>> (keyval))->obj_magic_id' failed.
>>>>> [cuzco10:24785] *** Process received signal ***
>>>>> [cuzco10:24785] Signal: Aborted (6)
>>>>> 
>>>>> 
>>>> 
>>>> Ok.
>>>> 
>>>> 
>>>> 
>>>>> I saw that there is a problem with an MPI_COMM_SELF communicator.
>>>>> 
>>>>> The problem disappears (and all ROMIO tests are OK) when I comment out line 
>>>>> 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
>>>>>     // MPI_Comm_free(&(fd->comm));
>>>>> 
>>>>> The problem also disappears (and all ROMIO tests are OK) when I comment out line 
>>>>> 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
>>>>>   //  MPI_Keyval_free(&keyval);
>>>>> 
>>>>> The problem also disappears (but only 50% of the ROMIO tests are OK) when 
>>>>> I comment out line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>>>>>      // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
>>>>>     //                             ompi_mpi_comm_self.comm.c_keyhash);
>>>>> 
>>>>> 
>>>> 
>>>> It sounds like there's a problem with the ordering of shutdown of things 
>>>> in MPI_FINALIZE w.r.t. ROMIO.
>>>> 
>>>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay 
>>>> for using a 3rd party package.  One very, very important abstraction that 
>>>> we have is that top-level MPI API functions are not allowed to call any 
>>>> other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot 
>>>> call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same 
>>>> back-end implementation functions that isend does -- it's just not allowed 
>>>> to call MPI_<foo>.
>>>> 
>>>> The reason is that the top-level MPI API functions do things like check 
>>>> for whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end 
>>>> functions do not do this.  Additionally, top-level MPI API functions may 
>>>> be overridden via PMPI kinds of things.  We wouldn't want our internal 
>>>> library calls to get intercepted by user code.
>>>> 
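
A toy sketch of the layering rule described above may help. It is not Open MPI code: every name with a toy_ prefix is hypothetical and stands in for the real API and back-end layers. The point it illustrates is that the blocking call and the non-blocking call each do their own state check and then share the same back-end routines, so the blocking call never goes through the other public entry point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All toy_* names are hypothetical; this is not Open MPI's implementation. */
enum { TOY_SUCCESS = 0, TOY_ERR_NOT_INITIALIZED = -1 };

static bool toy_initialized = false; /* stands in for "MPI_INIT was called" */
static int toy_state_checks = 0;     /* how many API-level checks have run */
static int toy_started = 0;          /* sends started by the back end */
static int toy_completed = 0;        /* sends completed by the back end */

/* Back-end layer: does the work, performs no init/finalize checks. */
static int toy_backend_start_send(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return ++toy_started;            /* use the counter as a request id */
}

static void toy_backend_wait(int request)
{
    assert(request > 0 && request <= toy_started);
    toy_completed++;
}

/* Check that only the public API layer performs. */
static int toy_check_state(void)
{
    toy_state_checks++;
    return toy_initialized ? TOY_SUCCESS : TOY_ERR_NOT_INITIALIZED;
}

/* Public non-blocking send: state check, then back end. */
int toy_api_isend(const void *buf, size_t len, int *request)
{
    int rc = toy_check_state();
    if (rc != TOY_SUCCESS) {
        return rc;
    }
    *request = toy_backend_start_send(buf, len);
    return TOY_SUCCESS;
}

/* Public blocking send: same back-end calls, but it does NOT call
   toy_api_isend(), so the state check runs exactly once and an
   interposed toy_api_isend() would never see this call. */
int toy_api_send(const void *buf, size_t len)
{
    int rc = toy_check_state();
    if (rc != TOY_SUCCESS) {
        return rc;
    }
    int request = toy_backend_start_send(buf, len);
    toy_backend_wait(request);
    return TOY_SUCCESS;
}
```

If toy_api_send() were written in terms of toy_api_isend(), the check would run twice per send, and a user-level wrapper around toy_api_isend() would observe calls the user never made.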
>>>> 
>>>> 
>>>>> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism, and so 
>>>>> far I do not understand the real origin of this problem.
>>>>> 
>>>>> 
>>>> 
>>>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Waaaay back in 
>>>> the beginning of the project, we debated whether to use C or C++ for 
>>>> developing the code.  There was a desire to use some of the basic object 
>>>> functionality of C++ (e.g., derived classes, constructors, destructors, 
>>>> etc.), but we wanted to stay as portable as possible.  So we ended up 
>>>> going with C, but with a few macros that emulate some C++-like 
>>>> functionality.  This led to OMPI's OBJ system that is used all over the 
>>>> place.  
>>>> 
>>>> The OBJ system does several things:
>>>> 
>>>> - allows you to have "constructor"- and "destructor"-like behavior for 
>>>> structs
>>>> - works for both stack and heap memory
>>>> - reference counting
>>>> 
>>>> The reference counting is perhaps the most-used function of OBJ.  Here's a 
>>>> sample scenario:
>>>> 
>>>> /* allocate some memory, call the some_object_type "constructor",
>>>>   and set the reference count of "foo" to 1 */
>>>> foo = OBJ_NEW(some_object_type);
>>>> 
>>>> /* increment the reference count of foo (to 2) */
>>>> OBJ_RETAIN(foo);
>>>> 
>>>> /* increment the reference count of foo (to 3) */
>>>> OBJ_RETAIN(foo);
>>>> 
>>>> /* decrement the reference count of foo twice (3 -> 2 -> 1) */
>>>> OBJ_RELEASE(foo);
>>>> OBJ_RELEASE(foo);
>>>> 
>>>> /* decrement the reference count of foo to 0 -- which will
>>>>   call foo's "destructor" and then free the memory */
>>>> OBJ_RELEASE(foo);
>>>> 
>>>> The same principle works for structs on the stack -- we do the same 
>>>> constructor / destructor behavior, but just don't free the memory.  For 
>>>> example:
>>>> 
>>>> /* Instantiate the memory and call its "constructor" and set the
>>>>   ref count to 1 */
>>>> some_object_type foo;
>>>> OBJ_CONSTRUCT(&foo, some_object_type);
>>>> 
>>>> /* Increment and decrement the ref count */
>>>> OBJ_RETAIN(&foo);
>>>> OBJ_RETAIN(&foo);
>>>> OBJ_RELEASE(&foo);
>>>> OBJ_RELEASE(&foo);
>>>> 
>>>> /* The last RELEASE will call the destructor, but won't actually
>>>>   free the memory, because the memory was not allocated with 
>>>>   OBJ_NEW */
>>>> OBJ_RELEASE(&foo);
>>>> 
>>>> When the destructor is called, the OBJ system sets the magic number in the 
>>>> obj's memory to a sentinel value so that we know that the destructor has 
>>>> been called on this particular struct.  Hence, if we call OBJ_RELEASE 
>>>> *again* on a struct that has already had its ref count go to 0 (and 
>>>> therefore already had its destructor called), we get the assert error that 
>>>> you're seeing.
>>>> 
>>>> So to be totally clear: the assert error you're seeing is because some OBJ 
>>>> is (effectively) getting its ref count decremented below zero.  Which 
>>>> means it's trying to get destroyed twice.  Which means the ordering 
>>>> sequence of stuff in the ROMIO shutdown / MPI_FINALIZE is likely not right.
>>>> 
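
The double-release detection described above can be sketched with a toy reference-counted struct. This is an illustration under stated assumptions, not the real OBJ macros: the toy_* names are hypothetical, the magic constant is simply the one printed in the assertion message, and toy_release() returns -1 where a debug build would assert, so that the behavior can be observed without aborting. The caller owns the storage here, which is what makes it safe to inspect the magic field after destruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the assertion message; all toy_* names are hypothetical. */
#define TOY_MAGIC ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

typedef struct toy_object {
    uint64_t magic;   /* TOY_MAGIC while live, 0 once destroyed */
    int refcount;
    void (*destruct)(struct toy_object *);
} toy_object_t;

/* "Constructor": mark the object live with a reference count of 1. */
static void toy_construct(toy_object_t *obj, void (*destruct)(toy_object_t *))
{
    assert(obj != NULL);
    obj->magic = TOY_MAGIC;
    obj->refcount = 1;
    obj->destruct = destruct;
}

/* Returns the new count, or -1 if the object was already destroyed. */
static int toy_retain(toy_object_t *obj)
{
    assert(obj != NULL);
    if (obj->magic != TOY_MAGIC) {
        return -1;
    }
    return ++obj->refcount;
}

/* Returns the remaining count, or -1 if the object was already destroyed
   (the situation in which a debug build would hit the assertion). */
static int toy_release(toy_object_t *obj)
{
    assert(obj != NULL);
    if (obj->magic != TOY_MAGIC) {
        return -1;
    }
    if (--obj->refcount == 0) {
        obj->magic = 0;               /* clear the magic: object is dead */
        if (obj->destruct != NULL) {
            obj->destruct(obj);
        }
    }
    return obj->refcount;
}
```

With one construct and one retain, the first two releases return 1 and 0, and a third release returns -1: the same "destroyed twice" condition that the assertion in ompi_attr_delete reports.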
>>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> 
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 

