Re: [OMPI devel] RFC: Bring the lastest ROMIO version from MPICH2-1.3 into the trunk

Jeff Squyres Thu, 13 Jan 2011 14:08:10 -0500

Great!

I see in your other mail that you pulled something from MPICH2 to make this 
work.


Does that mean that there's a even-newer version of ROMIO that we should pull 
in its entirety?  It's a little risky to pull most stuff from one released 
version of ROMIO and then more stuff from another released version.  Meaning: 
it's little nicer/safer to say that we have ROMIO from a single released 
version of MPICH2.  

If possible.  :-)

Is it possible?

Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go 
through so many hoops to get it ready.  :-(  But we should do it the best way 
we can; we have history/precedent for taking ROMIO from a single 
source/released version of MPICH[2], and I'd like to maintain that precedent if 
at all possible.


On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:

> This problem of assertion is now solved by a patch in ROMIO just commited in 
> http://bitbucket.org/devezep/new-romio-for-openmpi
> 
> I don't know any other problem in this porting of ROMIO.
> 
> Pascal
> 
> Pascal Deveze a écrit :
>> Jeff Squyres a écrit :
>>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
>>> 
>>>  
>>> 
>>>> int main(int argc, char **argv) {
>>>>   MPI_File fh;
>>>>   MPI_Info info, info_used;
>>>> 
>>>>   MPI_Init(&argc,&argv);
>>>> 
>>>>   MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
>>>> MPI_INFO_NULL, &fh);
>>>>   MPI_File_close(&fh);
>>>> 
>>>>   MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, 
>>>> MPI_INFO_NULL, &fh);
>>>>   MPI_File_close(&fh);
>>>> 
>>>>   MPI_Finalize();
>>>> }
>>>> 
>>>> I run this programon one process : salloc -p debug  -n1 mpirun -np 1 
>>>> ./a.out
>>>> And I get teh assertion error:
>>>> 
>>>> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion 
>>>> `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
>>>> (keyval))->obj_magic_id' failed.
>>>> [cuzco10:24785] *** Process received signal ***
>>>> [cuzco10:24785] Signal: Aborted (6)
>>>>     
>>>> 
>>> 
>>> Ok.
>>> 
>>>   
>>> 
>>>> I saw that there is a problem with an MPI_COMM_SELF communicator.
>>>> 
>>>> The problem disappears (and all ROMIO tests are OK) when I comment line 89 
>>>> in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
>>>>      // MPI_Comm_free(&(fd->comm));
>>>> 
>>>> The problem disappears (and all ROMIO tests are OK) when I comment line 
>>>> 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
>>>>    //  MPI_Keyval_free(&keyval);
>>>> 
>>>> The problem also disappears (but only 50% of the ROMIO tests are OK) when 
>>>> I comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>>>>       // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
>>>>      //                             ompi_mpi_comm_self.comm.c_keyhash);
>>>>     
>>>> 
>>> 
>>> It sounds like there's a problem with the ordering of shutdown of things in 
>>> MPI_FINALIZE w.r.t. ROMIO.
>>> 
>>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay 
>>> for using a 3rd party package.  One very, very important abstraction that 
>>> we have is that no top-level MPI API functions are not allowed to call any 
>>> other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot 
>>> call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same 
>>> back-end implementation functions that isend does -- it's just not allowed 
>>> to call MPI_<foo>.
>>> 
>>> The reason is that the top-level MPI API functions do things like check for 
>>> whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end 
>>> functions do not do this.  Additionally, top-level MPI API functions may be 
>>> overridden via PMPI kinds of things.  We wouldn't want our internal library 
>>> calls to get intercepted by user code.
>>> 
>>>   
>>> 
>>>> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till 
>>>> now I do not understand what is the real origin of that problem.
>>>>     
>>>> 
>>> 
>>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Waaaay back in 
>>> the beginning of the project, we debated whether to use C or C++ for 
>>> developing the code.  There was a desire to use some of the basic object 
>>> functionality of C++ (e.g., derived classes, constructors, destructors, 
>>> etc.), but we wanted to stay as portable as possible.  So we ended up going 
>>> with C, but with a few macros that emulate some C++-like functionality.  
>>> This led to OMPI's OBJ system that is used all over the place.  
>>> 
>>> The OBJ system does several things:
>>> 
>>> - allows you to have "constructor"- and "destructor"-like behavior for 
>>> structs
>>> - works for both stack and heap memory
>>> - reference counting
>>> 
>>> The reference counting is perhaps the most-used function of OBJ.  Here's a 
>>> sample scenario:
>>> 
>>> /* allocate some memory, call the some_object_type "constructor",
>>>    and set the reference count of "foo" to 1 */
>>> foo = OBJ_NEW(some_object_type);
>>> 
>>> /* increment the reference count of foo (to 2) */
>>> OBJ_RETAIN(foo);
>>> 
>>> /* increment the reference count of foo (to 3) */
>>> OBJ_RETAIN(foo);
>>> 
>>> /* decrement the reference count of foo (to 1) */
>>> OBJ_RELEASE(foo);
>>> OBJ_RELEASE(foo);
>>> 
>>> /* decrement the reference count of foo to 0 -- which will
>>>    call foo's "destructor" and then free the memory */
>>> OBJ_RELEASE(foo);
>>> 
>>> The same principle works for structs on the stack -- we do the same 
>>> constructor / destructor behavior, but just don't free the memory.  For 
>>> example:
>>> 
>>> /* Instantiate the memory and call its "constructor" and set the
>>>    ref count to 1 */
>>> some_object_type foo;
>>> OBJ_CONSTRUCT(&foo, some_object_type);
>>> 
>>> /* Increment and decrement the ref count */
>>> OBJ_RETAIN(&foo);
>>> OBJ_RETAIN(&foo);
>>> OBJ_RELEASE(&foo);
>>> OBJ_RELEASE(&foo);
>>> 
>>> /* The last RELEASE will call the destructor, but won't actually
>>>    free the memory, because the memory was not allocated with 
>>>    OBJ_NEW */
>>> OBJ_RELEASE(&foo);
>>> 
>>> When the destructor is called, the OBJ system sets the magic number in the 
>>> obj's memory to a sentinel value so that we know that the destructor has 
>>> been called on this particular struct.  Hence, if we call OBJ_RELEASE 
>>> *again* on a struct that has already had its ref count go to 0 (and 
>>> therefore already had its destructor called), we get the assert error that 
>>> you're seeing.
>>> 
>>> So to be totally clear: the assert error you're seeing is because some OBJ 
>>> is (effectively) getting its ref count decremented below zero.  Which means 
>>> it's trying to get destroyed twice.  Which means the ordering sequence of 
>>> stuff in the ROMIO shutdown / MPI_FINALIZE is likely not right.
>>>   
>>> 
>> 
>> _______________________________________________
>> devel mailing list
>> 
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI devel] RFC: Bring the lastest ROMIO version from MPICH2-1.3 into the trunk

Reply via email to