Great! I see in your other mail that you pulled something from MPICH2 to make this work.
Does that mean that there's an even-newer version of ROMIO that we should
pull in its entirety? It's a little risky to pull most stuff from one
released version of ROMIO and then more stuff from another released version.
Meaning: it's a little nicer/safer to say that we have ROMIO from a single
released version of MPICH2. If possible. :-)

Is it possible?

Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go
through so many hoops to get it ready. :-( But we should do it the best way
we can; we have history/precedent for taking ROMIO from a single
source/released version of MPICH[2], and I'd like to maintain that precedent
if at all possible.

On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:

> This problem of assertion is now solved by a patch in ROMIO just committed in
> http://bitbucket.org/devezep/new-romio-for-openmpi
>
> I don't know any other problem in this porting of ROMIO.
>
> Pascal
>
> Pascal Deveze wrote:
>> Jeff Squyres wrote:
>>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
>>>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv) {
>>>>     MPI_File fh;
>>>>     MPI_Info info, info_used;
>>>>
>>>>     MPI_Init(&argc,&argv);
>>>>
>>>>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR,
>>>>                   MPI_INFO_NULL, &fh);
>>>>     MPI_File_close(&fh);
>>>>
>>>>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR,
>>>>                   MPI_INFO_NULL, &fh);
>>>>     MPI_File_close(&fh);
>>>>
>>>>     MPI_Finalize();
>>>> }
>>>>
>>>> I run this program on one process:
>>>> salloc -p debug -n1 mpirun -np 1 ./a.out
>>>> And I get the assertion error:
>>>>
>>>> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion
>>>> `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *)
>>>> (keyval))->obj_magic_id' failed.
>>>> [cuzco10:24785] *** Process received signal ***
>>>> [cuzco10:24785] Signal: Aborted (6)
>>>
>>> Ok.
>>>
>>>> I saw that there is a problem with an MPI_COMM_SELF communicator.
>>>>
>>>> The problem disappears (and all ROMIO tests are OK) when I comment out
>>>> line 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
>>>> // MPI_Comm_free(&(fd->comm));
>>>>
>>>> The problem disappears (and all ROMIO tests are OK) when I comment out
>>>> line 425 in the file
>>>> ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
>>>> // MPI_Keyval_free(&keyval);
>>>>
>>>> The problem also disappears (but only 50% of the ROMIO tests are OK) when
>>>> I comment out line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>>>> // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
>>>> //                      ompi_mpi_comm_self.comm.c_keyhash);
>>>
>>> It sounds like there's a problem with the ordering of shutdown of things in
>>> MPI_FINALIZE w.r.t. ROMIO.
>>>
>>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay
>>> for using a 3rd party package. One very, very important abstraction that
>>> we have is that top-level MPI API functions are not allowed to call any
>>> other MPI API functions. E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot
>>> call MPI_Isend (i.e., ompi/mpi/c/isend.c). MPI_Send *can* call the same
>>> back-end implementation functions that isend does -- it's just not allowed
>>> to call MPI_<foo>.
>>>
>>> The reason is that the top-level MPI API functions do things like check for
>>> whether MPI_INIT / MPI_FINALIZE have been called, etc. The back-end
>>> functions do not do this. Additionally, top-level MPI API functions may be
>>> overridden via PMPI kinds of things. We wouldn't want our internal library
>>> calls to get intercepted by user code.
>>>
>>>> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and so
>>>> far I do not understand the real origin of that problem.
>>>
>>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design. Waaaay back in
>>> the beginning of the project, we debated whether to use C or C++ for
>>> developing the code. There was a desire to use some of the basic object
>>> functionality of C++ (e.g., derived classes, constructors, destructors,
>>> etc.), but we wanted to stay as portable as possible. So we ended up going
>>> with C, but with a few macros that emulate some C++-like functionality.
>>> This led to OMPI's OBJ system that is used all over the place.
>>>
>>> The OBJ system does several things:
>>>
>>> - allows you to have "constructor"- and "destructor"-like behavior for
>>>   structs
>>> - works for both stack and heap memory
>>> - reference counting
>>>
>>> The reference counting is perhaps the most-used function of OBJ. Here's a
>>> sample scenario:
>>>
>>> /* allocate some memory, call the some_object_type "constructor",
>>>    and set the reference count of "foo" to 1 */
>>> foo = OBJ_NEW(some_object_type);
>>>
>>> /* increment the reference count of foo (to 2) */
>>> OBJ_RETAIN(foo);
>>>
>>> /* increment the reference count of foo (to 3) */
>>> OBJ_RETAIN(foo);
>>>
>>> /* decrement the reference count of foo twice (to 2, then to 1) */
>>> OBJ_RELEASE(foo);
>>> OBJ_RELEASE(foo);
>>>
>>> /* decrement the reference count of foo to 0 -- which will
>>>    call foo's "destructor" and then free the memory */
>>> OBJ_RELEASE(foo);
>>>
>>> The same principle works for structs on the stack -- we do the same
>>> constructor / destructor behavior, but just don't free the memory. For
>>> example:
>>>
>>> /* Instantiate the memory and call its "constructor" and set the
>>>    ref count to 1 */
>>> some_object_type foo;
>>> OBJ_CONSTRUCT(&foo, some_object_type);
>>>
>>> /* Increment and decrement the ref count */
>>> OBJ_RETAIN(&foo);
>>> OBJ_RETAIN(&foo);
>>> OBJ_RELEASE(&foo);
>>> OBJ_RELEASE(&foo);
>>>
>>> /* The last RELEASE will call the destructor, but won't actually
>>>    free the memory, because the memory was not allocated with
>>>    OBJ_NEW */
>>> OBJ_RELEASE(&foo);
>>>
>>> When the destructor is called, the OBJ system sets the magic number in the
>>> obj's memory to a sentinel value so that we know that the destructor has
>>> been called on this particular struct. Hence, if we call OBJ_RELEASE
>>> *again* on a struct that has already had its ref count go to 0 (and
>>> therefore already had its destructor called), we get the assert error that
>>> you're seeing.
>>>
>>> So to be totally clear: the assert error you're seeing is because some OBJ
>>> is (effectively) getting its ref count decremented below zero. Which means
>>> it's trying to get destroyed twice. Which means the ordering sequence of
>>> stuff in the ROMIO shutdown / MPI_FINALIZE is likely not right.
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
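
The RETAIN/RELEASE life cycle described in the quoted message can be sketched in plain C. This is only a minimal illustration and is not OMPI's actual OBJ implementation: the names obj_t, obj_construct, obj_retain, obj_release, and demo_double_release are hypothetical, the sentinel simply reuses the constant visible in the quoted assertion, and only caller-owned (stack-style) storage is covered. Where the real code asserts, this sketch returns -1 so the double release can be observed without aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "live object" marker; the value mirrors the constant that
   appears in the quoted assertion message. */
#define OBJ_MAGIC_LIVE ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

typedef struct obj {
    uint64_t magic;                    /* OBJ_MAGIC_LIVE while constructed */
    int refcount;                      /* number of outstanding references */
    void (*destructor)(struct obj *);  /* optional cleanup hook, may be NULL */
    int destructor_calls;              /* bookkeeping for this sketch only */
} obj_t;

/* "Constructor": mark the object live and set its ref count to 1. */
void obj_construct(obj_t *o, void (*dtor)(obj_t *))
{
    assert(o != NULL);
    o->magic = OBJ_MAGIC_LIVE;
    o->refcount = 1;
    o->destructor = dtor;
    o->destructor_calls = 0;
}

/* Increment the ref count. Returns -1 if the object is no longer live. */
int obj_retain(obj_t *o)
{
    if (o->magic != OBJ_MAGIC_LIVE) {
        return -1;
    }
    o->refcount++;
    return 0;
}

/* Decrement the ref count; at 0, run the destructor and clear the magic.
   Returns -1 if the object was already destroyed (the case in which the
   quoted assertion fires). */
int obj_release(obj_t *o)
{
    if (o->magic != OBJ_MAGIC_LIVE) {
        return -1;
    }
    if (--o->refcount == 0) {
        if (o->destructor != NULL) {
            o->destructor(o);
        }
        o->destructor_calls++;
        o->magic = 0;  /* sentinel: the destructor has already run */
    }
    return 0;
}

/* One release too many, as in the failing shutdown sequence. */
int demo_double_release(void)
{
    obj_t foo;
    obj_construct(&foo, NULL);  /* ref count 1 */
    obj_retain(&foo);           /* ref count 2 */
    obj_release(&foo);          /* ref count 1 */
    obj_release(&foo);          /* ref count 0: destructor runs */
    return obj_release(&foo);   /* extra release is detected: -1 */
}
```

In this sketch the extra release is reported through a return value; an assertion on the magic field at the same spot would reproduce the abort shown in the quoted output.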