On Jan 13, 2011, at 14:08, Jeff Squyres wrote:

> Great!
>
> I see in your other mail that you pulled something from MPICH2 to make this
> work.
>
> Does that mean that there's an even-newer version of ROMIO that we should pull
> in its entirety?  It's a little risky to pull most stuff from one released
> version of ROMIO and then more stuff from another released version.  Meaning:
> it's a little nicer/safer to say that we have ROMIO from a single released
> version of MPICH2.
My understanding is that the MPICH guys provided a patch for the MPI attribute
issue. As such the version here is the most up to date.

  george.

>
> If possible.  :-)
>
> Is it possible?
>
> Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go
> through so many hoops to get it ready.  :-(  But we should do it the best way
> we can; we have history/precedent for taking ROMIO from a single
> source/released version of MPICH[2], and I'd like to maintain that precedent
> if at all possible.
>
>
> On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote:
>
>> This assertion problem is now solved by a patch in ROMIO just committed in
>> http://bitbucket.org/devezep/new-romio-for-openmpi
>>
>> I don't know of any other problem in this port of ROMIO.
>>
>> Pascal
>>
>> Pascal Deveze wrote:
>>> Jeff Squyres wrote:
>>>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:
>>>>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char **argv) {
>>>>>     MPI_File fh;
>>>>>     MPI_Info info, info_used;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     /* First open/close cycle */
>>>>>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR,
>>>>>                   MPI_INFO_NULL, &fh);
>>>>>     MPI_File_close(&fh);
>>>>>
>>>>>     /* Second open/close cycle on the same file */
>>>>>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR,
>>>>>                   MPI_INFO_NULL, &fh);
>>>>>     MPI_File_close(&fh);
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> I run this program on one process:
>>>>>     salloc -p debug -n1 mpirun -np 1 ./a.out
>>>>> and I get the assertion error:
>>>>>
>>>>> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion
>>>>> `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *)
>>>>> (keyval))->obj_magic_id' failed.
>>>>> [cuzco10:24785] *** Process received signal ***
>>>>> [cuzco10:24785] Signal: Aborted (6)
>>>>>
>>>>
>>>> Ok.
>>>>
>>>>> I saw that there is a problem with an MPI_COMM_SELF communicator.
>>>>>
>>>>> The problem disappears (and all ROMIO tests are OK) when I comment out
>>>>> line 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
>>>>> // MPI_Comm_free(&(fd->comm));
>>>>>
>>>>> The problem disappears (and all ROMIO tests are OK) when I comment out
>>>>> line 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
>>>>> // MPI_Keyval_free(&keyval);
>>>>>
>>>>> The problem also disappears (but only 50% of the ROMIO tests are OK) when
>>>>> I comment out line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>>>>> // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
>>>>> //                      ompi_mpi_comm_self.comm.c_keyhash);
>>>>>
>>>>
>>>> It sounds like there's a problem with the ordering of shutdown of things
>>>> in MPI_FINALIZE w.r.t. ROMIO.
>>>>
>>>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay
>>>> for using a 3rd party package.  One very, very important abstraction that
>>>> we have is that top-level MPI API functions are not allowed to call any
>>>> other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot
>>>> call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same
>>>> back-end implementation functions that isend does -- it's just not allowed
>>>> to call MPI_<foo>.
>>>>
>>>> The reason is that the top-level MPI API functions do things like check
>>>> for whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end
>>>> functions do not do this.  Additionally, top-level MPI API functions may
>>>> be overridden via PMPI kinds of things.  We wouldn't want our internal
>>>> library calls to get intercepted by user code.
>>>>
>>>>> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and so
>>>>> far I do not understand the real origin of that problem.
>>>>>
>>>>
>>>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design.
>>>> Waaaay back in the beginning of the project, we debated whether to use
>>>> C or C++ for developing the code.  There was a desire to use some of the
>>>> basic object functionality of C++ (e.g., derived classes, constructors,
>>>> destructors, etc.), but we wanted to stay as portable as possible.  So we
>>>> ended up going with C, but with a few macros that emulate some C++-like
>>>> functionality.  This led to OMPI's OBJ system that is used all over the
>>>> place.
>>>>
>>>> The OBJ system does several things:
>>>>
>>>> - allows you to have "constructor"- and "destructor"-like behavior for
>>>>   structs
>>>> - works for both stack and heap memory
>>>> - reference counting
>>>>
>>>> The reference counting is perhaps the most-used function of OBJ.  Here's a
>>>> sample scenario:
>>>>
>>>>     /* allocate some memory, call the some_object_type "constructor",
>>>>        and set the reference count of "foo" to 1 */
>>>>     foo = OBJ_NEW(some_object_type);
>>>>
>>>>     /* increment the reference count of foo (to 2) */
>>>>     OBJ_RETAIN(foo);
>>>>
>>>>     /* increment the reference count of foo (to 3) */
>>>>     OBJ_RETAIN(foo);
>>>>
>>>>     /* decrement the reference count of foo twice (to 2, then to 1) */
>>>>     OBJ_RELEASE(foo);
>>>>     OBJ_RELEASE(foo);
>>>>
>>>>     /* decrement the reference count of foo to 0 -- which will
>>>>        call foo's "destructor" and then free the memory */
>>>>     OBJ_RELEASE(foo);
>>>>
>>>> The same principle works for structs on the stack -- we do the same
>>>> constructor / destructor behavior, but just don't free the memory.
>>>> For example:
>>>>
>>>>     /* Instantiate the memory and call its "constructor" and set the
>>>>        ref count to 1 */
>>>>     some_object_type foo;
>>>>     OBJ_CONSTRUCT(&foo, some_object_type);
>>>>
>>>>     /* Increment the ref count twice (to 3), then decrement it twice
>>>>        (back to 1) */
>>>>     OBJ_RETAIN(&foo);
>>>>     OBJ_RETAIN(&foo);
>>>>     OBJ_RELEASE(&foo);
>>>>     OBJ_RELEASE(&foo);
>>>>
>>>>     /* The last RELEASE will call the destructor, but won't actually
>>>>        free the memory, because the memory was not allocated with
>>>>        OBJ_NEW */
>>>>     OBJ_RELEASE(&foo);
>>>>
>>>> When the destructor is called, the OBJ system sets the magic number in the
>>>> obj's memory to a sentinel value so that we know that the destructor has
>>>> been called on this particular struct.  Hence, if we call OBJ_RELEASE
>>>> *again* on a struct that has already had its ref count go to 0 (and
>>>> therefore already had its destructor called), we get the assert error that
>>>> you're seeing.
>>>>
>>>> So to be totally clear: the assert error you're seeing is because some OBJ
>>>> is (effectively) getting its ref count decremented below zero.  Which
>>>> means it's trying to get destroyed twice.  Which means the ordering
>>>> sequence of stuff in the ROMIO shutdown / MPI_FINALIZE is likely not right.
>>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
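
The double-release failure discussed in this thread can be reproduced in
miniature.  What follows is a minimal sketch in plain C, not OMPI's actual
OBJ code: every sketch_* name is hypothetical, the "live" magic value is
copied from the assertion text quoted above, and the "destroyed" value of 0
is an assumption.  Instead of asserting, the sketch returns -1 at the point
where the check would fail, so the effect can be observed without aborting.

```c
#include <stdint.h>

/* Live-object marker, copied from the assertion text in the thread. */
#define SKETCH_MAGIC_LIVE ((0xdeafbeedULL << 32) + 0xdeafbeedULL)
/* Marker written by the "destructor"; the value 0 is an assumption. */
#define SKETCH_MAGIC_DEAD 0ULL

/* Hypothetical stand-in for an object header; not opal_object_t. */
typedef struct sketch_object {
    uint64_t magic_id;       /* SKETCH_MAGIC_LIVE while the object is valid */
    int      ref_count;      /* number of outstanding references */
    int      destruct_calls; /* how many times the "destructor" has run */
} sketch_object_t;

/* "Constructor": mark the object live and set the ref count to 1. */
static void sketch_construct(sketch_object_t *obj)
{
    obj->magic_id = SKETCH_MAGIC_LIVE;
    obj->ref_count = 1;
    obj->destruct_calls = 0;
}

/* Increment the ref count; returns the new count, or -1 if the object
   has already been destroyed. */
static int sketch_retain(sketch_object_t *obj)
{
    if (obj->magic_id != SKETCH_MAGIC_LIVE) {
        return -1;
    }
    return ++obj->ref_count;
}

/* Decrement the ref count; when it reaches 0, run the "destructor" and
   overwrite the magic number.  Returns the new count, or -1 if the object
   was already destroyed (the case that trips the assertion in the thread). */
static int sketch_release(sketch_object_t *obj)
{
    if (obj->magic_id != SKETCH_MAGIC_LIVE) {
        return -1;
    }
    if (--obj->ref_count == 0) {
        obj->destruct_calls++;
        obj->magic_id = SKETCH_MAGIC_DEAD;
    }
    return obj->ref_count;
}

/* Replays the stack scenario from the thread, then releases once more.
   Returns the result of that extra release. */
int sketch_demo(void)
{
    sketch_object_t foo;

    sketch_construct(&foo);   /* ref count 1 */
    sketch_retain(&foo);      /* 2 */
    sketch_retain(&foo);      /* 3 */
    sketch_release(&foo);     /* 2 */
    sketch_release(&foo);     /* 1 */
    sketch_release(&foo);     /* 0: destructor runs, magic overwritten */

    return sketch_release(&foo);  /* one release too many */
}
```

Calling sketch_demo() from a main() returns -1: the seventh call finds the
magic number already overwritten.  The object lives on the stack here so that
inspecting it after destruction stays well-defined; doing the same with heap
memory that has already been freed would be undefined behavior.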