Re: [OMPI devel] OMPI 1.4.3 hangs in gather
Try manually specifying the collective component ("-mca coll tuned"). You seem to be using the "sync" collective component; are there any stale MCA param files lying around?

--Nysal

On Tue, Jan 11, 2011 at 6:28 PM, Doron Shoham wrote:
> Hi
>
> All machines on the setup are IDataPlex with Nehalem, 12 cores per node, 24GB memory.
>
> Problem 1 - OMPI 1.4.3 hangs in gather:
>
> I'm trying to run IMB with the gather operation and OMPI 1.4.3 (vanilla).
> The hang happens when np >= 64 and the message size exceeds 4k:
>
> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>
> voltairenodes consists of 64 machines.
>
> #
> # Benchmarking Gather
> # #processes = 64
> #
>    #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>         0          1000         0.02         0.02         0.02
>         1           331        14.02        14.16        14.09
>         2           331        12.87        13.08        12.93
>         4           331        14.29        14.43        14.34
>         8           331        16.03        16.20        16.11
>        16           331        17.54        17.74        17.64
>        32           331        20.49        20.62        20.53
>        64           331        23.57        23.84        23.70
>       128           331        28.02        28.35        28.18
>       256           331        34.78        34.88        34.80
>       512           331        46.34        46.91        46.60
>      1024           331        63.96        64.71        64.33
>      2048           331       460.67       465.74       463.18
>      4096           331       637.33       643.99       640.75
>
> This is the padb output (padb -A -x -Ormgr=mpirun -tree):
>
> Warning, remote process state differs across ranks
> state : ranks
> R (running)  : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
> Stack trace(s) for thread: 1
> -
> [0-63] (64 processes)
> -
> main() at ?:?
> IMB_init_buffers_iter() at ?:?
> IMB_gather() at ?:?
> PMPI_Gather() at pgather.c:175
> mca_coll_sync_gather() at coll_sync_gather.c:46
> ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
> -
> [0,3-63] (62 processes)
> -
> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
> mca_pml_ob1_recv() at pml_ob1_irecv.c:104
> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
> opal_condition_wait() at ../../../../opal/threads/condition.h:99
> -
> [1] (1 processes)
> -
> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
> mca_pml_ob1_send() at pml_ob1_isend.c:125
> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
> opal_condition_wait() at ../../../../opal/threads/condition.h:99
> -
> [2] (1 processes)
> -
> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
> ompi_request_default_wait() at request/req_wait.c:37
> ompi_request_wait_completion() at ../ompi/request/request.h:375
> opal_condition_wait() at ../opal/threads/condition.h:99
> Stack trace(s) for thread: 2
> -
> [0-63] (64 processes)
> -
> start_thread() at ?:?
> btl_openib_async_thread() at btl_openib_async.c:344
> poll() at ?:?
> Stack trace(s) for thread: 3
> -
> [0-63] (64 processes)
> -
> start_thread() at ?:?
> service_thread_start() at btl_openib_fd.c:427
> select() at ?:?
>
> When running padb again after a couple of minutes, I can see that the total number of processes at each position stays the same, but different processes are at the different positions.
> For example, this is the diff between two padb outputs:
>
> Warning, remote process state differs across ranks
> state : ranks
> -R (running)  : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
> +R (running)  : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
Re: [OMPI devel] Back-porting components from SVN trunk to v1.5 branch
For the moment, that's true. Abhishek's working on bringing over SOS and the notifier...

On Jan 12, 2011, at 5:57 PM, Ralph Castain wrote:

> You also have to remove all references to OPAL SOS...
>
> On Jan 12, 2011, at 1:25 PM, Jeff Squyres wrote:
>
>> I back-ported the trunk's paffinity/hwloc component to the v1.5 branch today. Here are the things that you need to look out for if you undertake back-porting a component from the trunk to the v1.5 branch...
>>
>> Remember: the whole autogen.pl infrastructure was not (and will not be) ported to the v1.5 branch. So there are some things that you need to change in your component's build system:
>>
>> - You need to add a configure.params file
>> - In your component's configure.m4 file:
>>
>>   - Rename your m4 define from MCACONFIG to MCA___CONFIG
>>   - Same for _POST_CONFIG
>>   - Remove AC_CONFIG_FILES (they should now be in configure.params)
>>   - We renamed a few m4 macros on the trunk; e.g., it's OPAL_VAR_SCOPE_PUSH on the trunk and OMPI_VAR_SCOPE_PUSH on v1.5. So if you run "configure" and it says that commands are not found and they're un-expanded m4 names, look to see if they have changed names.
>>
>> - In your component's Makefile.am:
>>
>>   - Rename any "if" variables from the form MCA_BUILDDSO to OMPI_BUILD___DSO
>>
>> I think those are the main points to watch out for.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Problem with attributes attached to communicators
A new patch in ROMIO solves this problem. Thanks to Dave.

Pascal

Dave Goodell wrote:

Hmm... Apparently I was too optimistic about my untested patch. I'll work with Rob this afternoon to straighten this out.

-Dave

On Jan 10, 2011, at 5:53 AM CST, Pascal Deveze wrote:

Dave,

Your proposed patch does not work when the call to MPI_File_open() is done on MPI_COMM_SELF. For example, with the romio test program "simple.c", I got the fatal error:

mpirun -np 1 ./simple -fname /tmp//TEST
Fatal error in MPI_Attr_put: Invalid keyval, error stack:
MPI_Attr_put(131): MPI_Attr_put(comm=0x8400, keyval=603979776, attr_value=0x2279fa0) failed
MPI_Attr_put(89).: Attribute key was MPI_KEYVAL_INVALID
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

Pascal

Dave Goodell wrote:

Try this (untested) patch instead:

-Dave

On Jan 7, 2011, at 3:50 AM CST, Rob Latham wrote:

Hi Pascal. I'm really happy that you have been working with the OpenMPI folks to re-sync romio. I meant to ask you how that work was progressing, so thanks for the email! I need to copy Dave Goodell on this conversation because he helped me understand the keyval issues when we last worked on this two years ago.

Dave, some background: we added some code in ROMIO to address ticket 222: http://trac.mcs.anl.gov/projects/mpich2/ticket/222. But that code apparently makes OpenMPI unhappy. I think when we talked about this, I remember it came down to a, shall we say, different interpretation of the standard between MPICH2 and OpenMPI.

In case it's not clear from the nesting of messages, here's Pascal's extraction of the ROMIO keyval code: http://www.open-mpi.org/community/lists/devel/2011/01/8837.php and here's the OpenMPI developer's response: http://www.open-mpi.org/community/lists/devel/2011/01/8838.php

I think this is related to a discussion I had a couple of years ago: http://www.open-mpi.org/community/lists/users/2009/03/8409.php

So, to eventually answer your question: yes, I do have some remarks, but I have no answers. It's been a couple of years since I added those frees...

==rob

On Fri, Jan 07, 2011 at 09:47:17AM +0100, Pascal Deveze wrote:

Hi Rob,

As you perhaps remember, I was porting ROMIO to OpenMPI. The job is nearly finished; I only have a problem with the allocation/deallocation of the keyval (cb_config_list_keyval in adio/common/cb_config_list.c). As the algorithm runs on MPICH2, I asked for help on the de...@open-mpi.org mailing list. I just received the following answer from George Bosilca.

The solution I found to run ROMIO with OpenMPI is to delete the line
  MPI_Keyval_free(&keyval);
in the function ADIOI_cb_delete_name_array (romio/adio/common/cb_config_list.c).

Do you have any remarks about that?

Regards,
Pascal

 Original message 
Subject: Re: [OMPI devel] Problem with attributes attached to communicators
Date: Thu, 6 Jan 2011 13:15:14 -0500
From: George Bosilca
Reply-To: Open MPI Developers
To: Open MPI Developers
References: <4d25daf9.3070...@bull.net>

MPI_Comm_create_keyval and MPI_Comm_free_keyval are the functions you should use in order to be MPI 2.2 compliant.

Based on my understanding of the MPI standard, your application is incorrect, and therefore the MPICH behavior is incorrect. The delete function is not there for you to delete the keyval (!) but to delete the attribute.
Here is what the MPI standard states about this:

Note that it is not erroneous to free an attribute key that is in use, because the actual free does not transpire until after all references (in other communicators on the process) to the key have been freed. These references need to be explicitly freed by the program, either via calls to MPI_COMM_DELETE_ATTR that free one attribute instance, or by calls to MPI_COMM_FREE that free all attribute instances associated with the freed communicator.

george.

On Jan 6, 2011, at 10:08, Pascal Deveze wrote:

I have a problem finishing the porting of ROMIO into Open MPI. It is related to the routine MPI_Comm_dup together with MPI_Keyval_create, MPI_Keyval_free, MPI_Attr_get and MPI_Attr_put.

Here is a simple program that reproduces my problem:

===
#include <stdio.h>
#include "mpi.h"

int copy_fct(MPI_Comm comm, int keyval, void *extra, void *attr_in, void **attr_out, int *flag)
{
    return MPI_SUCCESS;
}

int delete_fct(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    MPI_Keyval_free(&keyval);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int i, found, attribute_val = 100, keyval = MPI_KEYVAL_INVALID;
    MPI_Comm dupcomm;

    MPI_Init(&argc, &argv);

    for (i = 0; i < 100; i++) {
        /* This simulates the MPI_File_open() */
        if (keyval == MPI_KEYVAL_INVALID) {
            MPI_Keyval_create((MPI_Copy_function *) copy_fct,
                              (MPI_Delete_function *) delete_fct,
                              &keyval, NULL);
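To make George's point concrete, here is a minimal sketch (an illustration only; it is not Pascal's reproducer above and not the ROMIO code) using the MPI-2.2 calls he recommends. The delete callback touches only the attribute; the keyval is freed exactly once, outside the callback, and that free is deferred by the MPI library until the last attribute instance is gone (here, when MPI_Finalize cleans up the attributes attached to MPI_COMM_SELF):

  #include <mpi.h>

  /* Delete callback: clean up the attribute value only.
     Do NOT free the keyval from inside this function. */
  static int delete_fct(MPI_Comm comm, int keyval, void *attr_val, void *extra_state)
  {
      return MPI_SUCCESS;
  }

  int main(int argc, char **argv)
  {
      int keyval;
      static int value = 42;

      MPI_Init(&argc, &argv);

      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_fct, &keyval, NULL);
      MPI_Comm_set_attr(MPI_COMM_SELF, keyval, &value);

      /* Legal even though the attribute is still attached: per the text
         quoted above, the actual free is deferred until every reference
         to the key has been freed. */
      MPI_Comm_free_keyval(&keyval);   /* keyval becomes MPI_KEYVAL_INVALID */

      /* MPI_Finalize deletes the attributes attached to MPI_COMM_SELF,
         which invokes delete_fct; only then is the key really released. */
      MPI_Finalize();
      return 0;
  }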
Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk
This assertion problem is now solved by a patch in ROMIO, just committed in http://bitbucket.org/devezep/new-romio-for-openmpi

I don't know of any other problem in this porting of ROMIO.

Pascal

Pascal Deveze wrote:

Jeff Squyres wrote:

On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:

> int main(int argc, char **argv) {
>     MPI_File fh;
>     MPI_Info info, info_used;
>
>     MPI_Init(&argc, &argv);
>
>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
>     MPI_File_close(&fh);
>
>     MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
>     MPI_File_close(&fh);
>
>     MPI_Finalize();
> }
>
> I run this program on one process: salloc -p debug -n1 mpirun -np 1 ./a.out
> And I get the assertion error:
>
> a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (keyval))->obj_magic_id' failed.
> [cuzco10:24785] *** Process received signal ***
> [cuzco10:24785] Signal: Aborted (6)

Ok.

> I saw that there is a problem with an MPI_COMM_SELF communicator.
>
> The problem disappears (and all ROMIO tests are OK) when I comment out line 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c:
>     // MPI_Comm_free(&(fd->comm));
>
> The problem disappears (and all ROMIO tests are OK) when I comment out line 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c:
>     // MPI_Keyval_free(&keyval);
>
> The problem also disappears (but only 50% of the ROMIO tests are OK) when I comment out line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
>     // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
>     //                      ompi_mpi_comm_self.comm.c_keyhash);

It sounds like there's a problem with the ordering of shutdown of things in MPI_FINALIZE w.r.t. ROMIO.

FWIW: ROMIO violates some of our abstractions, but it's the price we pay for using a 3rd party package. One very, very important abstraction that we have is that no top-level MPI API function is allowed to call any other MPI API function. E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot call MPI_Isend (i.e., ompi/mpi/c/isend.c). MPI_Send *can* call the same back-end implementation functions that isend does -- it's just not allowed to call the MPI_* API functions.

The reason is that the top-level MPI API functions do things like check whether MPI_INIT / MPI_FINALIZE have been called, etc. The back-end functions do not do this. Additionally, top-level MPI API functions may be overridden via PMPI kinds of things. We wouldn't want our internal library calls to get intercepted by user code.

> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till now I do not understand what is the real origin of that problem.

RETAIN/RELEASE is part of OMPI's "poor man's C++" design. Way back in the beginning of the project, we debated whether to use C or C++ for developing the code. There was a desire to use some of the basic object functionality of C++ (e.g., derived classes, constructors, destructors, etc.), but we wanted to stay as portable as possible. So we ended up going with C, but with a few macros that emulate some C++-like functionality. This led to OMPI's OBJ system that is used all over the place.

The OBJ system does several things:

- allows you to have "constructor"- and "destructor"-like behavior for structs
- works for both stack and heap memory
- reference counting

The reference counting is perhaps the most-used function of OBJ.
Here's a sample scenario:

  /* allocate some memory, call the some_object_type "constructor",
     and set the reference count of "foo" to 1 */
  foo = OBJ_NEW(some_object_type);
  /* increment the reference count of foo (to 2) */
  OBJ_RETAIN(foo);
  /* increment the reference count of foo (to 3) */
  OBJ_RETAIN(foo);
  /* decrement the reference count of foo (to 1) */
  OBJ_RELEASE(foo);
  OBJ_RELEASE(foo);
  /* decrement the reference count of foo to 0 -- which will call
     foo's "destructor" and then free the memory */
  OBJ_RELEASE(foo);

The same principle works for structs on the stack -- we do the same constructor / destructor behavior, but just don't free the memory. For example:

  /* Instantiate the memory and call its "constructor" and set the ref count to 1 */
  some_object_type foo;
  OBJ_CONSTRUCT(&foo, some_object_type);

  /* Increment and decrement the ref count */
  OBJ_RETAIN(&foo);
  OBJ_RETAIN(&foo);
  OBJ_RELEASE(&foo);
  OBJ_RELEASE(&foo);

  /* The last RELEASE will call the destructor, but won't actually free the
     memory, because the memory was not allocated with OBJ_NEW */
  OBJ_RELEASE(&foo);

When the destructor is called, the OBJ system sets the magic number in the obj's memory to a sentinel value so that we know that the destructor has been called on this particular struct. Hence, if we call OBJ_RELEASE *again* on a struct that has already had its ref count go to 0 (and therefore already had its destructor called), we get an assertion failure on that magic number -- exactly the kind of obj_magic_id assert shown above.
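To see the failure mode in isolation, here is a small toy sketch (my own illustration; the names toy_obj_t, toy_new, toy_retain and toy_release are made up, and this is not OMPI's actual OBJ implementation) of a reference-counted object with a magic-number sentinel. Once the count hits zero, the "destructor" overwrites the sentinel, so one RELEASE too many trips an assert, just like the obj_magic_id assertion in attribute.c quoted earlier:

  #include <assert.h>
  #include <stdint.h>
  #include <stdlib.h>

  #define TOY_MAGIC 0xdeafbeedULL   /* sentinel, echoing the value in the assert above */

  typedef struct {
      uint64_t magic;
      int      refcount;
  } toy_obj_t;

  static toy_obj_t *toy_new(void)
  {
      toy_obj_t *o = malloc(sizeof(*o));
      o->magic = TOY_MAGIC;
      o->refcount = 1;
      return o;
  }

  static void toy_retain(toy_obj_t *o)
  {
      assert(o->magic == TOY_MAGIC);    /* object must not have been destructed yet */
      o->refcount++;
  }

  static void toy_release(toy_obj_t *o)
  {
      assert(o->magic == TOY_MAGIC);    /* fires if the destructor already ran */
      if (--o->refcount == 0) {
          o->magic = 0;                 /* "destructor": invalidate the sentinel */
          free(o);                      /* heap case; a stack object would just be left alone */
      }
  }

  int main(void)
  {
      toy_obj_t *foo = toy_new();       /* refcount 1 */
      toy_retain(foo);                  /* refcount 2 */
      toy_release(foo);                 /* refcount 1 */
      toy_release(foo);                 /* refcount 0: destructor runs, memory freed */
      /* toy_release(foo); */           /* one release too many: the assert would fire
                                           (and, for heap objects, this is also a
                                           use-after-free) */
      return 0;
  }

Presumably something analogous happens in the reported case: two shutdown paths (ROMIO's cleanup and MPI_Finalize's attribute teardown on MPI_COMM_SELF) end up releasing the same keyval object.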
Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk
Great! I see in your other mail that you pulled something from MPICH2 to make this work. Does that mean that there's a even-newer version of ROMIO that we should pull in its entirety? It's a little risky to pull most stuff from one released version of ROMIO and then more stuff from another released version. Meaning: it's little nicer/safer to say that we have ROMIO from a single released version of MPICH2. If possible. :-) Is it possible? Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go through so many hoops to get it ready. :-( But we should do it the best way we can; we have history/precedent for taking ROMIO from a single source/released version of MPICH[2], and I'd like to maintain that precedent if at all possible. On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote: > This problem of assertion is now solved by a patch in ROMIO just commited in > http://bitbucket.org/devezep/new-romio-for-openmpi > > I don't know any other problem in this porting of ROMIO. > > Pascal > > Pascal Deveze a écrit : >> Jeff Squyres a écrit : >>> On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote: >>> >>> >>> int main(int argc, char **argv) { MPI_File fh; MPI_Info info, info_used; MPI_Init(&argc,&argv); MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh); MPI_File_close(&fh); MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh); MPI_File_close(&fh); MPI_Finalize(); } I run this programon one process : salloc -p debug -n1 mpirun -np 1 ./a.out And I get teh assertion error: a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (keyval))->obj_magic_id' failed. [cuzco10:24785] *** Process received signal *** [cuzco10:24785] Signal: Aborted (6) >>> >>> Ok. >>> >>> >>> I saw that there is a problem with an MPI_COMM_SELF communicator. The problem disappears (and all ROMIO tests are OK) when I comment line 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c : // MPI_Comm_free(&(fd->comm)); The problem disappears (and all ROMIO tests are OK) when I comment line 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c : // MPI_Keyval_free(&keyval); The problem also disappears (but only 50% of the ROMIO tests are OK) when I comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c: // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self, // ompi_mpi_comm_self.comm.c_keyhash); >>> >>> It sounds like there's a problem with the ordering of shutdown of things in >>> MPI_FINALIZE w.r.t. ROMIO. >>> >>> FWIW: ROMIO violates some of our abstractions, but it's the price we pay >>> for using a 3rd party package. One very, very important abstraction that >>> we have is that no top-level MPI API functions are not allowed to call any >>> other MPI API functions. E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot >>> call MPI_Isend (i.e., ompi/mpi/c/isend.c). MPI_Send *can* call the same >>> back-end implementation functions that isend does -- it's just not allowed >>> to call MPI_. >>> >>> The reason is that the top-level MPI API functions do things like check for >>> whether MPI_INIT / MPI_FINALIZE have been called, etc. The back-end >>> functions do not do this. Additionally, top-level MPI API functions may be >>> overridden via PMPI kinds of things. We wouldn't want our internal library >>> calls to get intercepted by user code. 
>>> >>> >>> I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till now I do not understand what is the real origin of that problem. >>> >>> RETAIN/RELEASE is part of OMPI's "poor man's C++" design. Wy back in >>> the beginning of the project, we debated whether to use C or C++ for >>> developing the code. There was a desire to use some of the basic object >>> functionality of C++ (e.g., derived classes, constructors, destructors, >>> etc.), but we wanted to stay as portable as possible. So we ended up going >>> with C, but with a few macros that emulate some C++-like functionality. >>> This led to OMPI's OBJ system that is used all over the place. >>> >>> The OBJ system does several things: >>> >>> - allows you to have "constructor"- and "destructor"-like behavior for >>> structs >>> - works for both stack and heap memory >>> - reference counting >>> >>> The reference counting is perhaps the most-used function of OBJ. Here's a >>> sample scenario: >>> >>> /* allocate some memory, call the some_object_type "constructor", >>>and set the reference count of "foo" to 1 */ >>> foo = OBJ_NEW(s
Re: [OMPI devel] OMPI 1.4.3 hangs in gather
+1 on what Pasha said -- if using rdmacm fixes the problem, then there's something else nefarious going on...

You might want to run padb on your hangs to see where all the processes are hung and whether anything obvious jumps out. I'd be surprised if there's a bug in the oob cpc; it's been around for a long, long time; it should be pretty stable.

Do we create QPs differently between oob and rdmacm, such that perhaps they are "better" (maybe better routed, or using a different SL, or ...) when created via rdmacm?

On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:

> RDMACM or OOB cannot affect the performance of this benchmark, since they are not involved in communication. So I'm not sure that the performance changes that you see are related to connection manager changes.
> About oob: I'm not aware of any hang issue there; the code is very, very old, and we have not touched it in a long time.
>
> Regards,
>
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> Email: sham...@ornl.gov
>
> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>
>> Hi,
>>
>> For the first problem, I can see that when using rdmacm as the openib oob I get much better performance results (and no hangs!).
>>
>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl sm,self,openib -mca btl_openib_connect_rdmacm_priority 100 imb/src/IMB-MPI1 gather -npmin 64
>>
>>   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>        0          1000         0.04         0.05         0.05
>>        1          1000        19.64        19.69        19.67
>>        2          1000        19.97        20.02        19.99
>>        4          1000        21.86        21.96        21.89
>>        8          1000        22.87        22.94        22.90
>>       16          1000        24.71        24.80        24.76
>>       32          1000        27.23        27.32        27.27
>>       64          1000        30.96        31.06        31.01
>>      128          1000        36.96        37.08        37.02
>>      256          1000        42.64        42.79        42.72
>>      512          1000        60.32        60.59        60.46
>>     1024          1000        82.44        82.74        82.59
>>     2048          1000       497.66       499.62       498.70
>>     4096          1000       684.15       686.47       685.33
>>     8192           519       544.07       546.68       545.85
>>    16384           519       653.20       656.23       655.27
>>    32768           519       704.48       707.55       706.60
>>    65536           519       918.00       922.12       920.86
>>   131072           320      2414.08      2422.17      2418.20
>>   262144           160      4198.25      4227.58      4213.19
>>   524288            80      7333.04      7503.99      7438.18
>>  1048576            40     13692.60     14150.20     13948.75
>>  2097152            20     30377.34     32679.15     31779.86
>>  4194304            10     61416.70     71012.50     68380.04
>>
>> How can the oob cause the hang? Isn't it only used to bring up the connection?
>> Does the oob play any part after the connections are made?
>>
>> Thanks,
>> Doron
>>
>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham wrote:
>>>
>>> Hi
>>>
>>> All machines on the setup are IDataPlex with Nehalem, 12 cores per node, 24GB memory.
>>>
>>> Problem 1 - OMPI 1.4.3 hangs in gather:
>>>
>>> I'm trying to run IMB with the gather operation and OMPI 1.4.3 (vanilla).
>>> The hang happens when np >= 64 and the message size exceeds 4k:
>>>
>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>>>
>>> voltairenodes consists of 64 machines.
>>>
>>> #
>>> # Benchmarking Gather
>>> # #processes = 64
>>> #
>>>    #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>         0          1000         0.02         0.02         0.02
>>>         1           331        14.02        14.16        14.09
>>>         2           331        12.87        13.08        12.93
>>>         4           331        14.29        14.43        14.34
>>>         8           331        16.03        16.20        16.11
>>>        16           331        17.54        17.74        17.64
>>>        32           331        20.49        20.62        20.53
>>>        64           331        23.57        23.84        23.70
>>>       128           331        28.02        28.35        28.18
>>>       256           331        34.78        34.88        34.80
>>>       512           331        46.34        46.91        46.60
>>>      1024           331        63.96        64.71        64.33
>>>      2048           331       460.67       465.74       463.18
>>>      4096           331       637.33       643.9
Re: [OMPI devel] RFC: Bring the latest ROMIO version from MPICH2-1.3 into the trunk
On Jan 13, 2011, at 14:08 , Jeff Squyres wrote: > Great! > > I see in your other mail that you pulled something from MPICH2 to make this > work. > > Does that mean that there's a even-newer version of ROMIO that we should pull > in its entirety? It's a little risky to pull most stuff from one released > version of ROMIO and then more stuff from another released version. Meaning: > it's little nicer/safer to say that we have ROMIO from a single released > version of MPICH2. My understanding is that the MPICH guys provided a patch for the MPI attribute issue. As such the version here is the most up to date. george. > > If possible. :-) > > Is it possible? > > Don't get me wrong -- I want the new ROMIO, and I'm sorry you've had to go > through so many hoops to get it ready. :-( But we should do it the best way > we can; we have history/precedent for taking ROMIO from a single > source/released version of MPICH[2], and I'd like to maintain that precedent > if at all possible. > > > On Jan 13, 2011, at 8:04 AM, Pascal Deveze wrote: > >> This problem of assertion is now solved by a patch in ROMIO just commited in >> http://bitbucket.org/devezep/new-romio-for-openmpi >> >> I don't know any other problem in this porting of ROMIO. >> >> Pascal >> >> Pascal Deveze a écrit : >>> Jeff Squyres a écrit : On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote: > int main(int argc, char **argv) { > MPI_File fh; > MPI_Info info, info_used; > > MPI_Init(&argc,&argv); > > MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, > MPI_INFO_NULL, &fh); > MPI_File_close(&fh); > > MPI_File_open(MPI_COMM_WORLD, "/tmp/A", MPI_MODE_CREATE | MPI_MODE_RDWR, > MPI_INFO_NULL, &fh); > MPI_File_close(&fh); > > MPI_Finalize(); > } > > I run this programon one process : salloc -p debug -n1 mpirun -np 1 > ./a.out > And I get teh assertion error: > > a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion > `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) > (keyval))->obj_magic_id' failed. > [cuzco10:24785] *** Process received signal *** > [cuzco10:24785] Signal: Aborted (6) > > Ok. > I saw that there is a problem with an MPI_COMM_SELF communicator. > > The problem disappears (and all ROMIO tests are OK) when I comment line > 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c : > // MPI_Comm_free(&(fd->comm)); > > The problem disappears (and all ROMIO tests are OK) when I comment line > 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c : > // MPI_Keyval_free(&keyval); > > The problem also disappears (but only 50% of the ROMIO tests are OK) when > I comment line 133 in the file ompi/runtime/ompi_mpi_finalize.c: > // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self, > // ompi_mpi_comm_self.comm.c_keyhash); > > It sounds like there's a problem with the ordering of shutdown of things in MPI_FINALIZE w.r.t. ROMIO. FWIW: ROMIO violates some of our abstractions, but it's the price we pay for using a 3rd party package. One very, very important abstraction that we have is that no top-level MPI API functions are not allowed to call any other MPI API functions. E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot call MPI_Isend (i.e., ompi/mpi/c/isend.c). MPI_Send *can* call the same back-end implementation functions that isend does -- it's just not allowed to call MPI_. The reason is that the top-level MPI API functions do things like check for whether MPI_INIT / MPI_FINALIZE have been called, etc. The back-end functions do not do this. 
Additionally, top-level MPI API functions may be overridden via PMPI kinds of things. We wouldn't want our internal library calls to get intercepted by user code. > I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism and till > now I do not understand what is the real origin of that problem. > > RETAIN/RELEASE is part of OMPI's "poor man's C++" design. Wy back in the beginning of the project, we debated whether to use C or C++ for developing the code. There was a desire to use some of the basic object functionality of C++ (e.g., derived classes, constructors, destructors, etc.), but we wanted to stay as portable as possible. So we ended up going with C, but with a few macros that emulate some C++-like functionality. This led to OMPI's OBJ system that is used all over the place. The OBJ system does several things: - allows you to have "constructor"- and "destructor"-like behavior for structs
Re: [OMPI devel] OMPI 1.4.3 hangs in gather
RDMACM creates the same QPs with the same tunings as OOB, so I don't see how the CPC could affect performance.

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:

> +1 on what Pasha said -- if using rdmacm fixes the problem, then there's something else nefarious going on...
>
> You might want to run padb on your hangs to see where all the processes are hung and whether anything obvious jumps out. I'd be surprised if there's a bug in the oob cpc; it's been around for a long, long time; it should be pretty stable.
>
> Do we create QPs differently between oob and rdmacm, such that perhaps they are "better" (maybe better routed, or using a different SL, or ...) when created via rdmacm?
>
> On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:
>
>> RDMACM or OOB cannot affect the performance of this benchmark, since they are not involved in communication. So I'm not sure that the performance changes that you see are related to connection manager changes.
>> About oob: I'm not aware of any hang issue there; the code is very, very old, and we have not touched it in a long time.
>>
>> Regards,
>>
>> Pavel (Pasha) Shamis
>> ---
>> Application Performance Tools Group
>> Computer Science and Math Division
>> Oak Ridge National Laboratory
>> Email: sham...@ornl.gov
>>
>> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>>
>>> Hi,
>>>
>>> For the first problem, I can see that when using rdmacm as the openib oob I get much better performance results (and no hangs!).
>>>
>>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl sm,self,openib -mca btl_openib_connect_rdmacm_priority 100 imb/src/IMB-MPI1 gather -npmin 64
>>>
>>>   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>        0          1000         0.04         0.05         0.05
>>>        1          1000        19.64        19.69        19.67
>>>        2          1000        19.97        20.02        19.99
>>>        4          1000        21.86        21.96        21.89
>>>        8          1000        22.87        22.94        22.90
>>>       16          1000        24.71        24.80        24.76
>>>       32          1000        27.23        27.32        27.27
>>>       64          1000        30.96        31.06        31.01
>>>      128          1000        36.96        37.08        37.02
>>>      256          1000        42.64        42.79        42.72
>>>      512          1000        60.32        60.59        60.46
>>>     1024          1000        82.44        82.74        82.59
>>>     2048          1000       497.66       499.62       498.70
>>>     4096          1000       684.15       686.47       685.33
>>>     8192           519       544.07       546.68       545.85
>>>    16384           519       653.20       656.23       655.27
>>>    32768           519       704.48       707.55       706.60
>>>    65536           519       918.00       922.12       920.86
>>>   131072           320      2414.08      2422.17      2418.20
>>>   262144           160      4198.25      4227.58      4213.19
>>>   524288            80      7333.04      7503.99      7438.18
>>>  1048576            40     13692.60     14150.20     13948.75
>>>  2097152            20     30377.34     32679.15     31779.86
>>>  4194304            10     61416.70     71012.50     68380.04
>>>
>>> How can the oob cause the hang? Isn't it only used to bring up the connection?
>>> Does the oob play any part after the connections are made?
>>>
>>> Thanks,
>>> Doron
>>>
>>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham wrote:
>>>>
>>>> Hi
>>>>
>>>> All machines on the setup are IDataPlex with Nehalem, 12 cores per node, 24GB memory.
>>>>
>>>> Problem 1 - OMPI 1.4.3 hangs in gather:
>>>>
>>>> I'm trying to run IMB with the gather operation and OMPI 1.4.3 (vanilla).
>>>> The hang happens when np >= 64 and the message size exceeds 4k:
>>>>
>>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>>>>
>>>> voltairenodes consists of 64 machines.
>>>>
>>>> #
>>>> # Benchmarking Gather
>>>> # #processes = 64
>>>> #
>>>>    #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>         0          1000         0.02         0.02         0.02
>>>>         1           331        14.02        14.16        14.09
>>>>         2           331        12.87        13.08        12.93
>>>>         4           331        14.29        14.43        14.34
>>>>         8           331        16.03        16.20        16.11
>>>>        16           331        17.54        17.74        17.64
>>>>        32           331        20.49        20.62        20.53
>>>>        64           331        23.57        23.84        23.70
>>>>       128           331        28