Re: [OMPI devel] BTL add procs errors
Couldn't explain it better. Thanks Jeff for the summary!

On Tue, 1 Jun 2010, Jeff Squyres wrote:

> On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
>
>> Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a more informational status return?
>
> No, I think Sylvain's talking about slightly modifying the existing mechanism:
>
> 1. Return OMPI_SUCCESS: the BML then obeys whatever is in the connectivity bitmask -- even if the bitmask indicates that this BTL can't talk to anyone.
> 2. Return != OMPI_SUCCESS: treat the problem as a fatal error.
>
> I think Sylvain's point is that OMPI_SUCCESS can be returned for non-fatal errors if a BTL just wants to be ignored. In such cases, the BTL can just set its connectivity mask to 0. This will allow OMPI to continue the job. For example, if verbs is borked (e.g., can't create CQ's), it can return a connectivity mask of 0 and OMPI_SUCCESS. The BML is then free to fail over to some other BTL. But if a malloc() fails down in some BTL, then the job is hosed anyway -- so why not return != OMPI_SUCCESS and try to abort cleanly?
>
> For sites that want to treat verbs failures as fatal, we can add a new MCA param: either one in the openib BTL that says "treat all init failures as fatal to the job", or perhaps one in R2 that says "if the connectivity map for this BTL is empty, abort the job". Or something like that.
>
>> If so, then it would seem the only real dispute here is: is there -any- condition whereby a given BTL should have the authority to tell OMPI to terminate an application, even if other BTLs could still function?
>
> I think his cited example was if malloc() fails. I could see some sites wanting to abort if their high-speed network was down (e.g., the MX or openib BTLs failed to init) -- they wouldn't want OMPI to fail over to TCP. The flip side of this argument is that the sysadmin could set "btl = ^tcp" in the system file, and then if openib/mx fails, the BML will abort because some peers won't be reachable.
>
>> I understand that the current code may not yet support that operation, but I do believe that was the intent of the design. So only the case where -all- BTLs say "I can't do it" would result in termination. Rather than change that design, I believe the intent is to work towards completing that implementation. In the interim, it would seem most sensible to me that we add an MCA param that specifies the termination behavior (i.e., attempt to continue or terminate on the first fatal BTL error).
>
> Agreed. I think that there are multiple different exit conditions from a BTL init:
>
> 1. BTL succeeded in initializing, and some peers are reachable
> 2. BTL succeeded in initializing, and no peers are reachable
> 3. BTL failed to initialize, but that failure is localized to the BTL (e.g., openib failed to create a CQ)
> 4. BTL failed to initialize, and the error is global in nature (e.g., malloc() fails)
>
> I think it might be a site-specific decision as to whether to abort the job for condition 3 or not. Today we default to not aborting in that case; the only way to make it fatal is indirect (i.e., setting btl = ^tcp).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
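To make the convention Jeff summarizes concrete, here is a minimal sketch of an add_procs that follows it. The ex_/EX_ types, names, and return codes are invented stand-ins, not the real OMPI interfaces; only the overall rule (empty reachability mask plus success means "ignore this BTL", an error return means "abort the job") comes from the discussion above.

#include <stdint.h>
#include <stdlib.h>

/* Invented, simplified stand-ins for the real OMPI types and return codes. */
#define EX_SUCCESS          0
#define EX_ERR_OUT_OF_RES  (-2)
typedef struct { int device_ok; } ex_btl_t;
typedef struct { int rank; }      ex_proc_t;
typedef struct { ex_proc_t *p; }  ex_endpoint_t;

/* endpoints[] is assumed to be an array of nprocs pointers provided by the caller. */
int ex_btl_add_procs(ex_btl_t *btl, size_t nprocs, ex_proc_t **procs,
                     ex_endpoint_t **endpoints, uint64_t *reachable)
{
    *reachable = 0;                         /* start with "no peers reachable" */

    if (!btl->device_ok) {
        /* Conditions 2/3: the job can continue, this BTL just wants to be
         * ignored -- report success with an empty connectivity mask. */
        return EX_SUCCESS;
    }

    for (size_t i = 0; i < nprocs; ++i) {
        endpoints[i] = malloc(sizeof(ex_endpoint_t));
        if (NULL == endpoints[i]) {
            /* Condition 4: a global problem; return an error so the job can
             * abort cleanly. */
            return EX_ERR_OUT_OF_RES;
        }
        endpoints[i]->p = procs[i];
        *reachable |= (uint64_t)1 << i;     /* condition 1: peer i is reachable */
    }
    return EX_SUCCESS;
}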
[OMPI devel] Wrong documentation for MPI_Comm_size manual page
I'm working on some intercommunicator stuff at the moment. According to MPI-2.2 standard: "An inter-communication is a point-to-point communication between processes in different groups" [Section 6.6] yet the "man" page for MPI_Comm_size reads: "If the communicator is an **intra-communicator** (enables communication between two groups), this function returns the size of the local group" Shouldn't that be **inter-communicator**? Thanks, Simon
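A small self-contained program that demonstrates the behavior the man page should describe, using only standard MPI calls (run with at least 2 processes): on an inter-communicator, MPI_Comm_size returns the size of the local group and MPI_Comm_remote_size returns the size of the remote group.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, local_size, remote_size;
    MPI_Comm local_comm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split the world into two groups (even/odd ranks). */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &local_comm);

    /* The other group's leader in MPI_COMM_WORLD: world rank 1 for the even
     * group, world rank 0 for the odd group. */
    int remote_leader = (0 == color) ? 1 : 0;
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader,
                         12345, &intercomm);

    MPI_Comm_size(intercomm, &local_size);          /* size of my (local) group */
    MPI_Comm_remote_size(intercomm, &remote_size);  /* size of the remote group */
    printf("rank %d: local group size %d, remote group size %d\n",
           world_rank, local_size, remote_size);

    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}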
Re: [OMPI devel] BTL add procs errors
On Tue, 1 Jun 2010, Jeff Squyres wrote:

> On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
>
>> In my case, the error happens in:
>> mca_btl_openib_add_procs()
>> mca_btl_openib_size_queues()
>> adjust_cq()
>> ibv_create_cq_compat()
>> ibv_create_cq()
>
> Can you nail this down any further? If I modify adjust_cq() to always return OMPI_ERROR, I see the openib BTL fail over properly to the TCP BTL.

It must be because create_cq actually creates the CQs. Try to apply this patch, which makes create_cq_compat() *not* create the CQs and return an error instead:

diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Fri May 28 14:50:25 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib.c Wed Jun 02 10:56:57 2010 +0200
@@ -146,6 +146,7 @@
 int cqe, void *cq_context, struct ibv_comp_channel *channel,
 int comp_vector)
 {
+return OMPI_ERROR;
 #if OMPI_IBV_CREATE_CQ_ARGS == 3
 return ibv_create_cq(context, cqe, channel);
 #else

You should see MPI_Init complete nicely and your application segfault on the next MPI operation.

Sylvain
Re: [OMPI devel] BTL add procs errors
I don't have any IB nodes, but I'm interested to see how this happens. What I would like to understand here is how do we get back in the OpenIB code if the add_procs failed for the BTL ... george. On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote: > On Tue, 1 Jun 2010, Jeff Squyres wrote: > >> On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote: >> >>> In my case, the error happens in : >>> mca_btl_openib_add_procs() >>> mca_btl_openib_size_queues() >>> adjust_cq() >>> ibv_create_cq_compat() >>> ibv_create_cq() >> >> Can you nail this down any further? If I modify adjust_cq() to always >> return OMPI_ERROR, I see the openib BTL fail over properly to the TCP BTL. > It must be because create_cq actually creates cqs. Try to apply this patch > which makes create_cq_compat() *not* creates the cqs and return an error > instead : > > diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c > --- a/ompi/mca/btl/openib/btl_openib.c Fri May 28 14:50:25 2010 +0200 > +++ b/ompi/mca/btl/openib/btl_openib.c Wed Jun 02 10:56:57 2010 +0200 > @@ -146,6 +146,7 @@ > int cqe, void *cq_context, struct ibv_comp_channel *channel, > int comp_vector) > { > +return OMPI_ERROR; > #if OMPI_IBV_CREATE_CQ_ARGS == 3 > return ibv_create_cq(context, cqe, channel); > #else > > > You should see MPI_Init complete nicely and your application segfault on the > next MPI operation. > > Sylvain > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
I think adding support for sysv shared memory is a good thing. However, I have some strong objections over the implementation in the hg tree. Here are 2 of the major ones: 1) the sysv shared memory creation is __atomic__ based on the flags used. Therefore, all the RML messages exchange is totally useless. 2) the whole code is replicated in the 3 files (mmap, sysv and windows), even the common parts. However in the sysv case most of the comments have been modified to remove all capitals letter. I'm in favor of extracting all the common parts and moving them in a special file. What should be kept in the particular files should only be the really different parts (small part of the init and finalize). george. On Jun 1, 2010, at 19:26 , Samuel K. Gutierrez wrote: > Hi all, > > Configure option added: --enable-sysv (default: disabled). > > For sysv testing purposes, please enable. > > Thanks! > > -- > Samuel K. Gutierrez > Los Alamos National Laboratory > > On Jun 1, 2010, at 11:11 AM, Samuel K. Gutierrez wrote: > >> Doh! >> >> bitbucket repository: http://bitbucket.org/samuelkgutierrez/ompi_sysv_sm >> >> Thanks, >> >> -- >> Samuel K. Gutierrez >> Los Alamos National Laboratory >> >> >> On Jun 1, 2010, at 11:08 AM, Samuel K. Gutierrez wrote: >> >>> WHAT: New System V shared memory component. >>> >>> WHY: https://svn.open-mpi.org/trac/ompi/ticket/1320 >>> >>> WHERE: >>> M ompi/mca/btl/sm/btl_sm.c >>> M ompi/mca/btl/sm/btl_sm_component.c >>> M ompi/mca/btl/sm/btl_sm.h >>> M ompi/mca/mpool/sm/mpool_sm_component.c >>> M ompi/mca/mpool/sm/mpool_sm.h >>> M ompi/mca/mpool/sm/mpool_sm_module.c >>> A ompi/mca/common/sm/configure.m4 >>> A ompi/mca/common/sm/common_sm_sysv.h >>> A ompi/mca/common/sm/common_sm_windows.c >>> A ompi/mca/common/sm/common_sm_windows.h >>> A ompi/mca/common/sm/common_sm.c >>> A ompi/mca/common/sm/common_sm_sysv.c >>> A ompi/mca/common/sm/common_sm.h >>> M ompi/mca/common/sm/common_sm_mmap.c >>> M ompi/mca/common/sm/common_sm_mmap.h >>> M ompi/mca/common/sm/Makefile.am >>> M ompi/mca/common/sm/help-mpi-common-sm.txt >>> M ompi/mca/coll/sm/coll_sm_module.c >>> M ompi/mca/coll/sm/coll_sm.h >>> >>> WHEN: Upon acceptance. >>> >>> TIMEOUT: Tuesday, June 8, 2010 (after devel concall). >>> >>> HOW: >>> MCA mpi: parameter "mpi_common_sm" (current value: , >>>data source: default value) >>>Which shared memory support will be used. Valid >>>values: sysv,mmap - or a comma delimited combination >>>of them (order dependent). The first component that >>>is successfully selected is used. >>> >>> Thanks! >>> >>> -- >>> Samuel K. Gutierrez >>> Los Alamos National Laboratory >>> >>> >>> >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Wrong documentation for MPI_Comm_size manual page
Absolutely correct. I've fixed it on the dev trunk and filed tickets to get the fix moved into the release branches. Thanks! On Jun 2, 2010, at 4:41 AM, Number Cruncher wrote: > I'm working on some intercommunicator stuff at the moment. According to > MPI-2.2 standard: > "An inter-communication is a point-to-point communication between > processes in different groups" [Section 6.6] > > yet the "man" page for MPI_Comm_size reads: > "If the communicator is an **intra-communicator** (enables > communication between two groups), this function returns the size of > the local group" > > Shouldn't that be **inter-communicator**? > > Thanks, > Simon > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] BTL add procs errors
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote: > It must be because create_cq actually creates cqs. Try to apply this > patch which makes create_cq_compat() *not* creates the cqs and return an > error instead : > > diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c > --- a/ompi/mca/btl/openib/btl_openib.c Fri May 28 14:50:25 2010 +0200 > +++ b/ompi/mca/btl/openib/btl_openib.c Wed Jun 02 10:56:57 2010 +0200 > @@ -146,6 +146,7 @@ > int cqe, void *cq_context, struct ibv_comp_channel *channel, > int comp_vector) > { > +return OMPI_ERROR; > #if OMPI_IBV_CREATE_CQ_ARGS == 3 > return ibv_create_cq(context, cqe, channel); > #else > Don't you mean return NULL? This function is supposed to return a (struct ibv_cq *). > You should see MPI_Init complete nicely and your application segfault on > the next MPI operation. That wouldn't surprise me if you return OMPI_ERROR here, since it's expecting a pointer return value (OMPI_ERROR != NULL, so the error check from ibv_create_cq_compat() won't detect the problem properly). Sidenote: why did we call it ibv_create_cq_compat()? That seems like a namespace violation, and is quite confusing. :-\ -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
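Following Jeff's point that the function returns a (struct ibv_cq *), the injected line in Sylvain's debug patch would presumably need to be "return NULL" rather than "return OMPI_ERROR", so that the caller's pointer check actually detects the failure. The same hunk, adjusted accordingly (a sketch of the intended experiment, not a tested patch):

diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Fri May 28 14:50:25 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib.c Wed Jun 02 10:56:57 2010 +0200
@@ -146,6 +146,7 @@
 int cqe, void *cq_context, struct ibv_comp_channel *channel,
 int comp_vector)
 {
+return NULL; /* simulate ibv_create_cq() failure */
 #if OMPI_IBV_CREATE_CQ_ARGS == 3
 return ibv_create_cq(context, cqe, channel);
 #else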
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote: > I think adding support for sysv shared memory is a good thing. However, I > have some strong objections over the implementation in the hg tree. Here are > 2 of the major ones: > > 1) the sysv shared memory creation is __atomic__ based on the flags used. > Therefore, all the RML messages exchange is totally useless. Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it. The initializing of the segment after it is created/attached could be pipelined a little more. E.g, since the init has an atomicly-set flag indicating when it's done, the root could create the seg, signal the others that they can attach, and then do the init -- the non-root procs can wait for flag to change atomicly to know when the seg has been initialized). Is that what you're referring to? > 2) the whole code is replicated in the 3 files (mmap, sysv and windows), even > the common parts. However in the sysv case most of the comments have been > modified to remove all capitals letter. I'm in favor of extracting all the > common parts and moving them in a special file. What should be kept in the > particular files should only be the really different parts (small part of the > init and finalize). Sam -- are the common parts really common? I.e., could they be factored out? Or are they "just different enough" that factoring them out would be a PITA? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
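For reference, the pipelined initialization Jeff describes could look roughly like the sketch below. This is a generic illustration using C11 atomics; the real common/sm code uses OPAL's own atomics and control structures, and the names here (seg_header_t, seg_init_root, seg_wait_ready) are invented.

#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Header at the start of the shared region. */
typedef struct {
    atomic_int initialized;   /* 0 until the creating process finishes init */
    size_t     seg_size;
    /* ... free lists, offsets, etc. ... */
} seg_header_t;

/* Creator (root): set up the header, then publish it with a release store. */
void seg_init_root(void *seg_base, size_t seg_size)
{
    seg_header_t *hdr = (seg_header_t *)seg_base;
    memset(hdr, 0, sizeof(*hdr));
    hdr->seg_size = seg_size;
    atomic_store_explicit(&hdr->initialized, 1, memory_order_release);
}

/* Non-root: attach (e.g., after learning the segment's identity over RML),
 * then wait until the creator's release store becomes visible. */
void seg_wait_ready(void *seg_base)
{
    seg_header_t *hdr = (seg_header_t *)seg_base;
    while (0 == atomic_load_explicit(&hdr->initialized, memory_order_acquire)) {
        /* busy-wait; real code would yield or call the progress engine here */
    }
}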
Re: [OMPI devel] RFC: move hwloc code base to opal/hwloc
To follow up on this RFC... We discussed this RFC on the weekly call and no one seemed to hate it. But there was a desire to: a) be able to compile out hwloc for environments that don't want/need it (e.g., embedded environments) b) have some degree of isolation in case hwloc ever dies c) have some comonality of hwloc support (e.g., a central copy of the topology as an OPAL global variable, etc.) The agreed-on compromise was to have a small set of OPAL wrappers that hide the real hwloc API. I.e., the OPAL/ORTE/OMPI code bases would use the OPAL wrappers, not hwloc itself. This allows OMPI to cleanly compile out hwloc (e.g., return OPAL_ERR_NOT_AVAILABLE when hwloc is compiled out) for platforms that do not want hwloc support and hwloc-unsupported platforms. The ball is in my court to come up with a decent OPAL subset of the hwloc API that makes sense. On the one hand, the hwloc API is huge because it has many, many accessors for all different kinds of access patterns. But OTOH, we probably don't need all those accessors, even if having a smaller set of accessors may mean slightly less convenient/efficient access to the hwloc data. I'll try to strike a balance and come back to the community with a proposal. On May 13, 2010, at 8:35 PM, Jeff Squyres wrote: > WHAT: hwloc is currently embedded in opal/mca/paffinity/hwloc/hwloc -- move > it to be a first class citizen in opal/hwloc. > > WHY: Let other portions of the OPAL, ORTE, and OMPI code bases use hwloc > services (remember that hwloc provides detailed topology information, not > just processor binding). > > WHERE: Move opal/mca/paffinity/hwloc/hwloc to opal/hwloc, and adjust > associated configury > > WHEN: For v1.5.1 > > TIMEOUT: Tuesday call, May 25 > > - > > MORE DETAILS: > > The hwloc code base is *much* more powerful and useful than PLPA -- it > provides a wealth of information that PLPA did not. Specifically: hwloc > provides data structures detailing the internal topology of a server. You > can see cache line sizes, NUMA layouts, sockets, cores, hardware threads, > ...etc. > > This information should be available to the entire OMPI code base -- not just > locked up in a paffinity component. Putting hwloc up in opal/hwloc makes it > available everywhere. Developers can just call hwloc_, and OMPI's build > system will automatically do all the right symbol-shifting if the embedded > hwloc is used in OMPI (and not symbol-shift if an external hwloc is used, > obviously). It's magically delicious! > > One immediate use that I'd like to see is to have the openib BTL use hwloc's > ibv functionality to find "nearby" HCAs (right now, you can only do this with > rankfiles). > > I can foresee other components using cache line size information to help tune > performance (e.g., sm btl and sm coll...?). > > To be clear: there will still be an hwloc paffinity component. It just won't > embed its own copy of hwloc anymore. It'll use the hwloc services provided > by the OMPI build system, just like the rest of the OPAL / ORTE / OMPI code > bases. > > There will also be an option to compile hwloc out altogether -- some stubs > will be left that return ERR_NOT_SUPPORTED, or somesuch (details TBD). The > reason for this is that there are some systems where processor affinity and > NUMA information aren't relevant (e.g., embedded systems). Memory footprint > is key in such systems; hwloc would simply take up valuable RAM. 
> > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
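As a rough illustration of the "small set of OPAL wrappers" idea, one wrapper might look like the sketch below. The macro OPAL_HAVE_HWLOC, the global opal_hwloc_topology, and the function name opal_hwloc_get_num_cores are invented for illustration and are not the API that was actually proposed; the only grounded detail is that a compiled-out build returns an error code such as OPAL_ERR_NOT_AVAILABLE instead of calling hwloc.

#include "opal/constants.h"   /* OPAL_SUCCESS, OPAL_ERROR, OPAL_ERR_NOT_AVAILABLE */
#if OPAL_HAVE_HWLOC           /* assumed configure-generated macro */
#include <hwloc.h>
extern hwloc_topology_t opal_hwloc_topology;   /* assumed central topology copy */
#endif

int opal_hwloc_get_num_cores(int *num_cores)
{
#if OPAL_HAVE_HWLOC
    int n = hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE);
    if (n < 0) {
        return OPAL_ERROR;
    }
    *num_cores = n;
    return OPAL_SUCCESS;
#else
    (void)num_cores;
    return OPAL_ERR_NOT_AVAILABLE;   /* hwloc compiled out */
#endif
}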
Re: [OMPI devel] RFC: Remove all other paffinity components
To follow up on this RFC... This RFC also got discussed on the weekly call (and in several other discussions). Again, no one seemed to hate it. That being said, hwloc still needs a bit more soak time; I just committed the 32 bit fix the other day. So this one will happen eventually (i.e., #1, below -- #2 is the other RFC). It'll probably be off in an hg branch at first, and then I'll bring the results to the community before bringing it back into the trunk. On May 18, 2010, at 8:50 AM, Jeff Squyres wrote: > On May 18, 2010, at 8:31 AM, Terry Dontje wrote: > >> The above sounds like you are replacing the whole paffinity framework with >> hwloc. Is that true? Or is the hwloc accessors you are talking about >> non-paffinity related? > > Good point; these have all gotten muddled in the email chain. Let me > re-state everything in one place in an attempt to be clear: > > 1. Split paffinity into two frameworks (because some OS's support one and not > the other): > - binding: just for getting and setting processor affinity > - hwmap: just for mapping (board, socket, core, hwthread) <--> OS processor > ID > --> Note that hwmap will be an expansion of the current paffinity > capabilities > > 2. Add hwloc to opal > - Commit the hwloc tree to opal/util/hwloc (or somesuch) > - Have the ability to configure hwloc out (e.g., for embedded environments) > - Add a dozen or two hwloc wrappers in opal/util/hwloc.c|h > - The rest of the OPAL/ORTE/OMPI trees *only call these wrapper functions* > -- they do not call hwloc directly > - These wrappers will call the back-end hwloc functions or return > OPAL_ERR_NOT_SUPPORTED (or somesuch) if hwloc is not available > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
On Jun 2, 2010, at 7:28 AM, Jeff Squyres wrote:

> On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
>
>> I think adding support for sysv shared memory is a good thing. However, I have some strong objections over the implementation in the hg tree. Here are 2 of the major ones:
>>
>> 1) the sysv shared memory creation is __atomic__ based on the flags used. Therefore, all the RML messages exchange is totally useless.
>
> Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it. The initializing of the segment after it is created/attached could be pipelined a little more. E.g, since the init has an atomicly-set flag indicating when it's done, the root could create the seg, signal the others that they can attach, and then do the init -- the non-root procs can wait for flag to change atomicly to know when the seg has been initialized). Is that what you're referring to?
>
>> 2) the whole code is replicated in the 3 files (mmap, sysv and windows), even the common parts. However in the sysv case most of the comments have been modified to remove all capitals letter. I'm in favor of extracting all the common parts and moving them in a special file. What should be kept in the particular files should only be the really different parts (small part of the init and finalize).
>
> Sam -- are the common parts really common? I.e., could they be factored out? Or are they "just different enough" that factoring them out would be a PITA?

I'm sure some refactoring could be done - let me take a look.

--
Samuel K. Gutierrez
Los Alamos National Laboratory

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
On Jun 2, 2010, at 09:28 , Jeff Squyres wrote: > On Jun 2, 2010, at 5:38 AM, George Bosilca wrote: > >> I think adding support for sysv shared memory is a good thing. However, I >> have some strong objections over the implementation in the hg tree. Here are >> 2 of the major ones: >> >> 1) the sysv shared memory creation is __atomic__ based on the flags used. >> Therefore, all the RML messages exchange is totally useless. > > Not sure what you mean here. common/sm may create new shmem segments at any > time (e.g., during coll sm). The RML message exchange is to ensure that only > 1 process creates and initializes the segment and then all the others just > attach to it. Absolutely not! The RML messaging is not about initializing the shared memory segment. As stated on my original text it has only one purpose: to ensure the file used by mmap is created atomically. The code for Windows do not exchange any RML messages as the function to allocate the shared memory region provided by the OS is atomic (exactly as the sysv one). > The initializing of the segment after it is created/attached could be > pipelined a little more. E.g, since the init has an atomicly-set flag > indicating when it's done, the root could create the seg, signal the others > that they can attach, and then do the init -- the non-root procs can wait for > flag to change atomicly to know when the seg has been initialized). Is that > what you're referring to? This is actually how the whole stuff is working today. As an example look at the sm BTL in file btl_sm.c line 541. george. > >> 2) the whole code is replicated in the 3 files (mmap, sysv and windows), >> even the common parts. However in the sysv case most of the comments have >> been modified to remove all capitals letter. I'm in favor of extracting all >> the common parts and moving them in a special file. What should be kept in >> the particular files should only be the really different parts (small part >> of the init and finalize). > > Sam -- are the common parts really common? I.e., could they be factored out? > Or are they "just different enough" that factoring them out would be a PITA? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote: > > Not sure what you mean here. common/sm may create new shmem segments at > > any time (e.g., during coll sm). The RML message exchange is to ensure > > that only 1 process creates and initializes the segment and then all the > > others just attach to it. > > Absolutely not! The RML messaging is not about initializing the shared memory > segment. As stated on my original text it has only one purpose: to ensure the > file used by mmap is created atomically. The code for Windows do not exchange > any RML messages as the function to allocate the shared memory region > provided by the OS is atomic (exactly as the sysv one). I thought that Sam said that it was important that only 1 process shmctl/IPC_RMID...? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] BTL add procs errors
On Wed, 2 Jun 2010, Jeff Squyres wrote:

> Don't you mean return NULL? This function is supposed to return a (struct ibv_cq *).

Oops. My bad. Yes, it should return NULL. And it seems that if I make ibv_create_cq always return NULL, the scenario described by George works smoothly: returned OMPI_ERROR => bitmask cleared => connectivity problem => stop or tcp fallback. The problem is more complicated than I thought.

But it made me progress on why I'm crashing: in my case, only a subset of processes have their create_cq fail. But others work fine, hence they request a qp creation, and my process which failed over on tcp starts creating a qp ... and crashes.

If you replace:

return NULL;

by:

if (atoi(getenv("OMPI_COMM_WORLD_RANK")) == 26) return NULL;

(yes, that's ugly, but it's just to debug the problem) and run on -say- 32 processes, you should be able to reproduce the bug. Well, unless I'm mistaken again.

The crash stack should look like this:

#0  0x003d0d605a30 in ibv_cmd_create_qp () from /usr/lib64/libibverbs.so.1
#1  0x7f28b44e049b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
#2  0x003d0d609a42 in ibv_create_qp () from /usr/lib64/libibverbs.so.1
#3  0x7f28b6be6e6e in qp_create_one () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#4  0x7f28b6be78a4 in oob_module_start_connect () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#5  0x7f28b6be7fbb in rml_recv_cb () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#6  0x7f28b8c56868 in orte_rml_recv_msg_callback () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_rml_oob.so
#7  0x7f28b8a4cf96 in mca_oob_tcp_msg_recv_complete () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#8  0x7f28b8a4e2c2 in mca_oob_tcp_peer_recv_handler () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#9  0x7f28b9496898 in opal_event_base_loop () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#10 0x7f28b948ace9 in opal_progress () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#11 0x7f28b9951ed5 in ompi_request_default_wait_all () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libmpi.so.0

This new advance may change everything. Of course, stopping at the bml level still "solves" the problem, but maybe we can fix this more properly within the openib BTL. Unless this is a general out-of-band-connection-protocol issue.

Sylvain
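One small hardening of the debug hack above: getenv() returns NULL when OMPI_COMM_WORLD_RANK is not set (e.g., for a singleton run), and atoi(NULL) would crash before reaching the interesting code path. A slightly safer variant of the same fragment (still just a debugging hack, dropped into the same function; requires <stdlib.h> for getenv/atoi) could be:

const char *rank_str = getenv("OMPI_COMM_WORLD_RANK");
if (NULL != rank_str && 26 == atoi(rank_str)) {
    return NULL;   /* make CQ creation fail on rank 26 only */
}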
Re: [OMPI devel] BTL add procs errors
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote: > But it made me progress on why I'm crashing : in my case, only a subset of > processes have their create_cq fail. Ah, this is the key. If I have one process (out of many) fail the create_cq() function, I get a segv during finalize. I'll dig. > This new advance may change everything. Of course, stopping at the bml > level still "solves" the problem, but maybe we can fix this more properly > within the openib BTL. Unless this is a general > out-of-band-connection-protocol issue (). I don't think this is an OOB CPC issue. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:

> On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
>
>>> Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it.
>>
>> Absolutely not! The RML messaging is not about initializing the shared memory segment. As stated on my original text it has only one purpose: to ensure the file used by mmap is created atomically. The code for Windows do not exchange any RML messages as the function to allocate the shared memory region provided by the OS is atomic (exactly as the sysv one).
>
> I thought that Sam said that it was important that only 1 process shmctl/IPC_RMID...?

Hi George,

We are using RML messaging in the sysv code to exchange the shared memory ID (generated by exactly one process). I'm not sure how we would go about passing along the shared memory ID without RML, but any ideas are greatly appreciated.

Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] BTL add procs errors
On 2 Jun 2010, at 16:49, Jeff Squyres wrote: > On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote: > >> But it made me progress on why I'm crashing : in my case, only a subset of >> processes have their create_cq fail. > > Ah, this is the key. If I have one process (out of many) fail the > create_cq() function, I get a segv during finalize. I'll dig. Is there an assumption that if process A claims to be able to communicate with process B that process B can also communicate with process A. It almost sounds like the code needs to do a allreduce on the bitmask returned by the btls. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
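To make Ashley's suggestion concrete: reducing the per-process reachability masks so that both sides agree amounts to AND-ing each pair of claims. A toy sketch in plain C (invented names; the real code would have to perform this exchange through the modex/OOB during init rather than on an in-memory matrix):

/* me = my rank, n = number of processes; claimed[i][j] != 0 means process i
 * believes it can reach process j over this BTL. After symmetrization, peer j
 * is kept only if both sides agree. */
void symmetrize_reachability(int me, int n,
                             const unsigned char claimed[n][n],
                             unsigned char reachable[n])
{
    for (int j = 0; j < n; ++j) {
        reachable[j] = claimed[me][j] && claimed[j][me];
    }
}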
Re: [OMPI devel] BTL add procs errors
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote: > > Ah, this is the key. If I have one process (out of many) fail the > > create_cq() function, I get a segv during finalize. I'll dig. > > Is there an assumption that if process A claims to be able to communicate > with process B that process B can also communicate with process A. It almost > sounds like the code needs to do a allreduce on the bitmask returned by the > btls. Actually, this is exactly the case (I just dug into the code and verified this). In this case, we're already well beyond the point where we synchronized and decided who can connect to whom. I.e., the modex is already done -- the openib BTL in process X has decided that it is available and has advertised its RDMACM CPC and OOB CPC contact info. But then later in process X during the openib BTL add_procs, something fails. So the openib clears the connect bits and transparently fails over to TCP. No problem. The problem is the other peers who think that they can still connect to process X via the openib BTL. 1. In this case, the openib BTL was not finalized, so there was a stub still there listening on the RDMACM CPC. When another process tried to connect to X's RDMACM CPC port, Bad Things happened (because it was only half setup) and we segv'ed. Obviously, this should be fixed. "Fixed" in this case probably means closing down the RDMACM CPC listening port. But then that leads to another form of Badness. 2. If the openib BTL cleanly shuts down and is *not* still listening on its modex-advertised RDMACM CPC contact port, then if some other process tries to contact process X via the modex info, it'll fail. This will then be judged to be a fatal error. Failover in the BML will simply have delayed the job abort until someone tries to contact X via the openib BTL. I think that the majority of this discussion about the BML failure (or not) behavior assumed that *all* processes had the same failure (at least: *I* assumed this). But if only *some* of the processes fail a given BTL add_procs, we have a problem because we're beyond the point of deciding who can connect to whom. Shutting down a single BTL module at that point will create an inconsistency of the distributed data. That seems wrong. What to do? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] BTL add procs errors
On Jun 2, 2010, at 12:18 , Jeff Squyres wrote: > On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote: > >>> Ah, this is the key. If I have one process (out of many) fail the >>> create_cq() function, I get a segv during finalize. I'll dig. >> >> Is there an assumption that if process A claims to be able to communicate >> with process B that process B can also communicate with process A. It >> almost sounds like the code needs to do a allreduce on the bitmask returned >> by the btls. > > Actually, this is exactly the case (I just dug into the code and verified > this). > > In this case, we're already well beyond the point where we synchronized and > decided who can connect to whom. I.e., the modex is already done -- the > openib BTL in process X has decided that it is available and has advertised > its RDMACM CPC and OOB CPC contact info. > > But then later in process X during the openib BTL add_procs, something fails. > So the openib clears the connect bits and transparently fails over to TCP. > No problem. > > The problem is the other peers who think that they can still connect to > process X via the openib BTL. > > 1. In this case, the openib BTL was not finalized, so there was a stub still > there listening on the RDMACM CPC. When another process tried to connect to > X's RDMACM CPC port, Bad Things happened (because it was only half setup) and > we segv'ed. > > Obviously, this should be fixed. "Fixed" in this case probably means closing > down the RDMACM CPC listening port. But then that leads to another form of > Badness. I wonder how this is possible. If a process X fails to connect to Y, how can Y succeed to connect to X ? Please enlighten me ... > > 2. If the openib BTL cleanly shuts down and is *not* still listening on its > modex-advertised RDMACM CPC contact port, then if some other process tries to > contact process X via the modex info, it'll fail. This will then be judged > to be a fatal error. Failover in the BML will simply have delayed the job > abort until someone tries to contact X via the openib BTL. Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one and the connection fails, then the PML will automatically try to use the next available BTL, so it will eventually fail over TCP (if available). > > I think that the majority of this discussion about the BML failure (or not) > behavior assumed that *all* processes had the same failure (at least: *I* > assumed this). But if only *some* of the processes fail a given BTL > add_procs, we have a problem because we're beyond the point of deciding who > can connect to whom. Shutting down a single BTL module at that point will > create an inconsistency of the distributed data. We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A. However, if there are other BTL the connection is supposed to smoothly move over some other BTL. As an example in the MX BTL, if two nodes have MX support, but they do not share the same mapper the add_procs will silently fails. george. > > That seems wrong. > > What to do? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] BTL add procs errors
George Bosilca wrote: We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A. I've not been following this thread closely, but thought I'd add a comment. It used to be that the sm BTL could fail asymmetrically. A shared memory could be allocated and processes start to allocate resources within shared memory. At some point, the shared area would be exhausted. So, some processes were set up to communicate to others, but the others would not be able to communicate back via the same BTL. I think this led to much brokenness. (E.g., how would a process return a sm fragment to a sender?) At this point, my recollection of those issues is very fuzzy. In any case, I think those issues went away with the shared-memory work I did a while back. The size of the area is now computed to be large enough that each process's initial allocation would succeed.
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
How about ftok ? The init function takes a file_name as argument, and this file name is unique per instance of the shared memory region we want to create. We can use this file name with ftok to create a unique key_t that can be used by shmget to retrieve the shared memory identifier. george. On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote: > On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote: > >> On Jun 2, 2010, at 10:44 AM, George Bosilca wrote: >> Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it. >>> >>> Absolutely not! The RML messaging is not about initializing the shared >>> memory segment. As stated on my original text it has only one purpose: to >>> ensure the file used by mmap is created atomically. The code for Windows do >>> not exchange any RML messages as the function to allocate the shared memory >>> region provided by the OS is atomic (exactly as the sysv one). >> >> I thought that Sam said that it was important that only 1 process >> shmctl/IPC_RMID...? > > Hi George, > > We are using RML messaging in the sysv code to exchange the shared memory ID > (generated by exactly one process). I'm not sure how we would go about > passing along the shared memory ID without RML, but any ideas are greatly > appreciated. > > Thanks, > -- > Samuel K. Gutierrez > Los Alamos National Laboratory > >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
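A minimal sketch of what George is suggesting, assuming the per-instance backing file already exists and using an arbitrary project id of 1 (neither detail comes from the hg tree): every process derives the same key from the file name, so no RML exchange of the shmid is needed to find the segment.

#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach (creating if necessary) a SysV segment keyed off file_name.
 * Returns the shmid, or -1 on error; *addr_out receives the mapping. */
int attach_by_name(const char *file_name, size_t size, void **addr_out)
{
    key_t key = ftok(file_name, 1);          /* same name -> same key everywhere */
    if ((key_t)-1 == key) {
        return -1;
    }
    /* Every process calls shmget with IPC_CREAT; the kernel resolves the race
     * and they all end up with the same segment id. */
    int shmid = shmget(key, size, IPC_CREAT | 0600);
    if (-1 == shmid) {
        return -1;
    }
    void *addr = shmat(shmid, NULL, 0);
    if ((void *)-1 == addr) {
        return -1;
    }
    *addr_out = addr;
    return shmid;
}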
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
Hi George, That may work - I'll try it. Thanks! -- Samuel K. Gutierrez Los Alamos National Laboratory On Jun 2, 2010, at 10:59 AM, George Bosilca wrote: How about ftok ? The init function takes a file_name as argument, and this file name is unique per instance of the shared memory region we want to create. We can use this file name with ftok to create a unique key_t that can be used by shmget to retrieve the shared memory identifier. george. On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote: On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote: On Jun 2, 2010, at 10:44 AM, George Bosilca wrote: Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it. Absolutely not! The RML messaging is not about initializing the shared memory segment. As stated on my original text it has only one purpose: to ensure the file used by mmap is created atomically. The code for Windows do not exchange any RML messages as the function to allocate the shared memory region provided by the OS is atomic (exactly as the sysv one). I thought that Sam said that it was important that only 1 process shmctl/IPC_RMID...? Hi George, We are using RML messaging in the sysv code to exchange the shared memory ID (generated by exactly one process). I'm not sure how we would go about passing along the shared memory ID without RML, but any ideas are greatly appreciated. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] BTL add procs errors
On Jun 2, 2010, at 12:42 PM, George Bosilca wrote: > > 1. In this case, the openib BTL was not finalized, so there was a stub > > still there listening on the RDMACM CPC. When another process tried to > > connect to X's RDMACM CPC port, Bad Things happened (because it was only > > half setup) and we segv'ed. > > > > Obviously, this should be fixed. "Fixed" in this case probably means > > closing down the RDMACM CPC listening port. But then that leads to another > > form of Badness. > > I wonder how this is possible. If a process X fails to connect to Y, how can > Y succeed to connect to X ? Please enlighten me ... It doesn't. Process X segvs after it goes into the RDMACM CPC accept code (because the openib BTL was only half setup). > > 2. If the openib BTL cleanly shuts down and is *not* still listening on its > > modex-advertised RDMACM CPC contact port, then if some other process tries > > to contact process X via the modex info, it'll fail. This will then be > > judged to be a fatal error. Failover in the BML will simply have delayed > > the job abort until someone tries to contact X via the openib BTL. > > Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one > and the connection fails, then the PML will automatically try to use the next > available BTL, so it will eventually fail over TCP (if available). Yes, there is a timeout. I forget offhand what we do if the timeout occurs. We probably report the connect failure in the "normal" way, but I don't know that for sure. > > I think that the majority of this discussion about the BML failure (or not) > > behavior assumed that *all* processes had the same failure (at least: *I* > > assumed this). But if only *some* of the processes fail a given BTL > > add_procs, we have a problem because we're beyond the point of deciding who > > can connect to whom. Shutting down a single BTL module at that point will > > create an inconsistency of the distributed data. > > We did assume that at least the errors are symmetric, i.e. if A fails to > connect to B then B will fail when trying to connect to A. However, if there > are other BTL the connection is supposed to smoothly move over some other > BTL. As an example in the MX BTL, if two nodes have MX support, but they do > not share the same mapper the add_procs will silently fails. This sounds dodgy and icky. We have to wait for a connect timeout to fail over to the next BTL? How long is the typical/default TCP timeout? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] BTL add procs errors
Yes, I think the mmap code in the sm btl actually has a sync point inside add_procs that when the root allocs and sets up the area, it'll locally broadcast a "yes, we're good -- mmap attach and let's continue" or "bad things happened; sm btl is broke" message. But I am not confident about the other BTLs. On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote: > George Bosilca wrote: > > > We did assume that at least the errors are symmetric, i.e. if A fails > > to connect to B then B will fail when trying to connect to A. > > I've not been following this thread closely, but thought I'd add a comment. > > It used to be that the sm BTL could fail asymmetrically. A shared > memory could be allocated and processes start to allocate resources > within shared memory. At some point, the shared area would be > exhausted. So, some processes were set up to communicate to others, but > the others would not be able to communicate back via the same BTL. I > think this led to much brokenness. (E.g., how would a process return a > sm fragment to a sender?) > > At this point, my recollection of those issues is very fuzzy. > > In any case, I think those issues went away with the shared-memory work > I did a while back. The size of the area is now computed to be large > enough that each process's initial allocation would succeed. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
Don't forget that the RML is also used to broadcast the success/failure of the creation of the shared memory segment. If the RML goes away, be sure that you can still determine that without hanging. Personally, I still don't see the problem with using the RML stuff... On Jun 2, 2010, at 1:07 PM, Samuel K. Gutierrez wrote: > Hi George, > > That may work - I'll try it. > > Thanks! > > -- > Samuel K. Gutierrez > Los Alamos National Laboratory > > On Jun 2, 2010, at 10:59 AM, George Bosilca wrote: > > > How about ftok ? The init function takes a file_name as argument, > > and this file name is unique per instance of the shared memory > > region we want to create. We can use this file name with ftok to > > create a unique key_t that can be used by shmget to retrieve the > > shared memory identifier. > > > > george. > > > > On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote: > > > >> On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote: > >> > >>> On Jun 2, 2010, at 10:44 AM, George Bosilca wrote: > >>> > > Not sure what you mean here. common/sm may create new shmem > > segments at any time (e.g., during coll sm). The RML message > > exchange is to ensure that only 1 process creates and > > initializes the segment and then all the others just attach to it. > > Absolutely not! The RML messaging is not about initializing the > shared memory segment. As stated on my original text it has only > one purpose: to ensure the file used by mmap is created > atomically. The code for Windows do not exchange any RML messages > as the function to allocate the shared memory region provided by > the OS is atomic (exactly as the sysv one). > >>> > >>> I thought that Sam said that it was important that only 1 process > >>> shmctl/IPC_RMID...? > >> > >> Hi George, > >> > >> We are using RML messaging in the sysv code to exchange the shared > >> memory ID (generated by exactly one process). I'm not sure how we > >> would go about passing along the shared memory ID without RML, but > >> any ideas are greatly appreciated. > >> > >> Thanks, > >> -- > >> Samuel K. Gutierrez > >> Los Alamos National Laboratory > >> > >>> > >>> -- > >>> Jeff Squyres > >>> jsquy...@cisco.com > >>> For corporate legal information go to: > >>> http://www.cisco.com/web/about/doing_business/legal/cri/ > >>> > >>> > >>> ___ > >>> devel mailing list > >>> de...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: System V Shared Memory for Open MPI
Good point - I forgot about that. -- Samuel K. Gutierrez Los Alamos National Laboratory On Jun 2, 2010, at 11:40 AM, Jeff Squyres wrote: Don't forget that the RML is also used to broadcast the success/ failure of the creation of the shared memory segment. If the RML goes away, be sure that you can still determine that without hanging. Personally, I still don't see the problem with using the RML stuff... On Jun 2, 2010, at 1:07 PM, Samuel K. Gutierrez wrote: Hi George, That may work - I'll try it. Thanks! -- Samuel K. Gutierrez Los Alamos National Laboratory On Jun 2, 2010, at 10:59 AM, George Bosilca wrote: How about ftok ? The init function takes a file_name as argument, and this file name is unique per instance of the shared memory region we want to create. We can use this file name with ftok to create a unique key_t that can be used by shmget to retrieve the shared memory identifier. george. On Jun 2, 2010, at 11:53 , Samuel K. Gutierrez wrote: On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote: On Jun 2, 2010, at 10:44 AM, George Bosilca wrote: Not sure what you mean here. common/sm may create new shmem segments at any time (e.g., during coll sm). The RML message exchange is to ensure that only 1 process creates and initializes the segment and then all the others just attach to it. Absolutely not! The RML messaging is not about initializing the shared memory segment. As stated on my original text it has only one purpose: to ensure the file used by mmap is created atomically. The code for Windows do not exchange any RML messages as the function to allocate the shared memory region provided by the OS is atomic (exactly as the sysv one). I thought that Sam said that it was important that only 1 process shmctl/IPC_RMID...? Hi George, We are using RML messaging in the sysv code to exchange the shared memory ID (generated by exactly one process). I'm not sure how we would go about passing along the shared memory ID without RML, but any ideas are greatly appreciated. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] BTL add procs errors
Jeff Squyres wrote:

> Yes, I think the mmap code in the sm btl actually has a sync point inside add_procs that when the root allocs and sets up the area, it'll locally broadcast a "yes, we're good -- mmap attach and let's continue" or "bad things happened; sm btl is broke" message.

Yes, that's great. But my point was that (it used to be that) after that point, processes would start eating chunks out of that shared area and for large proc counts the last allocations would fail. (The size of the shared area was poorly chosen and happened to be insufficient.) So, despite the sync point you describe, some procs would succeed at mca_btl_sm_add_procs() while others would not. This particular case is now, I believe, resolved. It just seemed at the time like a case where the upper layers were making assumptions that were inconsistent with what the sm BTL was providing.

> But I am not confident about the other BTLs.
>
> On Jun 2, 2010, at 12:51 PM, Eugene Loh wrote:
>
>> George Bosilca wrote:
>>> We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A.
>>
>> I've not been following this thread closely, but thought I'd add a comment. It used to be that the sm BTL could fail asymmetrically. A shared memory could be allocated and processes start to allocate resources within shared memory. At some point, the shared area would be exhausted. So, some processes were set up to communicate to others, but the others would not be able to communicate back via the same BTL. I think this led to much brokenness. (E.g., how would a process return a sm fragment to a sender?) At this point, my recollection of those issues is very fuzzy. In any case, I think those issues went away with the shared-memory work I did a while back. The size of the area is now computed to be large enough that each process's initial allocation would succeed.
[OMPI devel] RFC: openib BTL failover
WHAT: New PML called "bfo" (Btl Fail Over) that supports failover between two or more openib BTLs, plus new configurable code in the openib BTL that works with the bfo to do failover. Note this only works when we have two or more openib BTLs. This does not fail over to another BTL type, like tcp.

TO CONFIGURE: --enable-openib-failover

TO RUN: --mca pml bfo

TIMEOUT: June 16, 2010

ADDITIONAL DETAILS:

The design relies on the BTL to call back into the PML with each fragment that fails so the PML can decide what needs to be done. There is no additional message tracking or software acknowledgement added, so that we can have minimal impact on latency. Testing has shown no measurable effect. When errors are detected on the BTL, it is no longer used. No effort is made to bring it back if the problems get corrected. If it gets fixed before the next job starts, then it will be used by the next job.

Under normal conditions, these changes have no effect whatsoever on the trunk, as the bfo PML is never selected and the failover support is not configured into the openib BTL. Every effort was made to minimize the changes in the openib BTL. The main changes are contained in two new files that only get compiled when the --enable-openib-failover flag is set. The other changes consist of about 75 new lines in various openib BTL files.

The bitbucket version is at: http://bitbucket.org/rolfv/rfc-failover

Here are the files that would be added/changed.

BTL LAYER
M ompi/mca/btl/btl.h
M ompi/mca/btl/base/btl_base_mca.c
M ompi/mca/btl/openib/btl_openib_component.c
M ompi/mca/btl/openib/btl_openib.c
M ompi/mca/btl/openib/btl_openib.h
M ompi/mca/btl/openib/btl_openib_endpoint.h
M ompi/mca/btl/openib/btl_openib_mca.c
A ompi/mca/btl/openib/btl_openib_failover.c
A ompi/mca/btl/openib/btl_openib_failover.h
M ompi/mca/btl/openib/btl_openib_frag.h
M ompi/mca/btl/openib/Makefile.am
M ompi/config/ompi_check_openib.m4

PML LAYER
A ompi/mca/pml/bfo
A ompi/mca/pml/bfo/pml_bfo_comm.h
A ompi/mca/pml/bfo/pml_bfo_sendreq.c
A ompi/mca/pml/bfo/pml_bfo_isend.c
A ompi/mca/pml/bfo/pml_bfo_component.c
A ompi/mca/pml/bfo/Makefile.in
A ompi/mca/pml/bfo/help-mpi-pml-bfo.txt
A ompi/mca/pml/bfo/pml_bfo_recvfrag.h
A ompi/mca/pml/bfo/pml_bfo_progress.c
A ompi/mca/pml/bfo/pml_bfo_sendreq.h
A ompi/mca/pml/bfo/pml_bfo_component.h
A ompi/mca/pml/bfo/pml_bfo_failover.c
A ompi/mca/pml/bfo/pml_bfo_recvreq.c
A ompi/mca/pml/bfo/pml_bfo_irecv.c
A ompi/mca/pml/bfo/pml_bfo_failover.h
A ompi/mca/pml/bfo/pml_bfo_recvreq.h
A ompi/mca/pml/bfo/pml_bfo_iprobe.c
A ompi/mca/pml/bfo/pml_bfo.c
A ompi/mca/pml/bfo/post_configure.sh
A ompi/mca/pml/bfo/pml_bfo_hdr.h
A ompi/mca/pml/bfo/pml_bfo_rdmafrag.c
A ompi/mca/pml/bfo/pml_bfo_rdma.c
A ompi/mca/pml/bfo/configure.params
A ompi/mca/pml/bfo/pml_bfo.h
A ompi/mca/pml/bfo/pml_bfo_rdmafrag.h
A ompi/mca/pml/bfo/pml_bfo_rdma.h
A ompi/mca/pml/bfo/.windows
A ompi/mca/pml/bfo/Makefile.am
A ompi/mca/pml/bfo/pml_bfo_comm.c
A ompi/mca/pml/bfo/pml_bfo_start.c
A ompi/mca/pml/bfo/pml_bfo_recvfrag.c
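A schematic of the callback pattern described under ADDITIONAL DETAILS, with invented names -- this is not the actual bfo/openib interface, only an illustration of "the BTL calls back into the PML with each fragment that fails so the PML can decide what needs to be done":

/* Callback type registered by the PML; invoked once per failed fragment. */
typedef void (*frag_error_cb_t)(void *frag, void *endpoint, void *cbdata);

struct example_btl {
    frag_error_cb_t error_cb;      /* set by the PML during setup */
    void           *error_cbdata;
    int             usable;        /* cleared once errors are detected */
};

/* Called from the BTL when a send/RDMA completes with an error. */
static void example_btl_handle_failed_frag(struct example_btl *btl,
                                           void *frag, void *endpoint)
{
    btl->usable = 0;                                   /* this BTL is no longer used;
                                                          no attempt to bring it back */
    btl->error_cb(frag, endpoint, btl->error_cbdata);  /* let the PML retransmit the
                                                          fragment over a surviving BTL */
}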