Re: [OMPI devel] 1.3 release date?
Brad, Many thanks for the update. Greg

On Oct 22, 2008, at 8:43 PM, Brad Benton wrote:

Greg, Here is the latest schedule that we have for getting 1.3 out the door: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 Basically, this schedule sets Nov. 10 as the release date, with a backup date of Nov. 17. Here is a bit more detail on the beta and release candidate 1 milestones prior to the general release (lifted from the wiki):

1.3 beta: Target: October 27, 2008
1.3 rc1: Target: November 3, 2008
1.3 release: Target: November 10, 2008

--Brad

On Fri, Oct 17, 2008 at 5:38 AM, Jeff Squyres wrote:

Greg -- I defer to George/Brad for plans of the specific release date. We hope to be feature complete by early next week. This clears the way for a "beta" release. Specifically, there are two things we're waiting for:

1. Some FT stuff that Tim/Josh think can be done by this weekend
2. A critical code review for a big openib BTL change that will be done when Pasha and I are at the Chicago Forum meeting on Monday

On Oct 15, 2008, at 4:48 PM, Greg Watson wrote:

Hi all, Has a release date been set for 1.3? Thanks, Greg

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] 1.3 release date?
Greg, Here is the latest schedule that we have for getting 1.3 out the door: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 Basically, this schedule sets Nov. 10 as the release date, with a backup date of Nov. 17. Here is a bit more detail on the beta and release candidate 1 milestones prior to the general release (lifted from the wiki):

1.3 beta: Target: October 27, 2008
1.3 rc1: Target: November 3, 2008
1.3 release: Target: November 10, 2008

--Brad

On Fri, Oct 17, 2008 at 5:38 AM, Jeff Squyres wrote:
> Greg -- I defer to George/Brad for plans of the specific release date.
>
> We hope to be feature complete by early next week. This clears the way for a "beta" release. Specifically, there are two things we're waiting for:
>
> 1. Some FT stuff that Tim/Josh think can be done by this weekend
> 2. A critical code review for a big openib BTL change that will be done when Pasha and I are at the Chicago Forum meeting on Monday
>
> On Oct 15, 2008, at 4:48 PM, Greg Watson wrote:
>> Hi all,
>> Has a release date been set for 1.3?
>> Thanks,
>> Greg
>
> --
> Jeff Squyres
> Cisco Systems
Re: [OMPI devel] Restarting processes on different node
Leonardo, As you say, there is the possibility that moving from one node to another has caused problems due to different shared libraries. The result could be a segmentation fault, an illegal instruction, or even a bus error. In all three cases, however, the failure generates a signal (SIGSEGV, SIGILL or SIGBUS). So it is possible that you are seeing the failure mode that you were expecting.

There are at least two ways you can deal with heterogeneous libraries. The first is that if the libs differ only due to prelinking, you can undo the prelinking as described in the BLCR FAQ (http://mantis.lbl.gov/blcr/doc/html/FAQ.html#prelink). The second would be to include the shared libraries in the checkpoint itself. While this is very costly in terms of storage, you may find it lets you restart in cases where you might not otherwise be able to. The trick is to add --save-private or --save-all to the checkpoint command that Open MPI uses to checkpoint the application processes. -Paul

Leonardo Fialho wrote:

Hi All, I'm trying to implement my FT architecture in Open MPI. Right now I need to restart a faulty process from a checkpoint. I saw that Josh uses orte-restart, which calls opal-restart through an ordinary mpirun invocation. That is no good for me because in this case the restarted process ends up in a new job. I need to restart the process checkpoint in the same job, on another node, under an existing orted. The checkpoints are taken without the "--term" option. My modified orted receives a "restart request" from my modified heartbeat mechanism. I have tried to restart using the BLCR cr_restart command. It does not work, I think because stderr/stdin/stdout were not handled by the opal environment. So I tried to restart the checkpoint by forking the orted and doing an execvp of opal-restart. It recovers the checkpoint, but after "opal_cr_init" it dies (*** Process received signal ***).
Here is the job structure (from ompi-ps) after a fault:

Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.
-------------|--------------|------------|-------|--------|---------|-------------
orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |
orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3]
orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3]
orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4]
orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1]

Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector
-------------|--------------|------------|------|-------|-----------|------------|--------------|-------------
./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2]
./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3]
./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4]
./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1]

The orted running on "nodo2" dies. The fault was detected by the orted [[8002,0],1] running on "nodo1" and reported to the HNP. The HNP updates the procs structure and looks for processes running on the faulty node, then sends a restart request to the orted which holds the checkpoint of the faulty processes.
Below is the log generated:

[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***

The orted which receives the restart request forks and then calls execvp for opal-restart, and then, unfortunately, it dies. I know that the restarted process should generate errors because the URI of its daemon is in
Re: [OMPI devel] adding new functions to a BTL
Ralf Wildenhues wrote:

Jeff Squyres wrote:

We use lt_dlopen() to open the plugins (Libtool's wrapper for a portable dlopen). It opens all plugins (DSOs) in a private scope. That private scope is kept deep in the OPAL MCA base and not exposed elsewhere in the code base. So if you manually dlopen a plugin again, I'll bet that the linker realizes that that DSO has already been loaded into the process space and doesn't actually load it again (but doesn't fail). So the dlsyms fail because you don't have access to the private scope from where Libtool originally opened the DSO.

Shouldn't it work to re-dlopen the lib with RTLD_GLOBAL?

I used dlopen("...", RTLD_LAZY | RTLD_GLOBAL). It gave me a non-NULL handle, but subsequent dlsyms failed.

Also, recent libltdl should allow you to choose which scope you want in the first place, local or global, through lt_dladvise.

I'm just learning all this dl stuff right now. Jeff's --enable-static seems to do exactly what I want (namely, make things work in the way that I'm familiar with!). I did try to figure out what OMPI was doing, and it seemed to me it was using RTLD_LAZY | RTLD_GLOBAL, which is why I chose that. For now, --enable-static seems to do exactly what I want. Further workarounds probably don't make any sense.
Re: [OMPI devel] adding new functions to a BTL
Hello Jeff, Eugene, > Jeff Squyres wrote: > >> We use lt_dlopen() to open the plugins (Libtool's wrapper for a >> portable dlopen). It opens all plugins (DSOs) in a private scope. >> That private scope is kept deep in the OPAL MCA base and not exposed >> elsewhere in the code base. So if you manually dlopen a plugin again, >> I'll bet that the linker realizes that that DSO has already been >> loaded into the process space and doesn't actually load it again (but >> doesn't fail). So the dlsyms fail because you don't have access to >> the private scope from where Libtool originally opened the DSO. Shouldn't it work to re-dlopen the lib with RTLD_GLOBAL? Also, recent libltdl should allow you to choose which scope you want in the first place, local or global, through lt_dladvise. Hope that helps. Cheers, Ralf
Re: [OMPI devel] adding new functions to a BTL
Jeff Squyres wrote: We use lt_dlopen() to open the plugins (Libtool's wrapper for a portable dlopen). It opens all plugins (DSOs) in a private scope. That private scope is kept deep in the OPAL MCA base and not exposed elsewhere in the code base. So if you manually dlopen a plugin again, I'll bet that the linker realizes that that DSO has already been loaded into the process space and doesn't actually load it again (but doesn't fail). So the dlsyms fail because you don't have access to the private scope from where Libtool originally opened the DSO. Make sense? Yes, I'm nodding my head vigorously (with a vacuous stare on my face). Mostly, I think you're very smart and I'm not! I get the general principles, but am unfamiliar with the details. Never mind: --enable-static is exactly the flavor of suggestion I was looking for. Thanks. I'm back in the saddle. Onward.
Re: [OMPI devel] Component open
Hmmm...interesting. I see what's going on - I'm having a build system issue that is causing some of the dynamic libraries to not be seen. Red herring - thanks for clarifying! Camille: thanks for fixing this way back when. Ralph

On Oct 22, 2008, at 1:17 PM, George Bosilca wrote:

Ralph, This problem was fixed long ago by some of the work Camille did. The exact revision number is r15402 (https://svn.open-mpi.org/trac/ompi/changeset/15402). I'm using this feature daily and so far I haven't had any problems with it. To reuse your example, here is what Camille came up with.

$ mpiexec --mca routed_base_verbose 30 -n 3 hostname
[dancer:09638] mca: base: components_open: Looking for routed components
[dancer:09638] mca: base: components_open: opening routed components
[dancer:09638] mca: base: components_open: found loaded component binomial
[dancer:09638] mca: base: components_open: component binomial has no register function
[dancer:09638] mca: base: components_open: component binomial has no open function
[dancer:09638] mca: base: components_open: found loaded component direct
[dancer:09638] mca: base: components_open: component direct has no register function
[dancer:09638] mca: base: components_open: component direct has no open function
[dancer:09638] mca: base: components_open: found loaded component linear
[dancer:09638] mca: base: components_open: component linear has no register function
[dancer:09638] mca: base: components_open: component linear has no open function
[dancer:09638] mca:base:select: Auto-selecting routed components
[...]
And if we force a specific component:

$ mpiexec --mca routed linear --mca routed_base_verbose 30 -n 3 hostname
[dancer:09642] mca: base: components_open: Looking for routed components
[dancer:09642] mca: base: components_open: opening routed components
[dancer:09642] mca: base: components_open: found loaded component linear
[dancer:09642] mca: base: components_open: component linear has no register function
[dancer:09642] mca: base: components_open: component linear has no open function
[dancer:09642] mca:base:select: Auto-selecting routed components
[...]

I wonder what configuration options you're using? george.

On Oct 22, 2008, at 1:30 PM, Ralph Castain wrote:

I've been digging a little into optimization and found something that seems counterintuitive in the way OMPI is handling components. Specifically, if I specify a component I want used for a framework, OMPI still does a component load and open on every component in the framework - it only uses my specification during "select". Thus, the cmd line mpirun -mca routed linear still results in the loading and opening of the direct and binomial components - even though we have directed the framework not to use them. This causes us to waste memory when there is no possibility of a different component being selected. Is there a reason why "open" isn't using the mca params to guide the components it is loading? Ralph
Re: [OMPI devel] Component open
Ralph, This problem was fixed long ago by some of the work Camille did. The exact revision number is r15402 (https://svn.open-mpi.org/trac/ompi/changeset/15402). I'm using this feature daily and so far I haven't had any problems with it. To reuse your example, here is what Camille came up with.

$ mpiexec --mca routed_base_verbose 30 -n 3 hostname
[dancer:09638] mca: base: components_open: Looking for routed components
[dancer:09638] mca: base: components_open: opening routed components
[dancer:09638] mca: base: components_open: found loaded component binomial
[dancer:09638] mca: base: components_open: component binomial has no register function
[dancer:09638] mca: base: components_open: component binomial has no open function
[dancer:09638] mca: base: components_open: found loaded component direct
[dancer:09638] mca: base: components_open: component direct has no register function
[dancer:09638] mca: base: components_open: component direct has no open function
[dancer:09638] mca: base: components_open: found loaded component linear
[dancer:09638] mca: base: components_open: component linear has no register function
[dancer:09638] mca: base: components_open: component linear has no open function
[dancer:09638] mca:base:select: Auto-selecting routed components
[...]

And if we force a specific component:

$ mpiexec --mca routed linear --mca routed_base_verbose 30 -n 3 hostname
[dancer:09642] mca: base: components_open: Looking for routed components
[dancer:09642] mca: base: components_open: opening routed components
[dancer:09642] mca: base: components_open: found loaded component linear
[dancer:09642] mca: base: components_open: component linear has no register function
[dancer:09642] mca: base: components_open: component linear has no open function
[dancer:09642] mca:base:select: Auto-selecting routed components
[...]

I wonder what configuration options you're using? george.
On Oct 22, 2008, at 1:30 PM, Ralph Castain wrote:

I've been digging a little into optimization and found something that seems counterintuitive in the way OMPI is handling components. Specifically, if I specify a component I want used for a framework, OMPI still does a component load and open on every component in the framework - it only uses my specification during "select". Thus, the cmd line mpirun -mca routed linear still results in the loading and opening of the direct and binomial components - even though we have directed the framework not to use them. This causes us to waste memory when there is no possibility of a different component being selected. Is there a reason why "open" isn't using the mca params to guide the components it is loading? Ralph
Re: [OMPI devel] Comm_spawn limits
I can't swear to this because I haven't fully grokked it yet, but I believe the answer is:

1. If child jobs have completed, it won't hurt. I think the various subsystems clean up their bookkeeping when a job completes, so we could possibly reuse the number. Might be some race conditions we would have to resolve.

2. If child jobs haven't completed (which is the situation this particular user was attempting), then we would have a problem with jobid confusion. Once we get the procs launched, though, I'm not sure how much of a problem there is - would have to investigate. Could cause some bookkeeping problems for job completion.

Interesting possibility, though...consider it another option for now.

On Oct 22, 2008, at 12:53 PM, George Bosilca wrote:

What happens if we roll around with the counter? george.

On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:

There recently was activity on the mailing lists where someone was attempting to call comm_spawn 100,000 times. Setting aside the threading issues that were the focus of that exchange, the fact is that OMPI currently cannot handle that many comm_spawns. The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun
2. the lower 16 bits are a running counter identifying the specific job/launch for those procs

Thus, we are limited to 64k comm_spawns. Expanding this would require either revamping the entire way we handle jobs (e.g., removing the mpirun identifier - major effort), or expanding the orte_jobid_t from its current 32 bits to 64 bits. Is this a problem we want to address? Ralph
Re: [OMPI devel] Comm_spawn limits
What happens if we roll around with the counter? george.

On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:

There recently was activity on the mailing lists where someone was attempting to call comm_spawn 100,000 times. Setting aside the threading issues that were the focus of that exchange, the fact is that OMPI currently cannot handle that many comm_spawns. The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun
2. the lower 16 bits are a running counter identifying the specific job/launch for those procs

Thus, we are limited to 64k comm_spawns. Expanding this would require either revamping the entire way we handle jobs (e.g., removing the mpirun identifier - major effort), or expanding the orte_jobid_t from its current 32 bits to 64 bits. Is this a problem we want to address? Ralph
[OMPI devel] Comm_spawn limits
There recently was activity on the mailing lists where someone was attempting to call comm_spawn 100,000 times. Setting aside the threading issues that were the focus of that exchange, the fact is that OMPI currently cannot handle that many comm_spawns. The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun
2. the lower 16 bits are a running counter identifying the specific job/launch for those procs

Thus, we are limited to 64k comm_spawns. Expanding this would require either revamping the entire way we handle jobs (e.g., removing the mpirun identifier - major effort), or expanding the orte_jobid_t from its current 32 bits to 64 bits. Is this a problem we want to address? Ralph
Re: [OMPI devel] adding new functions to a BTL
George reminds me that I forgot to explain why you couldn't dlsym:

We use lt_dlopen() to open the plugins (Libtool's wrapper for a portable dlopen). It opens all plugins (DSOs) in a private scope. That private scope is kept deep in the OPAL MCA base and not exposed elsewhere in the code base. So if you manually dlopen a plugin again, I'll bet that the linker realizes that the DSO has already been loaded into the process space and doesn't actually load it again (but doesn't fail). So the dlsyms fail because you don't have access to the private scope from where Libtool originally opened the DSO. Make sense?

On Oct 22, 2008, at 1:04 PM, Eugene Loh wrote:

I'm trying to prototype an idea inside OMPI and am running into a problem. I want to add a new function to a BTL and to have the PML call this function. I can't just put such a function call into the PML (not even for my prototype) since the PML is loaded before the BTL, and so the PML will complain about a missing symbol. So the PML will just have to refer to the function symbolically, and I need to figure out the BTL function address "at the appropriate time" (after the BTL is loaded but before I need to call my function). I tried to dlopen the BTL (seemed successful... I got back a non-NULL handle), but dlsym can't seem to find any of the symbols in the BTL (not even ones that existed before I started any of my work). I can describe other things I tried or other things I think are supposed to work (but that I am reluctant to try), but let's cut to the chase: HELP! Please note that I'm a newbie OMPI developer, so I'm really interested in doing the simplest thing possible to try my prototype. I recognize that certain things will have to be done to add "real code" back to the code base, but at this point I'd prefer to defer difficult work and just test the ideas of my prototype.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] Direct routed module
Youpiii! george.

On Oct 21, 2008, at 4:53 PM, Ralph Castain wrote:

Hello all I am working on adding a new radix tree routed module and am simultaneously doing a little streamlining of the overall routed-related code for scalability. One thing that would help clean up several areas of the code base would be to finally dump the "direct" routed module. As you may recall, this module has been kept for historical purposes. It is not scalable since it requires that every process open a direct connection to every other process in the job. This is what pre-1.3 systems do. We originally left it alive for two reasons: (a) we wanted to have a fallback position while we developed the more scalable alternatives, and (b) the C/R code didn't support routed RML comm. The latter situation was resolved some months ago, and we have had plenty of validation of our routed comm system. Thus, if there are no objections by the end of the week, I will remove this module and clean up the code. Please let me know if this is a concern. Ralph
Re: [OMPI devel] adding new functions to a BTL
Short answer because we're all still in Chicago...

Terry tells me that you're just hacking around trying to see what works, etc. So adding direct calls to the BTL in this kind of scenario is ok. I'm sure you're aware that this is not good for real code. :-) To directly call a BTL function, you might just want to configure OMPI with --enable-static; this will suck all the plugins into libmpi, and therefore all symbols are directly available at link time. There are other, more elegant ways for this hackaround, but if you're just playing/testing, this is probably good enough.

On Oct 22, 2008, at 1:04 PM, Eugene Loh wrote:

I'm trying to prototype an idea inside OMPI and am running into a problem. I want to add a new function to a BTL and to have the PML call this function. I can't just put such a function call into the PML (not even for my prototype) since the PML is loaded before the BTL, and so the PML will complain about a missing symbol. So the PML will just have to refer to the function symbolically, and I need to figure out the BTL function address "at the appropriate time" (after the BTL is loaded but before I need to call my function). I tried to dlopen the BTL (seemed successful... I got back a non-NULL handle), but dlsym can't seem to find any of the symbols in the BTL (not even ones that existed before I started any of my work). I can describe other things I tried or other things I think are supposed to work (but that I am reluctant to try), but let's cut to the chase: HELP! Please note that I'm a newbie OMPI developer, so I'm really interested in doing the simplest thing possible to try my prototype. I recognize that certain things will have to be done to add "real code" back to the code base, but at this point I'd prefer to defer difficult work and just test the ideas of my prototype.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] OOB-TCP Retries
Sorry for the delayed response - had some things to finish, then had to stare at this code for awhile. Unfortunately, the OOB is a snarled can of hideous worms. It looks to me like the OOB continues to attempt to complete any pending message requests once it detects that retries have exceeded the limit. In doing so, it looks like it triggers pending events, which would include pending sends - thus causing it to again emit that error message. I can't swear to any of this, of course - the worms are really deep and tangled down there.

A rewrite of the OOB is planned for next year - hopefully, the last of the spaghetti to be unraveled. Not sure if that will really happen, though, as I think everyone is afraid of that black hole of despair. If it does, this is one thing we can try to address. Any volunteers?? Ralph

On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:

Hi All, I'm doing some experiments and modifications in my heartbeat code, which uses the OOB-TCP communication channel. My modified orteds and orterun do not abort all processes when one orted dies. The problem is:

1) I kill an orted, so another orted detects the fault when it tries to send a heartbeat to the faulty orted.
2) The RTE gets stable again, but the orted which sent the heartbeat prints the following oob-tcp message: "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication retries exceeded. Can not communicate with peer"

And the questions are:
a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown, it discards this peer, no?
b) The message is removed from the queue with the ORTE_ERR_UNREACH code, no?
c) Why, after retries are exceeded, does the orted continue to print this message?

Thanks,

-- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edificio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478
Re: [OMPI devel] Direct routed module
Sounds good to me.

On Oct 21, 2008, at 3:53 PM, Ralph Castain wrote:

Hello all I am working on adding a new radix tree routed module and am simultaneously doing a little streamlining of the overall routed-related code for scalability. One thing that would help clean up several areas of the code base would be to finally dump the "direct" routed module. As you may recall, this module has been kept for historical purposes. It is not scalable since it requires that every process open a direct connection to every other process in the job. This is what pre-1.3 systems do. We originally left it alive for two reasons: (a) we wanted to have a fallback position while we developed the more scalable alternatives, and (b) the C/R code didn't support routed RML comm. The latter situation was resolved some months ago, and we have had plenty of validation of our routed comm system. Thus, if there are no objections by the end of the week, I will remove this module and clean up the code. Please let me know if this is a concern. Ralph

-- Jeff Squyres Cisco Systems
[OMPI devel] adding new functions to a BTL
I'm trying to prototype an idea inside OMPI and am running into a problem. I want to add a new function to a BTL and to have the PML call this function. I can't just put such a function call into the PML (not even for my prototype) since the PML is loaded before the BTL and so the PML will complain about a missing symbol. So, the PML will just have to refer to the function symbolically and I need to figure out the BTL function address "at the appropriate time" (after the BTL is loaded but before I need to call my function). I tried to dlopen the BTL (seemed successful... I got back a non-NULL handle), but dlsym can't seem to find any of the symbols in the BTL (not even ones that existed before I started any of my work). I can describe other things I tried or other things I think are supposed to work (but that I am reluctant to try), but let's cut to the chase: HELP! Please note that I'm a newbie OMPI developer and so I'm really interested in doing the simplest thing possible to try my prototype. I recognize that certain things will have to be done to add "real code" back to the code base, but at this point I'd prefer to defer difficult work and just test the ideas of my prototype.
Re: [OMPI devel] -display-map
Ralph, I guess the issue for us is that we will have to run two commands to get the information we need: one to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical for all of this to be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from host information. In any case, we'll work with whatever you think is best. Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something? If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do. Thanks Ralph

On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:

Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem. Thanks, Greg

On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:

Sorry for the delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea.
Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs? Ralph

On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:

Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible? Greg

On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:

Sorry for the delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one, for two reasons:

1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent.
In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This is why we retain the value from gethostname instead of allowing it to be overwritten by the name in whatever allocation we are given. Using the name in the hostfile would require that I either find some way to remember any prior name, or tear down and rebuild the session directory tree - neither seems attractive or simple (e.g., what happens when the user provides multiple entries in the hostfile for the node, each with a different IP address based on a different interface in that node? Sounds crazy, but we have already seen it done - which one do I use?). 2. We don't actually store the hostfile info anywhere - we just use it and forget it. For us to add an XML attribute containing any host
[OMPI devel] Component open
I've been digging a little into optimization and found something that seems counterintuitive in the way OMPI is handling components. Specifically, if I specify a component I want used for a framework, OMPI still does a component load and open on every component in the framework - it only uses my specification during "select". Thus, the cmd line "mpirun -mca routed linear" still results in the loading and opening of the direct and binomial components - even though we have directed the framework not to use them. This causes us to waste memory when there is no possibility of a different component being selected. Is there a reason why "open" isn't using the mca params to guide the components it is loading? Ralph
[OMPI devel] Restarting processes on different node
Hi All, I'm trying to implement my FT architecture in Open MPI. Right now I need to restart a faulty process from a checkpoint. I saw that Josh uses orte-restart, which calls opal-restart through an ordinary mpirun call. That's no good for me because in that case the restarted process becomes a new job. I need to restart the process checkpoint in the same job, on another node, under an existing orted. The checkpoints are taken without the "--term" option. My modified orted receives a "restart request" from my modified heartbeat mechanism. I have tried to restart using the BLCR cr_restart command. It does not work, I think because stderr/stdin/stdout were not handled by the opal environment. So I tried to restart the checkpoint by forking the orted and doing an execvp of opal-restart. It recovers the checkpoint, but after "opal_cr_init" it dies (*** Process received signal ***). Below is the job structure (from ompi-ps) after a fault:

Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.
------------------------------------------------------------------------------
orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |
orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3]
orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3]
orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4]
orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1]

Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector
-----------------------------------------------------------------------------------------------------------
./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2]
./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3]
./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4]
./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1]

The orted running on "nodo2" dies. This was detected by the orted [[8002,0],1] running on "nodo1" and reported to the HNP.
The HNP updates the procs structure and looks for processes that were running on the faulty node, then sends a restart request to the orted which holds the checkpoint of the faulty processes. Below is the log generated:

[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***

The orted which receives the restart request forks and then calls execvp on opal-restart; unfortunately, the restarted process then dies. I know the restarted process should generate errors, because the URI of its daemon is incorrect, as are other environment variables, but I would expect a communication error, or some kind of error other than the process being killed. My question is: why does this process die? I suspect the checkpoint contains pointers to libraries that are not loaded, or are loaded at a different memory position (because the checkpoint comes from another node). In that case, shouldn't the error be a segmentation fault or something similar?
If somebody has some information or can give me some help with this error, I'll be grateful. Thanks -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edificio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478