[OMPI devel] static rate / connection modularity
I'm about ready to start on the connection modularity stuff in the openib BTL. A few changes are getting rolled up in this work:

1. Modularize the connection scheme in the openib BTL as per previous discussions (use function pointers to choose between the current OOB wireup and the RDMA CM -- I'll probably do just a skeleton of the RDMA CM at first, to be filled in later). Preliminary prototypes of this work in a /tmp branch showed that it cleaned up btl_openib_endpoint.c a *lot*.

2. [Re-]Fix the problem with having heterogeneous numbers of ports across hosts (it seems to be broken again -- bonk).

3. Remove the static rate MCA parameter and instead have the endpoints negotiate (either in the modex or at wireup time -- whichever works best) to use the speed of the slower port.

-- Jeff Squyres Cisco Systems
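For reference, a minimal sketch of the function-pointer style of connection selection being described -- all names, fields, and signatures below are illustrative stand-ins, not the actual openib BTL symbols:

/* Hypothetical sketch: pluggable wireup for the openib BTL.            */
/* Each connection scheme (OOB wireup, RDMA CM) fills in one module and */
/* the BTL picks one at init time instead of hard-coding the OOB path.  */
#include <stdio.h>

struct endpoint;                               /* stand-in for the real endpoint type */

typedef struct connect_module {
    const char *name;                          /* e.g. "oob" or "rdma_cm"             */
    int (*start_connect)(struct endpoint *);   /* begin wireup to a peer              */
    int (*finalize)(void);                     /* tear the module down                */
} connect_module_t;

static int oob_start(struct endpoint *ep)    { (void) ep; return 0; }
static int oob_fini(void)                    { return 0; }
static int rdmacm_start(struct endpoint *ep) { (void) ep; return 0; /* skeleton */ }
static int rdmacm_fini(void)                 { return 0; }

static connect_module_t connect_oob     = { "oob",     oob_start,    oob_fini    };
static connect_module_t connect_rdma_cm = { "rdma_cm", rdmacm_start, rdmacm_fini };

static connect_module_t *pick_connector(int want_rdma_cm)
{
    return want_rdma_cm ? &connect_rdma_cm : &connect_oob;
}

int main(void)
{
    connect_module_t *c = pick_connector(0);   /* default to the current OOB wireup */
    printf("using %s wireup\n", c->name);
    return c->start_connect(NULL);
}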
[OMPI devel] openib credits problem
I got a problem in MTT runs last night with the openib BTL w.r.t. credits:

[...lots of IMB output...]
  #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
       0         1000       367.66       371.58       369.34        0.00
IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion `endpoint->qps[qp].u.pp_qp.rd_credits < rd_num' failed.

Gleb -- you've been mucking around in here recently; did something you do cause this, perchance?

-- Jeff Squyres Cisco Systems
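(For background, the per-peer QPs use credit-based flow control of roughly the following shape; this is a simplified, assumed model for illustration, not the actual BTL code -- the real assertion above is on the receiver-side rd_credits counter.)

/* Simplified, assumed sketch of credit-based flow control: the receiver  */
/* pre-posts rd_num buffers for a peer, and the sender may only have that */
/* many messages outstanding until credits come back.                     */
#include <assert.h>

struct peer_credits {
    int rd_num;        /* receive buffers the peer pre-posted for us      */
    int send_credits;  /* how many more messages we may send right now    */
};

static void on_send(struct peer_credits *p)
{
    assert(p->send_credits > 0);   /* caller must queue the send otherwise */
    p->send_credits--;             /* one of the peer's buffers now in use */
}

static void on_credit_return(struct peer_credits *p, int returned)
{
    p->send_credits += returned;              /* peer reposted buffers     */
    assert(p->send_credits <= p->rd_num);     /* can never exceed posted   */
}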
Re: [OMPI devel] openib credits problem
On Thu, Jul 26, 2007 at 09:12:26AM -0400, Jeff Squyres wrote: > I got a problem in MTT runs last night with the openib BTL w.r.t. > credits: > > [...lots of IMB output...] > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] > Mbytes/sec > 0 1000 367.66 371.58 > 369.34 0.00 > IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion `endpoint->qps > [qp].u.pp_qp.rd_credits < rd_num' failed. > > Gleb -- you've been mucking around in here recently; did something > you do cause this, perchance? > This is definitely caused by something I did. I am not sure this assert is valid though. I am looking into it. -- Gleb.
Re: [OMPI devel] Hostfiles - yet again
Aurelien Bouteiller wrote: Hi Ralph and everyone, I just want to make sure the proposed use cases do not break one of the current Open MPI features I require. For FT purposes, I need to get some specific hosts (let's say with a better MTBF). Those hosts are not part of the MPI_COMM_WORLD but are used to deploy FT services (like event loggers, checkpoint servers, etc). To enable collaboration between computing nodes and those FT services, I use the usual MPI2 dynamics with MPI_Accept/Connect. This means that those different instances of mpirun need to share the same orte registry, so that they can establish the MPI2 connect/accept through the registered MPI_ports.

With this background in place, my first concern is how the deployment maps to the allocated resources. The nodes used to deploy FT services are "special". In the typical use case, I get machines with better MTBF, faster or larger disks by requesting special properties from the resource allocation manager. I don't want those to be mixed with regular nodes in the resulting hostfile: these scarce resources should hold only FT services, no computing processes. As I understand things, I don't see any way to prevent mpirun from deploying application processes on my "special" nodes if they are part of the same launch/allocation in your "filtering" use case. Currently I run two different mpiruns with a single orte seed holding the registry. This way I get two different hostfiles, one for computing nodes, one for FT services. I just want to make sure everybody understood this requirement so that this feature does not disappear in the brainstorming :]

With the use of resource managers, --host, and --hostfile this should all be possible.

Next requirement is the ability to add some nodes to the initial pool during runtime. Because nodes may fail (but it is basically the same with comm_spawn), I might need some (lots of) spare nodes to replace failed ones. As I do not want to request twice as many nodes as I need (after all, things could just go fine, why should I waste that many computing resources on idle spares?), I definitely want to be able to allocate some new nodes to the pool of the already running machines. As far as I understand, this is impossible to achieve with use case 2 and quite difficult in use case 1. In my opinion, having the ability to spawn on nodes which are not part of the initial hostfile is a key feature (and not only for FT purposes).

I am looking for more detail on the above issue. What resource manager are you using? Ideally, we would prefer not to support this. Any nodes that you run on, or hope to run on, would be designated at the start. For example:

mpirun -np 1 --host a,b,c,d,e,f,g

This would cause the one process of the MPI job to start on host a. Then, the MPI job has the other hosts available to it should it decide later to start a job on them. However, no ORTE daemons would be started on those nodes until calls to MPI_Comm_spawn occur. So, the MPI job would not be consuming any resources until called upon to.

Rolf

I know there have been some extra discussions on this subject. Unfortunately it looks like I am not part of the list where it happened. I hope those concerns have not already been discussed.

Aurelien

Ralph H Castain wrote: Yo all As you know, I am working on revamping the hostfile functionality to make it work better with managed environments (at the moment, the two are exclusive). The issue that we need to review is how we want the interaction to work, both for the initial launch and for comm_spawn. 
In talking with Jeff, we boiled it down to two options that I have flow-charted (see attached): Option 1: in this mode, we read any allocated nodes provided by a resource manager (e.g., SLURM). These nodes establish a base pool of nodes that can be used by both the initial launch and any dynamic comm_spawn requests. The hostfile and any -host info is then used to select nodes from within that pool for use with the specific launch. The initial launch would use the -hostfile or -host command line option to provide that info - comm_spawn would use the MPI_Info fields to provide similar info. This mode has the advantage of allowing a user to obtain a large allocation, and then designate hosts within the pool for use by an initial application, and separately designate (via another hostfile or -host spec) another set of those hosts from the pool to support a comm_spawn'd child job. If no resource managed nodes are found, then the hostfile and -host options would provide the list of hosts to be used. Again, comm_spawn'd jobs would be able to specify their own hostfile and -host nodes. The negative to this option is complexity - in the absence of a managed allocation, I either have to deal with hostfile/dash-host allocations in the RAS and then again in RMAPS, or I have "allocation-like" functionality happening in RMAPS. Option 2: in this mode, we read an
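(For reference, host selection for a comm_spawn'd child via the MPI_Info fields mentioned above could look like the following minimal sketch; the hostnames and program name are made up, and error handling is omitted:)

/* Sketch: place the spawned children on specific hosts from the pool */
/* using the standard "host" info key.                                */
#include <mpi.h>

int spawn_on_subset(MPI_Comm *child)
{
    MPI_Info info;
    int errcodes[2];

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node03,node04");   /* subset of the pool */

    int rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info,
                            0 /* root */, MPI_COMM_WORLD, child, errcodes);
    MPI_Info_free(&info);
    return rc;
}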
Re: [OMPI devel] Hostfiles - yet again
On 7/26/07 7:33 AM, "rolf.vandeva...@sun.com" wrote: > Aurelien Bouteiller wrote: > >> Hi Ralph and everyone, >> >> I just want to make sure the proposed usecases does not break one of the >> current open MPI feature I require. For FT purposes, I need to get some >> specific hosts (lets say with a better MTBF). Those hosts are not part >> of the MPI_COMM_WORLD but are used to deploy FT services (like event >> loggers, checkpoint servers, etc). To enable collaboration between >> computing nodes and those FT services, I use the usual MPI2 Dynamics >> with MPI_Accept/Connect. This means that those different instances of >> mpirun needs to share the same orte registry, so that they can establish >> the MPI2 connect/accept trough the registered MPI_ports. >> >> This background in place, my first concern is how the deployment maps to >> the allocated resources. The nodes used to deploy FT services are >> "special". In typical usecase, I get machines with better MTFB, faster >> or larger disks by requesting special properties to the resources >> allocation manager. I don't want those to be mixed with regular nodes in >> the resulting hostfile: these scarce resources should hold only FT >> services, no computing processes. As I understand things, I don't see >> any way to avoid mpirun to deploy application processes on my "special" >> nodes if they are part of the same launch/allocation in your "filtering" >> usecase. Currently I proceed to two different mpirun with a single orte >> seed holding the registry. This way I get two different hostfiles, one >> for computing nodes, one for FT services. I just want to make sure >> everybody understood this requirement so that this feature does not >> disappear in the brainstorming :] >> >> > With the use of resource managers, --host, and --hostfile this should > all be possible. > I'll try to keep this in mind as we implement the changewill have to see if this ability really can be supported in the revision. I'll certainly let you know if I run into a conflict. >> Next requirement is the ability to add during runtime some nodes to the >> initial pool. Because node may fail (but it is the same with comm_spawn >> basically) , I might need some (lot of) spare nodes to replace failed >> ones. As I do not want to request for twice as many nodes as I need >> (after all, things could just go fine, why should I waste that many >> computing resources for idle spares ?), I definitely want to be able to >> allocate some new nodes to the pool of the already running machines. As >> far as I understand, this is impossible to achieve with the usecase2 and >> quite difficult in usecase1. In my opinion, having the ability to spawn >> on nodes which are not part of the initial hostfile is a key feature >> (and not only for FT purposes). >> >> >> > I am looking for more detail into the above issue. What > resource manager are you using? > > Ideally, we would prefer not to support this. Any nodes > that you run on, or hope to run on, would be designated > at the start. For example: > > mpirun -np 1 --host a,b,c,d,e,f,g > > This would cause the one process of the mpi job to start on host a. > Then, the mpi job has available to it the other hosts should it decide > later to start a job on them. However no ORTE daemons would > be started on those nodes until calls to MPI_Comm_spawn > occur. So, the MPI job would not be consuming any resources > until called upon to. 
This has actually been the subject of multiple threads on the user list and is considered a critical capability by some users and vendors. I believe there is little problem in allowing those systems that can support it to dynamically add nodes to ORTE via some API into the resource manager. At the moment, none of the RMs support it, but LSF will (and TM at least may) shortly do so, and some of their customers are depending upon it. The problem is that job startup could be delayed for significant time if all hosts must be preallocated. Admittedly, this raises all kinds of issues about how long the job could be stalled waiting for the new hosts. However, as the other somewhat exhaustive threads have discussed, there are computing models that can live with this uncertainty, and RMs that will provide async callbacks to allow the rest of the app to continue working while waiting. Just my $0.2 - again, this goes back to...are there use-cases and customers to which Open MPI is simply going to say "we won't support that"? > > Rolf > >> I know there have been some extra discussions on this subject. >> Unfortunately it looks like I am not part of the list where it happened. >> I hope those concerns have not been already discussed. >> >> Aurelien >> >> Ralph H Castain wrote: >> >> >>> Yo all >>> >>> As you know, I am working on revamping the hostfile functionality to make it >>> work better with managed environments (at the moment, the two are >>> exclusive). The issue that we need to review is
Re: [OMPI devel] openib credits problem
On Thu, Jul 26, 2007 at 04:29:40PM +0300, Gleb Natapov wrote: > On Thu, Jul 26, 2007 at 09:12:26AM -0400, Jeff Squyres wrote: > > I got a problem in MTT runs last night with the openib BTL w.r.t. > > credits: > > > > [...lots of IMB output...] > > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] > > Mbytes/sec > > 0 1000 367.66 371.58 > > 369.34 0.00 > > IMB-MPI1: ./btl_openib_endpoint.h:261: Assertion `endpoint->qps > > [qp].u.pp_qp.rd_credits < rd_num' failed. > > > > Gleb -- you've been mucking around in here recently; did something > > you do cause this, perchance? > > > This is definitely caused by something I did. I am not sure this assert > is valid though. I am looking into it. > Assertion is valid. r15635 should fix this. -- Gleb.
Re: [OMPI devel] MPI_ALLOC_MEM warning when requesting 0 (zero) bytes
On 7/25/07, Jeff Squyres wrote: Be sure to read this thread in order -- the conclusion of the thread was that we now actually *do* return NULL, per POSIX advice.

OK, I got confused. And now, is MPI_Free_mem going to fail with a NULL pointer? Not sure what POSIX says, but then OMPI should also follow its advice, right?

-- Lisandro Dalcín --- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594
Re: [OMPI devel] MPI_ALLOC_MEM warning when requesting 0 (zero) bytes
On Jul 26, 2007, at 12:42 PM, Lisandro Dalcin wrote: Be sure to read this thread in order -- the conclusion of the thread was that we now actually *do* return NULL, per POSIX advice. OK, I got confused. And now, MPI_Free_mem is going to fail with a NULL pointer? Not sure what POSIX says, but then OMPI should also follow it advice, right? It's not going to *fail* -- it's just going to return a NULL pointer if you ask for 0 bytes. This is perfectly valid according to POSIX's definition of free(). Also, passing NULL to MPI_FREE_MEM will now silently succeed (it will currently raise an MPI_ERR_ARG exception). -- Jeff Squyres Cisco Systems
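(A small example of the behavior described above, assuming the semantics in this thread -- a zero-byte request may legally hand back NULL, and passing NULL to MPI_FREE_MEM succeeds silently:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    void *buf = NULL;

    MPI_Init(&argc, &argv);

    MPI_Alloc_mem(0, MPI_INFO_NULL, &buf);
    printf("0-byte alloc returned %p\n", buf);   /* may be NULL, per POSIX-style advice */

    MPI_Free_mem(buf);                           /* no MPI_ERR_ARG even if buf is NULL  */

    MPI_Finalize();
    return 0;
}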
Re: [OMPI devel] [RFC] Sparse group implementation
Mohamad -

A couple of comments / questions:

1) Why do you need the #if OMPI_GROUP_SPARSE in communicator/comm.c? That seems like part of the API that should under no conditions change based on sparse/not sparse.

2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base.

3) Instead of the GROUP_GET_PROC_POINTER macro, why not just change the behavior of the ompi_group_peer_lookup() function, so that there is symmetry with how you get a proc from a communicator? Static inline functions (especially short ones like that) are basically free. We'll still have to fix all the places in the code where the macro is used or people poke directly at the group structure, but I like static inline over macros whenever possible. So much easier to debug.

Other than that, I think you've got my concerns pretty much addressed.

Brian

On Jul 25, 2007, at 8:45 PM, Mohamad Chaarawi wrote: In the current code, almost all #ifs are due to the fact that we don't want to add the extra memory used by the sparse parameters that are added to the group structure. The additional parameters are 5 pointers and 3 integers. If nobody objects, i would actually keep those extra parameters, even if sparse groups are disabled (in the default case on configure), because it would reduce the number of #ifs in the code to only 2 (as i recall that i had it before) ..

Thank you, Mohamad

On Wed, July 25, 2007 4:23 pm, Brian Barrett wrote: On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote: On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote: It just adds a lot of #if's throughout the code. Other than that, there's no reason to remove it. I agree, lots of #ifs are bad. But I guess I don't see the problem. The only really important thing people were directly accessing in the ompi_group_t is the array of proc pointers. Indexing into them could be done with a static inline function that just has slightly different time complexity based on compile options. A static inline function that just does an index into the group proc pointer array would have almost no added overhead (none if the compiler doesn't suck).

Ya, that's what I proposed. :-) But I did also propose removing the extra #if's so that the sparse group code would be available and we'd add an extra "if" in the critical code path. But we can do it this way instead: still use the MACRO to access proc_t's. In the --disable-sparse-groups scenario, have it map to comm.group.proc[i]. In the --enable-sparse-groups scenario, have it like I listed in the original proposal:

static inline ompi_proc_t *lookup_group(ompi_group_t *group, int index)
{
    if (group_is_dense(group)) {
        return group->procs[index];
    } else {
        return sparse_group_lookup(group, index);
    }
}

With: a) groups are always dense if --enable and an MCA parameter turns off sparse groups, or b) there's an added check in the inline function for whether the MCA parameter is on. I'm personally in favor of a) because it means only one conditional in the critical path. I don't really care about the sparse-groups-turned-on case. I just want minimal #ifs in the global code and to not have an if() { ... } in the critical path when sparse groups are disabled :). 
Brian

-- Mohamad Chaarawi Instructional Assistant http://www.cs.uh.edu/~mschaara Department of Computer Science, University of Houston 4800 Calhoun, PGH Room 526, Houston, TX 77204, USA
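(For illustration, option (b) from the proposal quoted above -- keeping the sparse code compiled in but checking a run-time MCA-style switch inside the inline lookup -- could look roughly like this; every name below is hypothetical, not the real OMPI symbols:)

/* Hypothetical sketch of option (b): sparse groups built in, with a    */
/* runtime switch preserving the one-load dense path when they are off. */
#include <stddef.h>

typedef struct ompi_proc ompi_proc_t;          /* opaque in this sketch     */

typedef struct group {
    int           grp_flags;                   /* includes a "dense" bit    */
    ompi_proc_t **grp_proc_pointers;           /* dense array, when present */
} group_t;

#define GROUP_DENSE 0x1

static int sparse_groups_enabled = 0;          /* stand-in MCA parameter    */

static ompi_proc_t *sparse_lookup(group_t *g, int rank)
{
    (void) g; (void) rank;                     /* slow path lives elsewhere */
    return NULL;
}

static inline ompi_proc_t *group_lookup(group_t *g, int rank)
{
    if (!sparse_groups_enabled || (g->grp_flags & GROUP_DENSE)) {
        return g->grp_proc_pointers[rank];     /* dense: a single load      */
    }
    return sparse_lookup(g, rank);             /* sporadic/strided/bitmap   */
}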
Re: [OMPI devel] [RFC] Sparse group implementation
On Thu, July 26, 2007 12:20 pm, Brian Barrett wrote:
> Mohamad -
>
> A couple of comments / questions:
>
> 1) Why do you need the #if OMPI_GROUP_SPARSE in communicator/comm.c? That seems like part of the API that should under no conditions change based on sparse/not sparse
>
I don't, there was one #if that i just removed.. but we do need to check in some cases, like in ompi_comm_get_rprocs, that we are not using the direct access to the pointers list. Like, for example:

if (OMPI_GROUP_IS_DENSE(local_comm->c_local_group)) {
    rc = ompi_proc_pack(local_comm->c_local_group->grp_proc_pointers,
                        local_size, sbuf);
}
/* get the proc list for the sparse implementations */
else {
    proc_list = (ompi_proc_t **) calloc(local_comm->c_local_group->grp_proc_count,
                                        sizeof(ompi_proc_t *));
    for (i = 0; i < local_comm->c_local_group->grp_proc_count; i++)
        proc_list[i] = GROUP_GET_PROC_POINTER(local_comm->c_local_group, i);
    rc = ompi_proc_pack(proc_list, local_size, sbuf);
}

Here, if sparse groups are disabled, we don't really want to allocate and set this list of pointers that already exists (not to waste more memory and time)..

> 2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base
>
That's what i actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group structure, and this will mean only 1, maybe 2, #ifs..

> 3) Instead of the GROUP_GET_PROC_POINTER macro, why not just change the behavior of the ompi_group_peer_lookup() function, so that there is symmetry with how you get a proc from a communicator? Static inline functions (especially short ones like that) are basically free. We'll still have to fix all the places in the code where the macro is used or people poke directly at the group structure, but I like static inline over macros whenever possible. So much easier to debug.

Actually i never knew till this morning that this function was in the code.. I have an inline function, ompi_group_lookup (which does the same thing), that actually checks if the group is dense or not and acts accordingly.. but to use the inline function instead of the macro means, again, that we need to compile in all the sparse parameters/code, which i'm for..

> Other than that, I think you've got my concerns pretty much addressed.
>
> Brian
>
> On Jul 25, 2007, at 8:45 PM, Mohamad Chaarawi wrote:
>> In the current code, almost all #ifs are due to the fact that we don't want to add the extra memory used by the sparse parameters that are added to the group structure. The additional parameters are 5 pointers and 3 integers. If nobody objects, i would actually keep those extra parameters, even if sparse groups are disabled (in the default case on configure), because it would reduce the number of #ifs in the code to only 2 (as i recall that i had it before) ..
>>
>> Thank you,
>>
>> Mohamad
>>
>> On Wed, July 25, 2007 4:23 pm, Brian Barrett wrote:
>>> On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote:
>>>> On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote:
>>>>> It just adds a lot of #if's throughout the code. Other than that, there's no reason to remove it.
>>>> I agree, lots of #ifs are bad. But I guess I don't see the problem. The only really important thing people were directly accessing in the ompi_group_t is the array of proc pointers. 
Indexing into them > could > be done with a static inline function that just has slightly > different time complexity based on compile options. Static inline > function that just does an index in the group proc pointer would > have > almost no added overhead (none if the compiler doesn't suck). Ya, that's what I proposed. :-) But I did also propose removing the extra #if's so that the sparse group code would be available and we'd add an extra "if" in the critical code path. But we can do it this way instead: Still use the MACRO to access proc_t's. In the --disable-sparse- groups scenario, have it map to comm.group.proc[i]. In the -- enable- sparse-groups scenario, have it like I listed in the original proposal: static inline ompi_proc_t lookup_group(ompi_group_t *group, int index) { if (group_is_dense(group)) { return group->procs[index]; } else { return sparse_group_lookup(group, index); } } With: a) groups are always dense if --enable and an MCA parameter turns off spa
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote:

2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base

That's what i actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group structure, and this will mean only 1, maybe 2, #ifs..

Why would this mean having the sparse parameters in the group structure?

3) Instead of the GROUP_GET_PROC_POINTER macro, why not just change the behavior of the ompi_group_peer_lookup() function, so that there is symmetry with how you get a proc from a communicator? Static inline functions (especially short ones like that) are basically free. We'll still have to fix all the places in the code where the macro is used or people poke directly at the group structure, but I like static inline over macros whenever possible. So much easier to debug.

Actually i never knew till this morning that this function was in the code.. I have an inline function, ompi_group_lookup (which does the same thing), that actually checks if the group is dense or not and acts accordingly.. but to use the inline function instead of the macro means, again, that we need to compile in all the sparse parameters/code, which i'm for..

No, it doesn't. Just have something like:

static inline ompi_proc_t *
ompi_group_lookup(ompi_group_t *group, int peer)
{
#if OMPI_GROUP_SPARSE
    /* big long lookup function for sparse groups here */
#else
    return group->grp_proc_pointers[peer];
#endif
}

Brian
Re: [OMPI devel] Hostfiles - yet again
Hi Aurelien Perhaps some bad news on this subject - see below. On 7/26/07 7:53 AM, "Ralph H Castain" wrote: > > > > On 7/26/07 7:33 AM, "rolf.vandeva...@sun.com" > wrote: > >> Aurelien Bouteiller wrote: >>> Currently I proceed to two different mpirun with a single orte >>> seed holding the registry. This way I get two different hostfiles, one >>> for computing nodes, one for FT services. I just want to make sure >>> everybody understood this requirement so that this feature does not >>> disappear in the brainstorming :] >>> >>> After some investigation, I'm afraid that I have to report that this - as far as I understand what you are doing - may no longer work in Open MPI in the future (and I'm pretty sure isn't working in the trunk today except [maybe] in the special case of hostfile - haven't verified that). To ensure we are correctly communicating, let me reiterate what I understand you are doing: 1. in one window, you start a persistent daemon. You then enter "mpirun" to that command line, specifying a hostfile (let's call it "foo" for now) and the universe used to start the persistent daemon. Thus, mpirun connects to that universe and runs within it. 2. in another window, you type "mpirun" to the command line, specifying a different hostfile ("bar") and again giving it the universe used to start the persistent daemon. Thus, both mpiruns are being "managed" by the same HNP (the persistent daemon). First, there are major issues here involving confusion over allocations and synchronization between the lifetimes of the two jobs started in this manner. You may not see those in hostfile-only use cases, but for managed environments, this proved to cause undesirable confusion over process placement and unexpected application failures. Accordingly, we have been working to eliminate this usage (although the trunk will currently still allow it in some cases). This was caused by mpirun itself processing its local environment and then "pushing" it into the global registry. Keeping everything separated causes a bookkeeper's headache and many lines of code that we would like to eliminate. The current future design only processes allocations at the HNP itself. Thus, the persistent daemon would only be capable of sensing its own local allocation - it cannot see an allocation obtained in a separate window/login. This unfortunately extends to hostfiles as well - the persistent daemon can process the hostfile provided on its command line or environment, but has no mechanism for reading another one. The exception to this is comm_spawn. Our current intent was to allow comm_spawn to specify a hostfile that could be read by the HNP and used for the child job. However, we are still discussing whether this hostfile should be allowed to "add" nodes to the known available resources, or only specify a subset of the already-known resource pool. I suspect we will opt for the latter interpretation as we otherwise open an entirely different set of complications. So I am not sure that you will be able to continue working this way. You may have to start your regular application with the larger pool of resources, specify the ones you want used for the application itself via -host, and then "comm_spawn" your FT services on the other nodes using -host in that launch. Alternatively, you could use the multiple app_context capability to start it all from the command line: mpirun -hostfile big_pool -n 10 -host 1,2,3,4 application : -n 2 -host 99,100 ft_server Hope that helps explain things. 
As I hope I have indicated, I -think- you will still be able to do what you described, but probably not the way you have been doing it. Please feel free to comment. If this is a big enough issue to a large enough audience, then we can try to find a way to solve it (assuming Open MPI's community decides to support it). Ralph > >>> Next requirement is the ability to add during runtime some nodes to the >>> initial pool. Because node may fail (but it is the same with comm_spawn >>> basically) , I might need some (lot of) spare nodes to replace failed >>> ones. As I do not want to request for twice as many nodes as I need >>> (after all, things could just go fine, why should I waste that many >>> computing resources for idle spares ?), I definitely want to be able to >>> allocate some new nodes to the pool of the already running machines. As >>> far as I understand, this is impossible to achieve with the usecase2 and >>> quite difficult in usecase1. In my opinion, having the ability to spawn >>> on nodes which are not part of the initial hostfile is a key feature >>> (and not only for FT purposes). >>> >>> >>> >> I am looking for more detail into the above issue. What >> resource manager are you using? >> >> Ideally, we would prefer not to support this. Any nodes >> that you run on, or hope to run on, would be designated >> at the start. For example: >> >> mpirun -np 1 --host a,b,c,d,e,f,g >> >> This would cause the on
Re: [OMPI devel] [RFC] Sparse group implementation
On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote: > On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote: > >>> 2) I think it would be better to always have the flags and macros >>> available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even >>> when sparse groups are disabled. They dont' take up any space, and >>> mean less #ifs in the general code base >>> >> That's what i actually was proposing.. keep the flags (there are no >> macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters >> in the >> group strucutre, and this will mean, only 1 maybe 2 #ifs.. > > Why would this mean having the sparse parameters in the group structure? not sure if i understood your question right, but in the group struct we added 5 integers and 3 pointer.. if we want to compile these out, we would then need all the #ifs around the code where we use these parameters.. > >>> 3) Instead of the GROU_GET_PROC_POINTER macro, why not just change >>> the behavior of the ompi_group_peer_lookup() function, so that there >>> is symmetry with how you get a proc from a communicator? static >>> inline functions (especially short ones like that) are basically >>> free. We'll still have to fix all the places in the code where the >>> macro is used or people poke directly at the group structure, but I >>> like static inline over macros whenever possible. So much easier t >>> debug. >> >> Actually i never knew till this morning that this function was in the >> code.. I have an inline function ompi_group_lookup (which does the >> same >> thing), that actually checks if the group is dense or not and act >> accordingly.. but to use the inline function instead of the macro, >> means >> again that we need to compile in all the sparse parameters/code, >> which im >> for.. > > No, it doesn't. Just have something like: > > static inline ompi_proc_t* > ompi_group_lookup(ompi_group_t *group, int peer) > { > #if OMPI_GROUP_SPARSE > /* big long lookup function for sparse groups here */ > #else > return group->grp_proc_pointers[peer] > #endif > } > ok, i guess i can do that... > Brian > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Mohamad Chaarawi Instructional Assistant http://www.cs.uh.edu/~mschaara Department of Computer ScienceUniversity of Houston 4800 Calhoun, PGH Room 526Houston, TX 77204, USA
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 26, 2007, at 1:01 PM, Mohamad Chaarawi wrote:

On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote: On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote: 2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base That's what i actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group structure, and this will mean only 1, maybe 2, #ifs.. Why would this mean having the sparse parameters in the group structure?

not sure if i understood your question right, but in the group struct we added 5 integers and 3 pointers.. if we want to compile these out, we would then need all the #ifs around the code where we use these parameters..

I don't follow why you would need all the sparse stuff in ompi_group_t when OMPI_GROUP_SPARSE is 0. The OMPI_GROUP_IS and OMPI_GROUP_SET macros only modify grp_flags, which is always present. Like ompi_group_peer_lookup, much can be hidden inside the functions rather than exposed through the interface, if you're concerned about the other functionality currently #if'ed in the code.

Brian
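(To make the point concrete: flag macros of this kind only ever touch grp_flags, so they cost nothing when the sparse storage itself is compiled out. The values and exact spellings below are illustrative, not necessarily the real definitions:)

#define OMPI_GROUP_SPORADIC        0x02
#define OMPI_GROUP_IS_SPORADIC(g)  ((g)->grp_flags & OMPI_GROUP_SPORADIC)
#define OMPI_GROUP_SET_SPORADIC(g) ((g)->grp_flags |= OMPI_GROUP_SPORADIC)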
Re: [OMPI devel] [RFC] Sparse group implementation
On Thu, July 26, 2007 2:07 pm, Brian Barrett wrote: > On Jul 26, 2007, at 1:01 PM, Mohamad Chaarawi wrote: > >> >> On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote: >>> On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote: >>> > 2) I think it would be better to always have the flags and macros > available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) > even > when sparse groups are disabled. They dont' take up any space, and > mean less #ifs in the general code base > That's what i actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group strucutre, and this will mean, only 1 maybe 2 #ifs.. >>> >>> Why would this mean having the sparse parameters in the group >>> structure? >> >> not sure if i understood your question right, but in the group >> struct we >> added 5 integers and 3 pointer.. if we want to compile these out, >> we would >> then need all the #ifs around the code where we use these parameters.. > > I don't follow why you would need all the sparse stuff in > ompi_group_t when OMPI_GROUP_SPARSE is 0. The OMPI_GROUP_IS and > OMPI_GROU_SET macros only modify grp_flags, which is always present. > I don't need them, right now they are compiled out.. but since they are, all functions using these parameters (example: translate_ranks_strided, the long lookup function) have to be also compiled out, and other common functions that changed (like translate ranks, where we now check if sparse groups are enabled so we use an easier translate_ranks corresponding to the storage type) have to have the #ifs to compile stuff out. > Like the ompi_group_peer_lookup, much can be hidden inside the > functions rather than exposed through the interface, if you're > concerned about the other functionality currently #if'ed in the code. > in the ompi_group_peer_lookup that u suggested, we have an #if, so the same way with all functions that use sparse parameters.. I think i get what u are saying, Im don't want any functionality from including the sparse stuff when they are disabled, just easier code to look at.. > Brian > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Mohamad Chaarawi Instructional Assistant http://www.cs.uh.edu/~mschaara Department of Computer ScienceUniversity of Houston 4800 Calhoun, PGH Room 526Houston, TX 77204, USA
Re: [OMPI devel] Hostfiles - yet again
Ralph H Castain wrote: After some investigation, I'm afraid that I have to report that this - as far as I understand what you are doing - may no longer work in Open MPI in the future (and I'm pretty sure isn't working in the trunk today except [maybe] in the special case of hostfile - haven't verified that). To ensure we are correctly communicating, let me reiterate what I understand you are doing: Correct. Also consider that for my testing I use a batch scheduler that is not managed by orte right now and provide myself the hostfiles (This batch scheduler is named OAR and is in use on the grid5000 research facility in France). This was caused by mpirun itself processing its local environment and then "pushing" it into the global registry. Keeping everything separated causes a bookkeeper's headache and many lines of code that we would like to eliminate. I see the point. I Agree there is very few benefit at allowing users to have different local environments on different mpirun instances; while it should be a real pain to have a clean code managing this. For my sole usage, the app_context feature you described is a more elegant and equivalent way of spawning my FT services. I will switch to this right away. Still it might be of some use to be able to start different mpirun the same way you plan comm_spawn to work: sharing the same environment, but allowing for use of a different hostfile. The use case that comes in mind is "grid", where different batch schedulers are in use on each clusters, so you can't gather a single hostfile. This is not a feature I would fight for, but I can imagine some people might find it useful. More important for me is the ability to refill the hostfile with fresh hosts when some of the original ones died. Allocating an huge amount of spares preventively is just not the correct way to go. On the side I am not sure that even the best comm_spawn you discussed could be of some help in this case as I do not want the new nodes to go in a different COMM_WORLD. Finding a way to update the registry and all the orted to do so is a much larger issue than simple spawning and I have not been really thinking about it for now. Maybe we should discuss this issue separately. Aurelien Please feel free to comment. If this is a big enough issue to a large enough audience, then we can try to find a way to solve it (assuming Open MPI's community decides to support it). Ralph Next requirement is the ability to add during runtime some nodes to the initial pool. Because node may fail (but it is the same with comm_spawn basically) , I might need some (lot of) spare nodes to replace failed ones. As I do not want to request for twice as many nodes as I need (after all, things could just go fine, why should I waste that many computing resources for idle spares ?), I definitely want to be able to allocate some new nodes to the pool of the already running machines. As far as I understand, this is impossible to achieve with the usecase2 and quite difficult in usecase1. In my opinion, having the ability to spawn on nodes which are not part of the initial hostfile is a key feature (and not only for FT purposes). I am looking for more detail into the above issue. What resource manager are you using? Ideally, we would prefer not to support this. Any nodes that you run on, or hope to run on, would be designated at the start. For example: mpirun -np 1 --host a,b,c,d,e,f,g This would cause the one process of the mpi job to start on host a. 
Then, the mpi job has available to it the other hosts should it decide later to start a job on them. However no ORTE daemons would be started on those nodes until calls to MPI_Comm_spawn occur. So, the MPI job would not be consuming any resources until called upon to. This has actually been the subject of multiple threads on the user list and is considered a critical capability by some users and vendors. I believe there is little problem in allowing those systems that can support it to dynamically add nodes to ORTE via some API into the resource manager. At the moment, none of the RMs support it, but LSF will (and TM at least may) shortly do so, and some of their customers are depending upon it. The problem is that job startup could be delayed for significant time if all hosts must be preallocated. Admittedly, this raises all kinds of issues about how long the job could be stalled waiting for the new hosts. However, as the other somewhat exhaustive threads have discussed, there are computing models that can live with this uncertainty, and RMs that will provide async callbacks to allow the rest of the app to continue working while waiting. Just my $0.2 - again, this goes back to...are there use-cases and customers to which Open MPI is simply going to say "we won't support that"? Rolf I know there have been some extra discussions on this subject. Unfortunately it looks like I am not
Re: [OMPI devel] Hostfiles - yet again
On 7/26/07 2:24 PM, "Aurelien Bouteiller" wrote: > Ralph H Castain wrote: >> After some investigation, I'm afraid that I have to report that this - as >> far as I understand what you are doing - may no longer work in Open MPI in >> the future (and I'm pretty sure isn't working in the trunk today except >> [maybe] in the special case of hostfile - haven't verified that). >> >> To ensure we are correctly communicating, let me reiterate what I understand >> you are doing: >> > Correct. Also consider that for my testing I use a batch scheduler that > is not managed by orte right now and provide myself the hostfiles (This > batch scheduler is named OAR and is in use on the grid5000 research > facility in France). > >> This was caused by mpirun itself processing its local environment and then >> "pushing" it into the global registry. Keeping everything separated causes a >> bookkeeper's headache and many lines of code that we would like to >> eliminate. >> >> > I see the point. I Agree there is very few benefit at allowing users to > have different local environments on different mpirun instances; while > it should be a real pain to have a clean code managing this. For my sole > usage, the app_context feature you described is a more elegant and > equivalent way of spawning my FT services. I will switch to this right > away. > > Still it might be of some use to be able to start different mpirun the > same way you plan comm_spawn to work: sharing the same environment, but > allowing for use of a different hostfile. The use case that comes in > mind is "grid", where different batch schedulers are in use on each > clusters, so you can't gather a single hostfile. This is not a feature I > would fight for, but I can imagine some people might find it useful. One of the design changes we made was to explicitly not support multi-cluster operations from inside of Open MPI. Instead, people (not us) are looking at adding a layer on top of Open MPI to handle the cross-cluster coordination. I expect you'll hear more about those efforts in the not-too-distant future. > > More important for me is the ability to refill the hostfile with fresh > hosts when some of the original ones died. Allocating an huge amount of > spares preventively is just not the correct way to go. On the side I am > not sure that even the best comm_spawn you discussed could be of some > help in this case as I do not want the new nodes to go in a different > COMM_WORLD. Finding a way to update the registry and all the orted to do > so is a much larger issue than simple spawning and I have not been > really thinking about it for now. Maybe we should discuss this issue > separately. Ah, now -that- is a different topic indeed. I do plan to support a dynamic add_hosts API as part of the revamped system. I'll try to flesh that out as a separate RFC later. Thanks Ralph > > Aurelien >> Please feel free to comment. If this is a big enough issue to a large enough >> audience, then we can try to find a way to solve it (assuming Open MPI's >> community decides to support it). >> >> Ralph >> >> >> > Next requirement is the ability to add during runtime some nodes to the > initial pool. Because node may fail (but it is the same with comm_spawn > basically) , I might need some (lot of) spare nodes to replace failed > ones. 
As I do not want to request for twice as many nodes as I need > (after all, things could just go fine, why should I waste that many > computing resources for idle spares ?), I definitely want to be able to > allocate some new nodes to the pool of the already running machines. As > far as I understand, this is impossible to achieve with the usecase2 and > quite difficult in usecase1. In my opinion, having the ability to spawn > on nodes which are not part of the initial hostfile is a key feature > (and not only for FT purposes). > > > > I am looking for more detail into the above issue. What resource manager are you using? Ideally, we would prefer not to support this. Any nodes that you run on, or hope to run on, would be designated at the start. For example: mpirun -np 1 --host a,b,c,d,e,f,g This would cause the one process of the mpi job to start on host a. Then, the mpi job has available to it the other hosts should it decide later to start a job on them. However no ORTE daemons would be started on those nodes until calls to MPI_Comm_spawn occur. So, the MPI job would not be consuming any resources until called upon to. >>> This has actually been the subject of multiple threads on the user list and >>> is considered a critical capability by some users and vendors. I believe >>> there is little problem in allowing those systems that can support it to >>> dynamically add nodes to ORTE via some API into the resource manager. At the >>> moment, none of th
Re: [OMPI devel] Hostfiles - yet again
mpirun -hostfile big_pool -n 10 -host 1,2,3,4 application : -n 2 -host 99,100 ft_server

This will not work: this is a way to launch MIMD jobs that share the same COMM_WORLD, not a way to launch two different applications that interact through Accept/Connect. Direct consequences on the simple NAS benchmarks are:
* if the second command does not use MPI_Init, then the first application locks forever in MPI_Init
* if both use MPI_Init, the MPI_Comm_size of the jobs is incorrect.

bouteill@dancer:~$ ompi-build/debug/bin/mpirun -prefix /home/bouteill/ompi-build/debug/ -np 4 -host node01,node02,node03,node04 NPB3.2-MPI/bin/lu.A.4 : -np 1 -host node01 NPB3.2-MPI/bin/mg.A.1

NAS Parallel Benchmarks 3.2 -- LU Benchmark
Warning: program is running on 5 processors but was compiled for 4
Size: 64x 64x 64
Iterations: 250
Number of processes: 5
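(For contrast with the colon-separated MIMD launch above, the Accept/Connect pattern being discussed uses two independently launched jobs that rendezvous through a port instead of sharing a COMM_WORLD. A minimal sketch, with error handling and the shared name-service details omitted:)

#include <mpi.h>

/* FT-service side: its own mpirun, its own MPI_COMM_WORLD */
void ft_service(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("ft_service", MPI_INFO_NULL, port);  /* needs a shared name server/registry */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... serve event-log / checkpoint traffic over 'client' ... */
    MPI_Comm_disconnect(&client);
    MPI_Unpublish_name("ft_service", MPI_INFO_NULL, port);
    MPI_Close_port(port);
}

/* Compute side: a separate mpirun / COMM_WORLD */
void compute_node(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm ft;

    MPI_Lookup_name("ft_service", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &ft);
    /* ... talk to the FT services over 'ft' ... */
    MPI_Comm_disconnect(&ft);
}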