[OMPI devel] pmi2 slurm/openmpi patch
Hello,

I think there are a few things still missing in openmpi pmi2 to make it work with slurm. We are the ones at Bull who integrated the pmi2 code from mpich2 to slurm. The attached patch should fix the issue (call slurm with --mpi=pmi2). This still needs to be checked with other pmi2 implementations (we use pmi2.h but some use pmi.h? constants are prefixed with PMI2_ but some use PMI_?).

Piotr Lesnicki

diff --git a/opal/mca/common/pmi/common_pmi.c b/opal/mca/common/pmi/common_pmi.c
--- a/opal/mca/common/pmi/common_pmi.c
+++ b/opal/mca/common/pmi/common_pmi.c
@@ -25,6 +25,8 @@
 #include "common_pmi.h"

 static int mca_common_pmi_init_count = 0;
+static int mca_common_pmi_init_size = 0;
+static int mca_common_pmi_init_rank = 0;

 bool mca_common_pmi_init (void) {
     if (0 < mca_common_pmi_init_count++) {
@@ -41,6 +43,8 @@
     }

     if (PMI_SUCCESS != PMI2_Init(&spawned, &size, &rank, &appnum)) {
+        mca_common_pmi_init_size = size;
+        mca_common_pmi_init_rank = rank;
         mca_common_pmi_init_count--;
         return false;
     }
@@ -107,3 +111,23 @@
     }
     return err_msg;
 }
+
+
+bool mca_common_pmi_rank(int *rank) {
+#ifndef WANT_PMI2_SUPPORT
+    if (PMI_SUCCESS != (ret = PMI_Get_rank(&mca_common_pmi_rank)))
+        return false;
+#endif
+    *rank = mca_common_pmi_init_rank;
+    return true;
+}
+
+
+bool mca_common_pmi_size(int *size) {
+#ifndef WANT_PMI2_SUPPORT
+    if (PMI_SUCCESS != (ret = PMI_Get_universe_size(&mca_common_pmi_size)))
+        return false;
+#endif
+    *size = mca_common_pmi_init_size;
+    return true;
+}
diff --git a/opal/mca/common/pmi/common_pmi.h b/opal/mca/common/pmi/common_pmi.h
--- a/opal/mca/common/pmi/common_pmi.h
+++ b/opal/mca/common/pmi/common_pmi.h
@@ -42,3 +42,6 @@
 OPAL_DECLSPEC char* opal_errmgr_base_pmi_error(int pmi_err);

 #endif
+
+bool mca_common_pmi_rank(int *rank);
+bool mca_common_pmi_size(int *size);
diff --git a/orte/mca/ess/pmi/ess_pmi_module.c b/orte/mca/ess/pmi/ess_pmi_module.c
--- a/orte/mca/ess/pmi/ess_pmi_module.c
+++ b/orte/mca/ess/pmi/ess_pmi_module.c
@@ -38,6 +38,9 @@
 #endif
 #include <pmi.h>
+#ifdef WANT_PMI2_SUPPORT
+#include <pmi2.h>
+#endif

 #include "opal/util/opal_environ.h"
 #include "opal/util/output.h"
@@ -126,7 +129,7 @@
     }
     ORTE_PROC_MY_NAME->jobid = jobid;

     /* get our rank from PMI */
-    if (PMI_SUCCESS != (ret = PMI_Get_rank(&i))) {
+    if (!(ret = mca_common_pmi_rank(&i))) {
         OPAL_PMI_ERROR(ret, "PMI_Get_rank");
         error = "could not get PMI rank";
         goto error;
@@ -134,7 +137,7 @@
     ORTE_PROC_MY_NAME->vpid = i + 1;  /* compensate for orterun */

     /* get the number of procs from PMI */
-    if (PMI_SUCCESS != (ret = PMI_Get_universe_size(&i))) {
+    if (!(ret = mca_common_pmi_size(&i))) {
         OPAL_PMI_ERROR(ret, "PMI_Get_universe_size");
         error = "could not get PMI universe size";
         goto error;
@@ -148,6 +151,14 @@
             goto error;
         }
     } else {  /* we are a direct-launched MPI process */
+#ifdef WANT_PMI2_SUPPORT
+        /* Get domain id */
+        pmi_id = malloc(PMI2_MAX_VALLEN);
+        if (PMI_SUCCESS != (ret = PMI2_Job_GetId(pmi_id, PMI2_MAX_VALLEN))) {
+            error = "PMI2_Job_GetId failed";
+            goto error;
+        }
+#else
         /* get our PMI id length */
         if (PMI_SUCCESS != (ret = PMI_Get_id_length_max(&pmi_maxlen))) {
             error = "PMI_Get_id_length_max";
             goto error;
@@ -159,6 +170,7 @@
             error = "PMI_Get_kvs_domain_id";
             goto error;
         }
+#endif
         /* PMI is very nice to us - the domain id is an integer followed
          * by a '.', followed by essentially a stepid. The first integer
          * defines an overall job number. The second integer is the number of
@@ -180,20 +192,22 @@
         ORTE_PROC_MY_NAME->jobid = ORTE_CONSTRUCT_LOCAL_JOBID(jobfam << 16, stepid);

         /* get our rank */
-        if (PMI_SUCCESS != (ret = PMI_Get_rank(&i))) {
+        if (!(ret = mca_common_pmi_rank(&i))) {
             OPAL_PMI_ERROR(ret, "PMI_Get_rank");
             error = "could not get PMI rank";
             goto error;
         }
         ORTE_PROC_MY_NAME->vpid = i;
+        int rank = i;

         /* get the number of procs from PMI */
-        if (PMI_SUCCESS != (ret = PMI_Get_universe_size(&i))) {
+        if (!(ret = mca_common_pmi_size(&i))) {
             OPAL_PMI_ERROR(ret, "PMI_Get_universe_size");
             error = "could not get PMI universe size";
             goto error;
         }
         orte_process_info.num_procs = i;
+        int size = i;

         /* push into the environ for pickup in MPI layer for
          * MPI-3 required info key */
@@ -245,6 +259,42 @@
             goto error;
         }

+#ifdef WANT_PMI2_SUPPORT
+        /* get our local proc info to find our local rank */
+        char *pmapping = malloc(PMI2_MAX_VALLEN);
+
Re: [OMPI devel] ompi_info
On Jul 17, 2013, at 20:15 , "Jeff Squyres (jsquyres)" wrote: > On Jul 17, 2013, at 12:16 PM, Nathan Hjelm wrote: > >> As Ralph suggested you need to pass the --level or -l option to see all the >> variables. --level 9 will print everything. If you think there are variables >> everyday users should see you are welcome to change them to OPAL_INFO_LVL_1. >> We are trying to avoid moving too many variables to this info level. > > I think George might have a point here, though. He was specifically asking > about the --all option, right? > > I think it might be reasonable for "ompi_info --all" to actually show *all* > MCA params (up through level 9). Thanks Jeff, I'm totally puzzled by the divergence in opinion in this community on the word ALL. ALL as in "every single one of them", not as in "4 poorly chosen MCA arguments that I don't even know how to care about". > Thoughts? Give back to the word ALL its original meaning: "the whole quantity or extent of a group". > >>> Btw, something is wrong in the following output. I have a "btl = sm,self" >>> in my .openmpi/mca-params.conf so I should not even see the BTL TCP >>> parameters. >> >> I think ompi_info has always shown all the variables despite what you have >> the selection variable set (at least in some cases). We now just display >> everything in all cases. An additional benefit to the updated code is that >> if you set a selection variable through the environment >> (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old >> code unset all selection variables in order to ensure all parameters got >> printed (very annoying but necessary). Ralph's comment above is not accurate. Prior to this change (well, the one from a few weeks ago), explicitly forbidden components did not leave traces in the MCA parameters list. I validated this with the latest stable. > Yes, I think I like this new behavior better, too. > > Does anyone violently disagree? Yes.
This behavior means that every single MPI process out there will 1) load all existing .so components, and 2) give them a chance to leave undesired traces in the memory of the application. So first we generate increased I/O traffic, and second we use memory that shouldn't be used. We can argue about the impact of all this, but from my perspective what I see is that Open MPI is doing it even when explicit arguments to prevent the usage of these components were provided. George. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 5:46 AM, George Bosilca wrote: > On Jul 17, 2013, at 20:15 , "Jeff Squyres (jsquyres)" > wrote: > >> On Jul 17, 2013, at 12:16 PM, Nathan Hjelm wrote: >> >>> As Ralph suggested you need to pass the --level or -l option to see all the >>> variables. --level 9 will print everything. If you think there are >>> variables everyday users should see you are welcome to change them to >>> OPAL_INFO_LVL_1. We are trying to avoid moving too many variables to this >>> info level. >> >> I think George might have a point here, though. He was specifically asking >> about the --all option, right? >> >> I think it might be reasonable for "ompi_info --all" to actually show *all* >> MCA params (up through level 9). > > Thanks Jeff, > > I'm totally puzzled by the divergence in opinion in this community on the > word ALL. ALL like in "every single one of them", not like in "4 poorly > chosen MCA arguments that I don't even know how to care about". I don't think there is a divergence of opinion on this - I think it was likely a programming oversight. I certainly would agree that all should operate that way. > >> Thoughts? > > Give back to the word ALL it's original meaning: "the whole quantity or > extent of a group". > >> Btw, something is wrong i the following output. I have an "btl = sm,self" in my .openmpi/mca-params.conf so I should not even see the BTL TCP parameters. >>> >>> I think ompi_info has always shown all the variables despite what you have >>> the selection variable set (at least in some cases). We now just display >>> everything in all cases. An additional benefit to the updated code is that >>> if you set a selection variable through the environment >>> (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old >>> code unset all selection variables in order to ensure all parameters got >>> printed (very annoying but necessary). > > Ralph comment above is not accurate. 
Prior to this change (well the one from > few weeks ago), explicitly forbidden components did not leave traces in the > MCA parameters list. I validate this with the latest stable. FWIW: that wasn't my comment > >> Yes, I think I like this new behavior better, too. >> >> Does anyone violently disagree? > > Yes. This behavior means the every single MPI process out there will 1) load > all existing .so components, and 2) will give them a chance to leave > undesired traces in the memory of the application. So first we generate an > increased I/O traffic, and 2) we use memory that shouldn't be used. We can > argue about the impact of all this, but from my perspective what I see is > that Open MPI is doing it when explicit arguments to prevent the usage of > these component were provided. That's a good point, and a bad behavior. IIRC, it results from the MPI Forum's adoption of the MPI-T requirement that stipulates we must allow access to all control and performance variables at startup so they can be externally seen/manipulated. I guess the question is: does that truly mean "all" per your proposed definition, or "all that fall within the pre-given MCA directives on components"? > > George.
Re: [OMPI devel] pmi2 slurm/openmpi patch
Thanks Piotr - I'll apply that and move it to the 1.7 branch.

Some of us are trying to test the pmi2 support in 2.6.0 and hitting a problem. We have verified that the pmi2 support was built/installed, and that both slurmctld and slurmd are at 2.6.0 level. When we run "srun --mpi-list", we get:

srun: MPI types are...
srun: mpi/mvapich
srun: mpi/pmi2
srun: mpi/mpich1_shmem
srun: mpi/mpich1_p4
srun: mpi/none
srun: mpi/lam
srun: mpi/openmpi
srun: mpi/mpichmx
srun: mpi/mpichgm

So it looks like the install is correct. However, when we attempt to run a job with "srun --mpi=pmi2 foo", we get an error from the slurmd on the remote node:

slurmd[n1]: mpi/pmi2: no value for key in req

and the PMI calls in the app fail. Any ideas as to the source of the problem? Do we have to configure something else, or start slurmd with some option?

Thanks
Ralph

On Jul 18, 2013, at 2:02 AM, Piotr Lesnicki wrote:

> Hello,
>
> I think there are a few things still missing in openmpi pmi2 to make it work with
> slurm. We are the ones at Bull who integrated the pmi2 code from mpich2 to
> slurm. The attached patch should fix the issue (call slurm with --mpi=pmi2).
> This still needs to be checked with other pmi2 implementations (we use pmi2.h
> but some use pmi.h? constants are prefixed with PMI2_ but some use PMI_?).
>
> Piotr Lesnicki
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: > That's a good point, and a bad behavior. IIRC, it results from the MPI > Forum's adoption of the MPI-T requirement that stipulates we must allow > access to all control and performance variables at startup so they can be > externally seen/manipulated. Minor nit: MPI_T does not require this. However, it does recommend that you offer users access to as many variables as possible as early as reasonably possible for the convenience and control of the user. If an implementation chooses to offer 5% of the possible control/performance variables to the user just before MPI_Finalize, that's still a valid MPI_T implementation. But it may not be a very useful one... -Dave
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 15:06 , Ralph Castain wrote: >>> I think ompi_info has always shown all the variables despite what you have the selection variable set (at least in some cases). We now just display everything in all cases. An additional benefit to the updated code is that if you set a selection variable through the environment (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old code unset all selection variables in order to ensure all parameters got printed (very annoying but necessary). >> >> Ralph's comment above is not accurate. Prior to this change (well the one from >> few weeks ago), explicitly forbidden components did not leave traces in the >> MCA parameters list. I validated this with the latest stable. > > FWIW: that wasn't my comment Sorry Ralph, I was wrong - the comment was from Nathan. This discussion grew out of hand; it became difficult to follow who said what and when. George.
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell) wrote: > On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: > >> That's a good point, and a bad behavior. IIRC, it results from the MPI >> Forum's adoption of the MPI-T requirement that stipulates we must allow >> access to all control and performance variables at startup so they can be >> externally seen/manipulated. > > Minor nit: MPI_T does not require this. However, it does recommend that you > offer users access to as many variables as possible as early as reasonably > possible for the convenience and control of the user. > > If an implementation chooses to offer 5% of the possible control/performance > variables to the user just before MPI_Finalize, that's still a valid MPI_T > implementation. But it may not be a very useful one... The problem here is one of use vs startup performance. George is quite correct with his concerns - this behavior would have been a serious problem for RoadRunner, for example, where we had a small IO channel feeding a lot of nodes. It will definitely become an issue at exascale where IO bandwidth and memory will be at a premium. This is especially troubling when you consider how few people will ever use this capability. Perhaps we should offer a switch that says "I want access to MPI-T" so that the rest of the world isn't hammered by this kind of behavior? > > -Dave
Re: [OMPI devel] ompi_info
On Thu, Jul 18, 2013 at 07:53:35AM -0700, Ralph Castain wrote: > > On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell) > wrote: > > > On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: > > > >> That's a good point, and a bad behavior. IIRC, it results from the MPI > >> Forum's adoption of the MPI-T requirement that stipulates we must allow > >> access to all control and performance variables at startup so they can be > >> externally seen/manipulated. > > > > Minor nit: MPI_T does not require this. However, it does recommend that > > you offer users access to as many variables as possible as early as > > reasonably possible for the convenience and control of the user. > > > > If an implementation chooses to offer 5% of the possible > > control/performance variables to the user just before MPI_Finalize, that's > > still a valid MPI_T implementation. But it may not be a very useful one... > > The problem here is one of use vs startup performance. George is quite > correct with his concerns - this behavior would have been a serious problem > for RoadRunner, for example, where we had a small IO channel feeding a lot of > nodes. It will definitely become an issue at exascale where IO bandwidth and > memory will be at a premium. > > This is especially troubling when you consider how few people will ever use > this capability. Perhaps we should offer a switch that says "I want access to > MPI-T" so that the rest of the world isn't hammered by this kind of behavior? This was discussed in depth before the MCA rewrite came into the trunk. There are only two cases where we load and register all the available components: ompi_info, and MPI_T_init_thread(). The normal MPI case does not have this behavior and instead loads only the requested components. -Nathan
[OMPI devel] KNEM + user-space hybrid for sm BTL
Hello,

Could someone who is more familiar with the architecture of the sm BTL comment on the technical feasibility of the following: is it possible to easily extend the BTL (i.e. without having to rewrite it completely from scratch) so as to be able to perform transfers using both KNEM (or another kernel-assisted copying mechanism) for messages over a given size and the normal user-space mechanism for smaller messages, with the switch-over point being a user-tunable parameter?

From what I've seen, both implementations have something in common, e.g. both use FIFOs to communicate controlling information.

The motivation behind this is our effort to become greener by extracting the best possible out-of-the-box performance on our systems without having to profile each and every user application that runs on them. We've already determined that activating KNEM really benefits some collective operations on big shared-memory systems, but the increased latency significantly slows down small message transfers, which also hits the pipelined implementations.

sm's code doesn't seem to be very complex, but still I've decided to ask first before diving any deeper.

Kind regards,
Hristo
--
Hristo Iliev, PhD - High Performance Computing Team
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 9:53 AM, Ralph Castain wrote: > On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell) > wrote: > >> On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: >> >>> That's a good point, and a bad behavior. IIRC, it results from the MPI >>> Forum's adoption of the MPI-T requirement that stipulates we must allow >>> access to all control and performance variables at startup so they can be >>> externally seen/manipulated. >> >> Minor nit: MPI_T does not require this. However, it does recommend that you >> offer users access to as many variables as possible as early as reasonably >> possible for the convenience and control of the user. >> >> If an implementation chooses to offer 5% of the possible control/performance >> variables to the user just before MPI_Finalize, that's still a valid MPI_T >> implementation. But it may not be a very useful one... > > The problem here is one of use vs startup performance. George is quite > correct with his concerns - this behavior would have been a serious problem > for RoadRunner, for example, where we had a small IO channel feeding a lot of > nodes. It will definitely become an issue at exascale where IO bandwidth and > memory will be at a premium. My point was not that the performance concerns were unfounded. Rather, I wanted to point out that the "load everything" behavior is not a hard requirement from the MPI standard, so we have room for different implementation choices/tradeoffs. -Dave
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 8:17 AM, "David Goodell (dgoodell)" wrote: > On Jul 18, 2013, at 9:53 AM, Ralph Castain wrote: > >> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell) >> wrote: >> >>> On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: >>> That's a good point, and a bad behavior. IIRC, it results from the MPI Forum's adoption of the MPI-T requirement that stipulates we must allow access to all control and performance variables at startup so they can be externally seen/manipulated. >>> >>> Minor nit: MPI_T does not require this. However, it does recommend that >>> you offer users access to as many variables as possible as early as >>> reasonably possible for the convenience and control of the user. >>> >>> If an implementation chooses to offer 5% of the possible >>> control/performance variables to the user just before MPI_Finalize, that's >>> still a valid MPI_T implementation. But it may not be a very useful one... >> >> The problem here is one of use vs startup performance. George is quite >> correct with his concerns - this behavior would have been a serious problem >> for RoadRunner, for example, where we had a small IO channel feeding a lot >> of nodes. It will definitely become an issue at exascale where IO bandwidth >> and memory will be at a premium. > > My point was not that the performance concerns were unfounded. Rather, I > wanted to point out that the "load everything" behavior is not a hard > requirement from the MPI standard, so we have room for different > implementation choices/tradeoffs. I understood - I was more just pointing out the potential performance issue of load everything. However, Nathan has addressed it by pointing out that the problem is my aged, fading memory. > > -Dave
Re: [OMPI devel] KNEM + user-space hybrid for sm BTL
On Jul 18, 2013, at 17:12 , "Iliev, Hristo" wrote: > Hello, > > Could someone, who is more familiar with the architecture of the sm BTL, > comment on the technical feasibility of the following: is it possible to > easily extend the BTL (i.e. without having to rewrite it completely from > scratch) so as to be able to perform transfers using both KNEM (or other > kernel-assisted copying mechanism) for messages over a given size and the > normal user-space mechanism for smaller messages with the switch-over point > being a user-tunable parameter? This is already what the SM BTL does. When support for kernel-assisted mechanisms is enabled everything under the eager size is going over "traditional" shared memory (double copy and so on), while larger messages use the single-copy mechanism. George. > > From what I’ve seen, both implementations have something in common, e.g. both > use FIFOs to communicate controlling information. > The motivation behind this are our efforts to become greener by extracting > the best possible out of the box performance on our systems without having to > profile each and every user application that runs on them. We’ve already > determined that activating KNEM really benefits some collective operations on > big shared-memory systems, but the increased latency significantly slows down > small message transfers, which also hits the pipelined implementations. > > sm’s code doesn’t seem to be very complex but still I’ve decided to ask first > before diving any deeper. > > Kind regards, > Hristo > -- > Hristo Iliev, PhD – High Performance Computing Team > RWTH Aachen University, Center for Computing and Communication > Rechen- und Kommunikationszentrum der RWTH Aachen > Seffenter Weg 23, D 52074 Aachen (Germany)
Re: [OMPI devel] ompi_info
On Thu, Jul 18, 2013 at 08:33:37AM -0700, Ralph Castain wrote: > > On Jul 18, 2013, at 8:17 AM, "David Goodell (dgoodell)" > wrote: > > > On Jul 18, 2013, at 9:53 AM, Ralph Castain wrote: > > > >> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell) > >> wrote: > >> > >>> On Jul 18, 2013, at 8:06 AM, Ralph Castain wrote: > >>> > That's a good point, and a bad behavior. IIRC, it results from the MPI > Forum's adoption of the MPI-T requirement that stipulates we must allow > access to all control and performance variables at startup so they can > be externally seen/manipulated. > >>> > >>> Minor nit: MPI_T does not require this. However, it does recommend that > >>> you offer users access to as many variables as possible as early as > >>> reasonably possible for the convenience and control of the user. > >>> > >>> If an implementation chooses to offer 5% of the possible > >>> control/performance variables to the user just before MPI_Finalize, > >>> that's still a valid MPI_T implementation. But it may not be a very > >>> useful one... > >> > >> The problem here is one of use vs startup performance. George is quite > >> correct with his concerns - this behavior would have been a serious > >> problem for RoadRunner, for example, where we had a small IO channel > >> feeding a lot of nodes. It will definitely become an issue at exascale > >> where IO bandwidth and memory will be at a premium. > > > > My point was not that the performance concerns were unfounded. Rather, I > > wanted to point out that the "load everything" behavior is not a hard > > requirement from the MPI standard, so we have room for different > > implementation choices/tradeoffs. > > I understood - I was more just pointing out the potential performance issue > of load everything. However, Nathan has addressed it by pointing out that the > problem is my aged, fading memory. 
So, I think what I can take from this discussion is to make the following changes to ompi_info:

 - Make --all without a --level option imply --level 9.
 - Allow the user to modify this behavior by specifying a level: e.g., --all --level 5 would print every variable up to level 5.

I will make these changes today and CMR them into 1.7.3 unless there are any objections.

-Nathan
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 17:07 , Nathan Hjelm wrote: > This was discussed in depth before the MCA rewrite came into the trunk. There > are only two cases where we load and register all the available components: > ompi_info, and MPI_T_init_thread(). The normal MPI case does not have this > behavior and instead loads only the requested components. How is this part of the code validated? It might capitalize on some type of "trust". Unfortunately … I have no such notion. I would rather take the path of "least astonishment": a __consistent__ behavior where we always abide by the configuration files (user level as well as system level). If you want to see every single parameter possibly available to you (based on your rights, of course), temporarily remove the configuration file. Or we can provide a specific ompi_info option to ignore the configuration files, but not make this the default. George.
Re: [OMPI devel] KNEM + user-space hybrid for sm BTL
On Jul 18, 2013, at 11:12, "Iliev, Hristo" wrote: > Hello, > > Could someone, who is more familiar with the architecture of the sm BTL, > comment on the technical feasibility of the following: is it possible to > easily extend the BTL (i.e. without having to rewrite it completely from > scratch) so as to be able to perform transfers using both KNEM (or other > kernel-assisted copying mechanism) for messages over a given size and the > normal user-space mechanism for smaller messages with the switch-over point > being a user-tunable parameter? > > From what I’ve seen, both implementations have something in common, e.g. both > use FIFOs to communicate controlling information. > The motivation behind this are our efforts to become greener by extracting > the best possible out of the box performance on our systems without having to > profile each and every user application that runs on them. We’ve already > determined that activating KNEM really benefits some collective operations on > big shared-memory systems, but the increased latency significantly slows down > small message transfers, which also hits the pipelined implementations. > Hristo, The knem support currently available in the trunk does just this :) You can use either knem or Linux CMA to accelerate interprocess transfers. You can use the following mca parameter to turn on knem mode: -mca btl_sm_use_knem 1 If my memory serves me well, anything under the eager limit is sent by regular double copy: -mca btl_sm_eager_limit 4096 (is the default, so anything below 1 page is copy-in, copy-out). If I remember correctly, anything below 16k decreased performance. We also have a collective component leveraging knem capabilities. If you want more info about the details, you can look at the following paper we published at IPDPS last year. It covers what we found to be the best cutoff values for using (or not) knem in several collectives.
Teng Ma, George Bosilca, Aurelien Bouteiller, Jack Dongarra, "HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters," Parallel and Distributed Processing Symposium, International, pp. 970-982, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012 http://www.computer.org/csdl/proceedings/ipdps/2012/4675/00/4675a970-abs.html Enjoy, Aurelien > sm’s code doesn’t seem to be very complex but still I’ve decided to ask first > before diving any deeper. > > Kind regards, > Hristo > -- > Hristo Iliev, PhD – High Performance Computing Team > RWTH Aachen University, Center for Computing and Communication > Rechen- und Kommunikationszentrum der RWTH Aachen > Seffenter Weg 23, D 52074 Aachen (Germany) -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 309b * Knoxville, TN 37996 * 865 974 9375
Re: [OMPI devel] ompi_info
On Thu, Jul 18, 2013 at 05:50:40PM +0200, George Bosilca wrote: > On Jul 18, 2013, at 17:07 , Nathan Hjelm wrote: > > > This was discussed in depth before the MCA rewrite came into the trunk. > > There are only two cases where we load and register all the available > > components: ompi_info, and MPI_T_init_thread(). The normal MPI case does > > not have this behavior and instead loads only the requested components. > > How is this part of the code validated? It might capitalize on some type of > "trust". Unfortunately … I have no such notion. The fact that ompi_mpi_init never calls ompi_info_register_params(), which is the only path that sets MCA_BASE_REGISTER_ALL when registering framework parameters. The register-all behavior has to be explicitly asked for. > I would rather take the path of the "least astonishment", a __consistent__ > behavior where we always abide by the configuration files (user level as well > as system level). If you want to see every single parameter possibly > available to you (based on your rights of course), temporarily remove the > configuration file. Or we can provide a specific ompi_info option to ignore > the configuration files, but not make this the default. In some ways this was the default behavior (if no file values were set). The current behavior was chosen to be consistent and reflect what I thought was the original intent. The old behavior would ignore component selection variables set in the environment (ompi_info actually unset them). So, if you set one of these variables in the environment (or the ompi_info command line) you would 1) still get all components in the framework, and 2) not see the variable as set even though it is in an actual run.
So, if I did:

  export OMPI_MCA_btl=self,sm

or added --mca btl self,sm to the ompi_info command line, I would still see all the btls plus this:

  MCA btl: parameter "btl" (current value: "", data source: default, level: 2 user/detail, type: string)
           Default selection set of components for the btl framework (<none> means use all components that can be found)

instead of:

  MCA btl: parameter "btl" (current value: "self,sm", data source: environment, level: 2 user/detail, type: string)
           Default selection set of components for the btl framework (<none> means use all components that can be found)

Very annoying!

That said, the register-all behavior is easy to control. If there is a consensus that we need another ompi_info option I am more than happy to add it. But then again, --all should mean all components, all frameworks, all levels.

-Nathan
Re: [OMPI devel] ompi_info
On Jul 18, 2013, at 11:50 AM, George Bosilca wrote: > How is this part of the code validated? It might capitalize on some type of > "trust". Unfortunately … I have no such notion. Not sure what you're asking here. > I would rather take the path of the "least astonishment", a __consistent__ > behavior where we always abide to the configuration files (user level as well > as system level). If you want to see every single parameter possibly > available to you (based on your rights of course), temporary remove the > configuration file. Or we can provide a specific ompi_info option to ignore > the configuration files, but not make this the default. I think MPI applications and ompi_info are different cases. 1. We've definitely had cases of user (and OMPI developer!) confusion over the years where people would run ompi_info and not see their favorite MCA component listed. After a while, they figured out it was because they had an env variable/file limiting which components were used (e.g., OMPI_MCA_btl=sm,tcp,self would silently disable all other BTLs in ompi_info output). This actually seems to be fairly counter-intuitive behavior, if you ask me -- it was done this way as an artifact of the old implementation architecture. Personally, I think changing ompi_info's behavior to always list all components is a good idea. Is there a reason to be concerned about the memory footprint and IO traffic of running ompi_info? What might be a useful addition, however, is in the above example (user has OMPI_MCA_btl=sm,tcp,self in their environment) to somehow mark all other BTL params as "inactive because of OMPI_MCA_BTL env variable value", or something like that. *** If someone wants this behavior, please propose a specific way to mark this in both prettyprint and parsable ompi_info output. 2. MPI application behavior has not changed -- if you call MPI_Init, we open exactly the same frameworks/components that were opened before.
But if you're using a tool (i.e., call the MPI_T init function), then you pay an extra price (potentially more dlopens, more memory usage, etc.). This is the same as it has always been for tools: tools cost something (memory, performance, whatever). That being said, if you want a different behavior, please propose something specific (e.g., specific new MCA param + value(s) for specific behavior(s)). -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] RFC: Change ompi_proc_t endpoint data lookup
What: Change the ompi_proc_t endpoint data lookup to be more flexible Why: As collectives and one-sided components are using transports directly, an old problem of endpoint tracking is resurfacing. We need a fix that doesn't suck. When: Assuming there are no major objections, I'll start writing the code next week... More Info: Today, endpoint information is stored in one of two places on the ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque structure having meaning only to the PML and the proc_bml pointer is an opaque structure having meaning only to the BML. CM, OB1, and BFO don't use proc_pml, although the MTLs store their endpoint data on the proc_pml. R2 uses the proc_bml to hold an opaque data structure which holds all the btl endpoint data. The specific problem is the Portals 4 collective and one-sided components. They both need endpoint information for communication (obviously). Before there was a Portals 4 BTL, they peeked at the proc_pml pointer, knew what it looked like, and were ok. Now the data they need is possibly in the proc_pml or in the (opaque) proc_bml, which poses a problem. Jeff and I talked about this and had a number of restrictions that seemed to make sense for a solution: * Don't make ompi_proc_t bigger than absolutely necessary * Avoid adding extra indirection into the endpoint resolution path * Allow enough flexibility that IB or friends could use the same mechanism * Don't break the BML / BTL interface (too much work) What we came up with was a two pronged approach, depending on run-time needs. First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we would have a proc_endpoint[] array of fixed size. The size of the array would be determined at compile time based on compile-time registering of endpoint slots. At compile time, a #define with a component's slot would be set, removing any extra indexing overhead over today's mechanism. 
So R2 would have a call in its configure.m4 like: OMPI_REQUIRE_ENDPOINT_TAG(BML_R2) And would then find its endpoint data with a call like: r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2]; which (assuming modest compiler optimization) is instruction equivalent to: r2_endpoint = proc->proc_bml; To allow for dynamic indexing (something we haven't had to date), the last entry in the array would be a pointer to an object like an opal_pointer_array, but without the locking, and some allocation calls during init. Since the indexes never need to be used by a remote process, there's no synchronization required in registering. The dynamic indexing could be turned off at configure time for space-conscious builds. For example, on our big systems, I disable dlopen support, so static allocation of endpoint slots is good enough. In the average build, the only tag registered would be BML_R2. If we lazily allocate the pointer array element, that's two entries in the proc_endpoint array, so the same size as today. I was going to have the CM stop using the endpoint and push that handling onto the MTL. Assuming all MTLs but Portals shared the same tag (easy to do), there'd be an 8*nprocs increase in space used per process if an MTL was built, but if you disabled R2, that disappears. How does this solve my problem? Rather than having Portals 4 use the MTL tag, it would have its own tag, shared between the MTL, BTL, OSC, and COLL components. Since the chances of Portals 4 being built on a platform with support for another MTL are almost zero, in most cases, the size of the ompi_proc_t only increases by 8 bytes over today's setup. Since most Portals 4 builds will be on more static platforms, I can disable dynamic indexing and be back at today's size, but with an easy way to deal with endpoint data sharing between components of different frameworks.
So, to review our original goals: * ompi_proc_t will remain the same size on most platforms, increase by 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on static systems (by disabling dynamic indexing and building only one of either the MTLs or BMLs). * If you're using a pre-allocated tag, there's no extra indirection or math, assuming basic compiler optimization. There is a higher cost for dynamic tags, but that's probably ok for us. * I think that IB could start registering a tag if it needed for sharing QP information between frameworks, at the cost of an extra tag. Probably makes the most sense for the MXM case (assuming someone writes an MXM osc component). * The PML interface would change slightly (remove about 5 lines of code / pml). The MTL would have to change a bit to look at their own tag instead of the proc_pml (fairly easy). The R2 BML would need to change to use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that shouldn't be hard. The consumers of the BML (OB1, BFO, RDMA OSC, etc.) would not have to change. I know RFCs are usually sent after the code is written, bu
Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup
+1, but I helped come up with the idea. :-) On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" wrote: > What: Change the ompi_proc_t endpoint data lookup to be more flexible > > Why: As collectives and one-sided components are using transports > directly, an old problem of endpoint tracking is resurfacing. We need a > fix that doesn't suck. > > When: Assuming there are no major objections, I'll start writing the code > next week... > > More Info: > > Today, endpoint information is stored in one of two places on the > ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque > structure having meaning only to the PML and the proc_bml pointer is an > opaque structure having meaning only to the BML. CM, OB1, and BFO don't > use proc_pml, although the MTLs store their endpoint data on the proc_pml. > R2 uses the proc_bml to hold an opaque data structure which holds all the > btl endpoint data. > > The specific problem is the Portals 4 collective and one-sided components. > They both need endpoint information for communication (obviously). > Before there was a Portals 4 BTL, they peeked at the proc_pml pointer, > knew what it looked like, and were ok. Now the data they need is possibly > in the proc_pml or in the (opaque) proc_bml, which poses a problem. > > Jeff and I talked about this and had a number of restrictions that seemed > to make sense for a solution: > > * Don't make ompi_proc_t bigger than absolutely necessary > * Avoid adding extra indirection into the endpoint resolution path > * Allow enough flexibility that IB or friends could use the same > mechanism > * Don't break the BML / BTL interface (too much work) > > What we came up with was a two pronged approach, depending on run-time > needs. > > First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we > would have a proc_endpoint[] array of fixed size. The size of the array > would be determined at compile time based on compile-time registering of > endpoint slots. 
At compile time, a #define with a component's slot would > be set, removing any extra indexing overhead over today's mechanism. So > R2 would have a call in it's configure.m4 like: > > OMPI_REQUIRE_ENDPOINT_TAG(BML_R2) > > And would then find it's endpoint data with a call like: > > r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2]; > > which (assuming modest compiler optimization) is instruction equivalent to: > > r2_endpoint = proc->proc_bml; > > To allow for dynamic indexing (something we haven't had to date), the last > entry in the array would be a pointer to an object like an > opal_pointer_array, but without the locking, and some allocation calls > during init. Since the indexes never need to be used by a remote process, > there's no synchronization required in registering. The dynamic indexing > could be turned off at configure time for space-concious builds. For > example, on our big systems, I disable dlopen support, so static > allocation of endpoint slots is good enough. > > In the average build, the only tag registered would be BML_R2. If we lazy > allocate the pointer array element, that's two entries in the > proc_endpoint array, so the same size as today. I was going to have the > CM stop using the endpoint and push that handling on the MTL. Assuming > all MTLs but Portals shared the same tag (easy to do), there'd be an > 8*nprocs increase in space used per process if an MTL was built, but if > you disabled R2, that disappears. > > How does this solve my problem? Rather than having Portals 4 use the MTL > tag, it would have it's own tag, shared between the MTL, BTL, OSC, and > COLL components. Since the chances of Portals 4 being built on a platform > with support for another MTL is almost zero, in most cases, the size of > the ompi_proc_t only increases by 8 bytes over today's setup. 
Since most > Portals 4 builds will be on more static platforms, I can disable dynamic > indexing and be back at today's size, but with an easy way to deal with > endpoint data sharing between components of different frameworks. > > So, to review our original goals: > > * ompi_proc_t will remain the same size on most platforms, increase by > 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on > static systems (by disabling dynamic indexing and building only one of > either the MTLs or BMLs). > * If you're using a pre-allocated tag, there's no extra indirection or > math, assuming basic compiler optimization. There is a higher cost for > dynamic tags, but that's probably ok for us. > * I think that IB could start registering a tag if it needed for sharing > QP information between frameworks, at the cost of an extra tag. Probably > makes the most sense for the MXM case (assuming someone writes an MXM osc > component). > * The PML interface would change slightly (remove about 5 lines of code > / pml). The MTL would have to change a bit to look at their own tag > instead of the proc_pml
Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup
+1, though I do have a question. We are looking at exascale requirements, and one of the big issues is memory footprint. We currently retrieve the endpoint info for every process in the job, plus all the procs in any communicator with which we do a connect/accept - even though we probably will only communicate with a small number of them. This wastes a lot of memory at scale. As long as we are re-working the endpoint stuff, would it be a thought to go ahead and change how we handle the above? I'm looking to switch to a lazy definition approach where we compute endpoints for procs on first-message instead of during mpi_init, retrieving the endpoint info for that proc only at that time. So instead of storing all the endpoint info for every proc in each proc, each proc only would contain the info it requires for that application. Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe changing it to a sparse array/list of some type, so we only create that storage for procs we actually communicate to. If you'd prefer to discuss this as a separate issue, that's fine - just something we need to work on at some point in the next year or two. On Jul 18, 2013, at 6:26 PM, "Jeff Squyres (jsquyres)" wrote: > +1, but I helped come up with the idea. :-) > > > On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W" wrote: > >> What: Change the ompi_proc_t endpoint data lookup to be more flexible >> >> Why: As collectives and one-sided components are using transports >> directly, an old problem of endpoint tracking is resurfacing. We need a >> fix that doesn't suck. >> >> When: Assuming there are no major objections, I'll start writing the code >> next week... >> >> More Info: >> >> Today, endpoint information is stored in one of two places on the >> ompi_proc_t: proc_pml and proc_bml. The proc_pml pointer is an opaque >> structure having meaning only to the PML and the proc_bml pointer is an >> opaque structure having meaning only to the BML. 
CM, OB1, and BFO don't >> use proc_pml, although the MTLs store their endpoint data on the proc_pml. >> R2 uses the proc_bml to hold an opaque data structure which holds all the >> btl endpoint data. >> >> The specific problem is the Portals 4 collective and one-sided components. >> They both need endpoint information for communication (obviously). >> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer, >> knew what it looked like, and were ok. Now the data they need is possibly >> in the proc_pml or in the (opaque) proc_bml, which poses a problem. >> >> Jeff and I talked about this and had a number of restrictions that seemed >> to make sense for a solution: >> >> * Don't make ompi_proc_t bigger than absolutely necessary >> * Avoid adding extra indirection into the endpoint resolution path >> * Allow enough flexibility that IB or friends could use the same >> mechanism >> * Don't break the BML / BTL interface (too much work) >> >> What we came up with was a two pronged approach, depending on run-time >> needs. >> >> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we >> would have a proc_endpoint[] array of fixed size. The size of the array >> would be determined at compile time based on compile-time registering of >> endpoint slots. At compile time, a #define with a component's slot would >> be set, removing any extra indexing overhead over today's mechanism. So >> R2 would have a call in it's configure.m4 like: >> >> OMPI_REQUIRE_ENDPOINT_TAG(BML_R2) >> >> And would then find it's endpoint data with a call like: >> >> r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2]; >> >> which (assuming modest compiler optimization) is instruction equivalent to: >> >> r2_endpoint = proc->proc_bml; >> >> To allow for dynamic indexing (something we haven't had to date), the last >> entry in the array would be a pointer to an object like an >> opal_pointer_array, but without the locking, and some allocation calls >> during init. 
Since the indexes never need to be used by a remote process, >> there's no synchronization required in registering. The dynamic indexing >> could be turned off at configure time for space-concious builds. For >> example, on our big systems, I disable dlopen support, so static >> allocation of endpoint slots is good enough. >> >> In the average build, the only tag registered would be BML_R2. If we lazy >> allocate the pointer array element, that's two entries in the >> proc_endpoint array, so the same size as today. I was going to have the >> CM stop using the endpoint and push that handling on the MTL. Assuming >> all MTLs but Portals shared the same tag (easy to do), there'd be an >> 8*nprocs increase in space used per process if an MTL was built, but if >> you disabled R2, that disappears. >> >> How does this solve my problem? Rather than having Portals 4 use the MTL >> tag, it would have it's own tag, shared between the MTL, BTL, OSC, and >> COLL
Re: [OMPI devel] [EXTERNAL] Re: RFC: Change ompi_proc_t endpoint data lookup
On 7/18/13 7:39 PM, "Ralph Castain" <r...@open-mpi.org> wrote: We are looking at exascale requirements, and one of the big issues is memory footprint. We currently retrieve the endpoint info for every process in the job, plus all the procs in any communicator with which we do a connect/accept - even though we probably will only communicate with a small number of them. This wastes a lot of memory at scale. As long as we are re-working the endpoint stuff, would it be a thought to go ahead and change how we handle the above? I'm looking to switch to a lazy definition approach where we compute endpoints for procs on first-message instead of during mpi_init, retrieving the endpoint info for that proc only at that time. So instead of storing all the endpoint info for every proc in each proc, each proc only would contain the info it requires for that application. It depends on what you mean by endpoint information. If you mean what I call endpoint information (the stuff the PML/MTL/BML stores on an ompi_proc_t), then I really don't care. For Portals, the endpoint information is quite small (8-16 bytes, depending on addressing mode), so I'd rather pre-populate the array and not slow down the send path with yet another conditional than have to check for endpoint data. Of course, given the Portals usage model, I'd really like to jam the endpoint data into shared memory at some point (not this patch). If others want to figure out how to do lazy endpoint data setup for their network, I think that's reasonable. Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe changing it to a sparse array/list of some type, so we only create that storage for procs we actually communicate to. This would actually break a whole lot of things in OMPI and is a huge change.
However, I still have plans to add a --enable-minimal-memory type option some day which will make the ompi_proc_t significantly smaller by assuming homogeneous convertors and that you can programmatically get a remote host name when needed. Again, unless we need to get micro-small (and I don't think we do), the sparseness requires conditionals in the critical path that worry me. If you'd prefer to discuss this as a separate issue, that's fine - just something we need to work on at some point in the next year or two. I agree some work is needed, but I think it's orthogonal to this issue and is something we're going to need to study in detail. There are a number of space/time tradeoffs in that path. Which isn't a problem, but there's a whole lot of low hanging fruit before we get to the hard stuff. Now if you want the OFED interfaces to run at exascale, well, buy lots of memory. Brian -- Brian W. Barrett Scalable System Software Group Sandia National Laboratories