[OMPI devel] pmi2 slurm/openmpi patch

2013-07-18 Thread Piotr Lesnicki

Hello,

I think there are a few things still missing in openmpi pmi2 to make it work 
with slurm. We are the ones at Bull who integrated the pmi2 code from 
mpich2 into slurm. The attached patch should fix the issue (call slurm 
with --mpi=pmi2). This still needs to be checked with other pmi2 
implementations (we use pmi2.h but some use pmi.h? constants are 
prefixed with PMI2_ but some use PMI_?).
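As a rough sketch of the wrapper pattern this implies (illustrative only, not part of the 
attached patch, and assuming SLURM/MPICH-style pmi2.h names such as PMI2_SUCCESS):

    #include <stdbool.h>
    #ifdef WANT_PMI2_SUPPORT
    #include <pmi2.h>          /* PMI2_Init(), PMI2_SUCCESS, ...               */
    #else
    #include <pmi.h>           /* PMI_Init(), PMI_Get_rank(), PMI_SUCCESS, ... */
    #endif

    /* PMI2 only reports rank/size once, from PMI2_Init(), so cache them;
     * with PMI-1 we can simply query on demand. */
    static int cached_rank = -1, cached_size = -1;

    static bool my_pmi_init(void)
    {
    #ifdef WANT_PMI2_SUPPORT
        int spawned, appnum;
        return PMI2_SUCCESS == PMI2_Init(&spawned, &cached_size,
                                         &cached_rank, &appnum);
    #else
        int spawned;
        return PMI_SUCCESS == PMI_Init(&spawned);
    #endif
    }

    static bool my_pmi_rank(int *rank)
    {
    #ifdef WANT_PMI2_SUPPORT
        *rank = cached_rank;               /* cached at init time */
        return true;
    #else
        return PMI_SUCCESS == PMI_Get_rank(rank);
    #endif
    }

    static bool my_pmi_size(int *size)
    {
    #ifdef WANT_PMI2_SUPPORT
        *size = cached_size;
        return true;
    #else
        return PMI_SUCCESS == PMI_Get_universe_size(size);
    #endif
    }

The attached patch does essentially this in opal/mca/common/pmi, exposing the cached 
values through mca_common_pmi_rank() and mca_common_pmi_size().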


Piotr Lesnicki
diff --git a/opal/mca/common/pmi/common_pmi.c b/opal/mca/common/pmi/common_pmi.c
--- a/opal/mca/common/pmi/common_pmi.c
+++ b/opal/mca/common/pmi/common_pmi.c
@@ -25,6 +25,8 @@
 #include "common_pmi.h"

 static int mca_common_pmi_init_count = 0;
+static int mca_common_pmi_init_size = 0;
+static int mca_common_pmi_init_rank = 0;

 bool mca_common_pmi_init (void) {
 if (0 < mca_common_pmi_init_count++) {
@@ -41,6 +43,8 @@
 }

 if (PMI_SUCCESS != PMI2_Init(&spawned, &size, &rank, &appnum)) {
 mca_common_pmi_init_count--;
 return false;
 }
+mca_common_pmi_init_size = size;
+mca_common_pmi_init_rank = rank;
@@ -107,3 +111,23 @@
 }
 return err_msg;
 }
+
+
+bool mca_common_pmi_rank(int *rank) {
+#ifndef WANT_PMI2_SUPPORT
+    if (PMI_SUCCESS != PMI_Get_rank(&mca_common_pmi_init_rank))
+        return false;
+#endif
+    *rank = mca_common_pmi_init_rank;
+    return true;
+}
+
+
+bool mca_common_pmi_size(int *size) {
+#ifndef WANT_PMI2_SUPPORT
+    if (PMI_SUCCESS != PMI_Get_universe_size(&mca_common_pmi_init_size))
+        return false;
+#endif
+    *size = mca_common_pmi_init_size;
+    return true;
+}
diff --git a/opal/mca/common/pmi/common_pmi.h b/opal/mca/common/pmi/common_pmi.h
--- a/opal/mca/common/pmi/common_pmi.h
+++ b/opal/mca/common/pmi/common_pmi.h
@@ -42,3 +42,6 @@
 OPAL_DECLSPEC char* opal_errmgr_base_pmi_error(int pmi_err);

 #endif
+
+bool mca_common_pmi_rank(int *rank);
+bool mca_common_pmi_size(int *size);
diff --git a/orte/mca/ess/pmi/ess_pmi_module.c b/orte/mca/ess/pmi/ess_pmi_module.c
--- a/orte/mca/ess/pmi/ess_pmi_module.c
+++ b/orte/mca/ess/pmi/ess_pmi_module.c
@@ -38,6 +38,9 @@
 #endif

 #include <pmi.h>
+#ifdef WANT_PMI2_SUPPORT
+#include <pmi2.h>
+#endif

 #include "opal/util/opal_environ.h"
 #include "opal/util/output.h"
@@ -126,7 +129,7 @@
 }
 ORTE_PROC_MY_NAME->jobid = jobid;
 /* get our rank from PMI */
-if (PMI_SUCCESS != (ret = PMI_Get_rank(&i))) {
+if (!(ret = mca_common_pmi_rank(&i))) {
 OPAL_PMI_ERROR(ret, "PMI_Get_rank");
 error = "could not get PMI rank";
 goto error;
@@ -134,7 +137,7 @@
 ORTE_PROC_MY_NAME->vpid = i + 1;  /* compensate for orterun */

 /* get the number of procs from PMI */
-if (PMI_SUCCESS != (ret = PMI_Get_universe_size(&i))) {
+if (!(ret = mca_common_pmi_size(&i))) {
 OPAL_PMI_ERROR(ret, "PMI_Get_universe_size");
 error = "could not get PMI universe size";
 goto error;
@@ -148,6 +151,14 @@
 goto error;
 }
 } else {  /* we are a direct-launched MPI process */
+#ifdef WANT_PMI2_SUPPORT
+/* Get domain id */
+pmi_id = malloc(PMI2_MAX_VALLEN);
+if (PMI_SUCCESS != (ret = PMI2_Job_GetId(pmi_id, PMI2_MAX_VALLEN))) {
+error = "PMI2_Job_GetId failed";
+goto error;
+}
+#else
 /* get our PMI id length */
 if (PMI_SUCCESS != (ret = PMI_Get_id_length_max(&pmi_maxlen))) {
 error = "PMI_Get_id_length_max";
@@ -159,6 +170,7 @@
 error = "PMI_Get_kvs_domain_id";
 goto error;
 }
+#endif
 /* PMI is very nice to us - the domain id is an integer followed
  * by a '.', followed by essentially a stepid. The first integer
  * defines an overall job number. The second integer is the number of
@@ -180,20 +192,22 @@
 ORTE_PROC_MY_NAME->jobid = ORTE_CONSTRUCT_LOCAL_JOBID(jobfam << 16, stepid);

 /* get our rank */
-if (PMI_SUCCESS != (ret = PMI_Get_rank(&i))) {
+if (!(ret = mca_common_pmi_rank(&i))) {
 OPAL_PMI_ERROR(ret, "PMI_Get_rank");
 error = "could not get PMI rank";
 goto error;
 }
 ORTE_PROC_MY_NAME->vpid = i;
+int rank = i;

 /* get the number of procs from PMI */
-if (PMI_SUCCESS != (ret = PMI_Get_universe_size(&i))) {
+if (!(ret = mca_common_pmi_size(&i))) {
 OPAL_PMI_ERROR(ret, "PMI_Get_universe_size");
 error = "could not get PMI universe size";
 goto error;
 }
 orte_process_info.num_procs = i;
+int size = i;
 /* push into the environ for pickup in MPI layer for
  * MPI-3 required info key
  */
@@ -245,6 +259,42 @@
 goto error;
 }

+#ifdef WANT_PMI2_SUPPORT
+/* get our local proc info to find our local rank */
+char *pmapping = malloc(PMI2_MAX_VALLEN);
+  

Re: [OMPI devel] ompi_info

2013-07-18 Thread George Bosilca
On Jul 17, 2013, at 20:15 , "Jeff Squyres (jsquyres)"  
wrote:

> On Jul 17, 2013, at 12:16 PM, Nathan Hjelm  wrote:
> 
>> As Ralph suggested you need to pass the --level or -l option to see all the 
>> variables. --level 9 will print everything. If you think there are variables 
>> everyday users should see you are welcome to change them to OPAL_INFO_LVL_1. 
>> We are trying to avoid moving too many variables to this info level.
> 
> I think George might have a point here, though.  He was specifically asking 
> about the --all option, right?
> 
> I think it might be reasonable for "ompi_info --all" to actually show *all* 
> MCA params (up through level 9).

Thanks Jeff,

I'm totally puzzled by the divergence of opinion in this community on the word 
ALL. ALL as in "every single one of them", not as in "4 poorly chosen MCA 
arguments that I don't even know how to care about".

> Thoughts?

Give back to the word ALL its original meaning: "the whole quantity or extent 
of a group". 

> 
>>> Btw, something is wrong in the following output. I have a "btl = sm,self" 
>>> in my .openmpi/mca-params.conf so I should not even see the BTL TCP 
>>> parameters.
>> 
>> I think ompi_info has always shown all the variables despite what you have 
>> the selection variable set (at least in some cases). We now just display 
>> everything in all cases. An additional benefit to the updated code is that 
>> if you set a selection variable through the environment 
>> (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old 
>> code unset all selection variables in order to ensure all parameters got 
>> printed (very annoying but necessary).

Ralph's comment above is not accurate. Prior to this change (well, the one from 
a few weeks ago), explicitly forbidden components did not leave traces in the MCA 
parameters list. I validated this with the latest stable.

> Yes, I think I like this new behavior better, too.
> 
> Does anyone violently disagree?

Yes. This behavior means that every single MPI process out there will 1) load 
all existing .so components, and 2) give them a chance to leave undesired 
traces in the memory of the application. So first we generate increased I/O 
traffic, and second we use memory that shouldn't be used. We can argue about the 
impact of all this, but from my perspective what I see is that Open MPI is 
doing it even when explicit arguments to prevent the use of these components were 
provided.

  George.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] ompi_info

2013-07-18 Thread Ralph Castain

On Jul 18, 2013, at 5:46 AM, George Bosilca  wrote:

> On Jul 17, 2013, at 20:15 , "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> On Jul 17, 2013, at 12:16 PM, Nathan Hjelm  wrote:
>> 
>>> As Ralph suggested you need to pass the --level or -l option to see all the 
>>> variables. --level 9 will print everything. If you think there are 
>>> variables everyday users should see you are welcome to change them to 
>>> OPAL_INFO_LVL_1. We are trying to avoid moving too many variables to this 
>>> info level.
>> 
>> I think George might have a point here, though.  He was specifically asking 
>> about the --all option, right?
>> 
>> I think it might be reasonable for "ompi_info --all" to actually show *all* 
>> MCA params (up through level 9).
> 
> Thanks Jeff,
> 
> I'm totally puzzled by the divergence in opinion in this community on the 
> word ALL. ALL like in "every single one of them", not like in "4 poorly 
> chosen MCA arguments that I don't even know how to care about".

I don't think there is a divergence of opinion on this - I think it was likely 
a programming oversight. I certainly would agree that all should operate that 
way.

> 
>> Thoughts?
> 
> Give back to the word ALL it's original meaning: "the whole quantity or 
> extent of a group". 
> 
>> 
 Btw, something is wrong in the following output. I have a "btl = sm,self" 
 in my .openmpi/mca-params.conf so I should not even see the BTL TCP 
 parameters.
>>> 
>>> I think ompi_info has always shown all the variables despite what you have 
>>> the selection variable set (at least in some cases). We now just display 
>>> everything in all cases. An additional benefit to the updated code is that 
>>> if you set a selection variable through the environment 
>>> (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old 
>>> code unset all selection variables in order to ensure all parameters got 
>>> printed (very annoying but necessary).
> 
> Ralph comment above is not accurate. Prior to this change (well the one from 
> few weeks ago), explicitly forbidden components did not leave traces in the 
> MCA parameters list. I validate this with the latest stable.

FWIW: that wasn't my comment

> 
>> Yes, I think I like this new behavior better, too.
>> 
>> Does anyone violently disagree?
> 
> Yes. This behavior means the every single MPI process out there will 1) load 
> all existing .so components, and 2) will give them a chance to leave 
> undesired traces in the memory of the application. So first we generate an 
> increased I/O traffic, and 2) we use memory that shouldn't be used. We can 
> argue about the impact of all this, but from my perspective what I see is 
> that Open MPI is doing it when explicit arguments to prevent the usage of 
> these component were provided.

That's a good point, and a bad behavior. IIRC, it results from the MPI Forum's 
adoption of the MPI-T requirement that stipulates we must allow access to all 
control and performance variables at startup so they can be externally 
seen/manipulated. I guess the question is: does that truly mean "all" per your 
proposed definition, or "all that fall within the pre-given MCA directives on 
components"?


> 
>  George.
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] pmi2 slurm/openmpi patch

2013-07-18 Thread Ralph Castain
Thanks Piotr - I'll apply that and move it to the 1.7 branch.

Some of us are trying to test the pmi2 support in 2.6.0 and hitting a problem. 
We have verified that the pmi2 support was built/installed, and that both 
slurmctld and slurmd are at 2.6.0 level. When we run "srun --mpi=list", we get:

srun: MPI types are... 
srun: mpi/mvapich
srun: mpi/pmi2
srun: mpi/mpich1_shmem
srun: mpi/mpich1_p4
srun: mpi/none
srun: mpi/lam
srun: mpi/openmpi
srun: mpi/mpichmx
srun: mpi/mpichgm

So it looks like the install is correct. However, when we attempt to run a job 
with "srun --mpi=pmi2 foo", we get an error from the slurmd on the remote node:

slurmd[n1]: mpi/pmi2: no value for key  in req

and the PMI calls in the app fail. Any ideas as to the source of the problem? 
Do we have to configure something else, or start slurmd with some option?

Thanks
Ralph


On Jul 18, 2013, at 2:02 AM, Piotr Lesnicki  wrote:

> Hello,
> 
> I think there are a few things still missing in openmpi pmi2 to make it work with 
> slurm. We are the ones at Bull who integrated the pmi2 code from mpich2 into 
> slurm. The attached patch should fix the issue (call slurm with --mpi=pmi2). 
> This still needs to be checked with other pmi2 implementations (we use pmi2.h 
> but some use pmi.h? constants are prefixed with PMI2_ but some use PMI_?).
> 
> Piotr Lesnicki
> 



Re: [OMPI devel] ompi_info

2013-07-18 Thread David Goodell (dgoodell)
On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:

> That's a good point, and a bad behavior. IIRC, it results from the MPI 
> Forum's adoption of the MPI-T requirement that stipulates we must allow 
> access to all control and performance variables at startup so they can be 
> externally seen/manipulated.

Minor nit: MPI_T does not require this.  However, it does recommend that you 
offer users access to as many variables as possible as early as reasonably 
possible for the convenience and control of the user.

If an implementation chooses to offer 5% of the possible control/performance 
variables to the user just before MPI_Finalize, that's still a valid MPI_T 
implementation.  But it may not be a very useful one...

-Dave




Re: [OMPI devel] ompi_info

2013-07-18 Thread George Bosilca

On Jul 18, 2013, at 15:06 , Ralph Castain  wrote:

 I think ompi_info has always shown all the variables despite what you have 
 the selection variable set (at least in some cases). We now just display 
 everything in all cases. An additional benefit to the updated code is that 
 if you set a selection variable through the environment 
 (OMPI_MCA_btl=self,sm) it no longer appears as unset in ompi_info. The old 
 code unset all selection variables in order to ensure all parameters got 
 printed (very annoying but necessary).
>> 
>> Ralph comment above is not accurate. Prior to this change (well the one from 
>> few weeks ago), explicitly forbidden components did not leave traces in the 
>> MCA parameters list. I validate this with the latest stable.
> 
> FWIW: that wasn't my comment

Sorry Ralph, I was wrong - the comment was from Nathan. This discussion grew out 
of hand; it became difficult to follow who said what and when.

  George.




Re: [OMPI devel] ompi_info

2013-07-18 Thread Ralph Castain

On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell)  
wrote:

> On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:
> 
>> That's a good point, and a bad behavior. IIRC, it results from the MPI 
>> Forum's adoption of the MPI-T requirement that stipulates we must allow 
>> access to all control and performance variables at startup so they can be 
>> externally seen/manipulated.
> 
> Minor nit: MPI_T does not require this.  However, it does recommend that you 
> offer users access to as many variables as possible as early as reasonably 
> possible for the convenience and control of the user.
> 
> If an implementation chooses to offer 5% of the possible control/performance 
> variables to the user just before MPI_Finalize, that's still a valid MPI_T 
> implementation.  But it may not be a very useful one...

The problem here is one of use vs startup performance. George is quite correct 
with his concerns - this behavior would have been a serious problem for 
RoadRunner, for example, where we had a small IO channel feeding a lot of 
nodes. It will definitely become an issue at exascale where IO bandwidth and 
memory will be at a premium.

This is especially troubling when you consider how few people will ever use 
this capability. Perhaps we should offer a switch that says "I want access to 
MPI-T" so that the rest of the world isn't hammered by this kind of behavior?


> 
> -Dave
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] ompi_info

2013-07-18 Thread Nathan Hjelm
On Thu, Jul 18, 2013 at 07:53:35AM -0700, Ralph Castain wrote:
> 
> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell)  
> wrote:
> 
> > On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:
> > 
> >> That's a good point, and a bad behavior. IIRC, it results from the MPI 
> >> Forum's adoption of the MPI-T requirement that stipulates we must allow 
> >> access to all control and performance variables at startup so they can be 
> >> externally seen/manipulated.
> > 
> > Minor nit: MPI_T does not require this.  However, it does recommend that 
> > you offer users access to as many variables as possible as early as 
> > reasonably possible for the convenience and control of the user.
> > 
> > If an implementation chooses to offer 5% of the possible 
> > control/performance variables to the user just before MPI_Finalize, that's 
> > still a valid MPI_T implementation.  But it may not be a very useful one...
> 
> The problem here is one of use vs startup performance. George is quite 
> correct with his concerns - this behavior would have been a serious problem 
> for RoadRunner, for example, where we had a small IO channel feeding a lot of 
> nodes. It will definitely become an issue at exascale where IO bandwidth and 
> memory will be at a premium.
> 
> This is especially troubling when you consider how few people will ever use 
> this capability. Perhaps we should offer a switch that says "I want access to 
> MPI-T" so that the rest of the world isn't hammered by this kind of behavior?

This was discussed in depth before the MCA rewrite came into the trunk. There 
are only two cases where we load and register all the available components: 
ompi_info, and MPI_T_init_thread(). The normal MPI case does not have this 
behavior and instead loads only the requested components.

-Nathan


[OMPI devel] KNEM + user-space hybrid for sm BTL

2013-07-18 Thread Iliev, Hristo
Hello,



Could someone, who is more familiar with the architecture of the sm BTL,
comment on the technical feasibility of the following: is it possible to
easily extend the BTL (i.e. without having to rewrite it completely from
scratch) so as to be able to perform transfers using both KNEM (or other
kernel-assisted copying mechanism) for messages over a given size and the
normal user-space mechanism for smaller messages with the switch-over point
being a user-tunable parameter?



From what I've seen, both implementations have something in common, e.g.
both use FIFOs to communicate control information.

The motivation behind this is our effort to become greener by extracting
the best possible out-of-the-box performance on our systems without having
to profile each and every user application that runs on them. We've already
determined that activating KNEM really benefits some collective operations
on big shared-memory systems, but the increased latency significantly slows
down small-message transfers, which also hits the pipelined implementations.



sm's code doesn't seem to be very complex but still I've decided to ask
first before diving any deeper.



Kind regards,

Hristo

--

Hristo Iliev, PhD - High Performance Computing Team

RWTH Aachen University, Center for Computing and Communication

Rechen- und Kommunikationszentrum der RWTH Aachen

Seffenter Weg 23, D 52074 Aachen (Germany)









Re: [OMPI devel] ompi_info

2013-07-18 Thread David Goodell (dgoodell)
On Jul 18, 2013, at 9:53 AM, Ralph Castain  wrote:

> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell)  
> wrote:
> 
>> On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:
>> 
>>> That's a good point, and a bad behavior. IIRC, it results from the MPI 
>>> Forum's adoption of the MPI-T requirement that stipulates we must allow 
>>> access to all control and performance variables at startup so they can be 
>>> externally seen/manipulated.
>> 
>> Minor nit: MPI_T does not require this.  However, it does recommend that you 
>> offer users access to as many variables as possible as early as reasonably 
>> possible for the convenience and control of the user.
>> 
>> If an implementation chooses to offer 5% of the possible control/performance 
>> variables to the user just before MPI_Finalize, that's still a valid MPI_T 
>> implementation.  But it may not be a very useful one...
> 
> The problem here is one of use vs startup performance. George is quite 
> correct with his concerns - this behavior would have been a serious problem 
> for RoadRunner, for example, where we had a small IO channel feeding a lot of 
> nodes. It will definitely become an issue at exascale where IO bandwidth and 
> memory will be at a premium.

My point was not that the performance concerns were unfounded.  Rather, I 
wanted to point out that the "load everything" behavior is not a hard 
requirement from the MPI standard, so we have room for different implementation 
choices/tradeoffs.

-Dave




Re: [OMPI devel] ompi_info

2013-07-18 Thread Ralph Castain

On Jul 18, 2013, at 8:17 AM, "David Goodell (dgoodell)"  
wrote:

> On Jul 18, 2013, at 9:53 AM, Ralph Castain  wrote:
> 
>> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell)  
>> wrote:
>> 
>>> On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:
>>> 
 That's a good point, and a bad behavior. IIRC, it results from the MPI 
 Forum's adoption of the MPI-T requirement that stipulates we must allow 
 access to all control and performance variables at startup so they can be 
 externally seen/manipulated.
>>> 
>>> Minor nit: MPI_T does not require this.  However, it does recommend that 
>>> you offer users access to as many variables as possible as early as 
>>> reasonably possible for the convenience and control of the user.
>>> 
>>> If an implementation chooses to offer 5% of the possible 
>>> control/performance variables to the user just before MPI_Finalize, that's 
>>> still a valid MPI_T implementation.  But it may not be a very useful one...
>> 
>> The problem here is one of use vs startup performance. George is quite 
>> correct with his concerns - this behavior would have been a serious problem 
>> for RoadRunner, for example, where we had a small IO channel feeding a lot 
>> of nodes. It will definitely become an issue at exascale where IO bandwidth 
>> and memory will be at a premium.
> 
> My point was not that the performance concerns were unfounded.  Rather, I 
> wanted to point out that the "load everything" behavior is not a hard 
> requirement from the MPI standard, so we have room for different 
> implementation choices/tradeoffs.

I understood - I was more just pointing out the potential performance issue of 
load everything. However, Nathan has addressed it by pointing out that the 
problem is my aged, fading memory.

> 
> -Dave
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] KNEM + user-space hybrid for sm BTL

2013-07-18 Thread George Bosilca

On Jul 18, 2013, at 17:12 , "Iliev, Hristo"  wrote:

> Hello,
>  
> Could someone, who is more familiar with the architecture of the sm BTL, 
> comment on the technical feasibility of the following: is it possible to 
> easily extend the BTL (i.e. without having to rewrite it completely from 
> scratch) so as to be able to perform transfers using both KNEM (or other 
> kernel-assisted copying mechanism) for messages over a given size and the 
> normal user-space mechanism for smaller messages with the switch-over point 
> being a user-tunable parameter?

This is already what the SM BTL does. When support for kernel-assisted 
mechanisms is enabled everything under the eager size is going over 
"traditional" shared memory (double copy and so on), while larger messages use 
the single-copy mechanism.
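A minimal sketch of that switch-over (placeholder names, not the actual sm BTL code):

    #include <stddef.h>

    /* Illustrative stand-ins for the sm BTL's two data paths. */
    static size_t eager_limit = 4096;          /* user-tunable switch-over point */

    static int send_via_fifo(const void *buf, size_t len, int peer)
    {   /* small message: copy into a shared-memory fragment, post on the FIFO */
        (void)buf; (void)len; (void)peer; return 0;
    }

    static int send_via_knem(const void *buf, size_t len, int peer)
    {   /* large message: one kernel-assisted copy into the peer's buffer */
        (void)buf; (void)len; (void)peer; return 0;
    }

    static int sm_send(const void *buf, size_t len, int peer)
    {
        if (len <= eager_limit)
            return send_via_fifo(buf, len, peer);   /* double copy            */
        return send_via_knem(buf, len, peer);       /* knem/CMA single copy   */
    }

The switch-over point corresponds to the user-tunable eager limit discussed later 
in this thread.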

  George.

>  
> From what I’ve seen, both implementations have something in common, e.g. both 
> use FIFOs to communicate controlling information.
> The motivation behind this are our efforts to become greener by extracting 
> the best possible out of the box performance on our systems without having to 
> profile each and every user application that runs on them. We’ve already 
> determined that activating KNEM really benefits some collective operations on 
> big shared-memory systems, but the increased latency significantly slows down 
> small message transfers, which also hits the pipelined implementations.
>  
> sm’s code doesn’t seem to be very complex but still I’ve decided to ask first 
> before diving any deeper.
>  
> Kind regards,
> Hristo
> --
> Hristo Iliev, PhD – High Performance Computing Team
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] ompi_info

2013-07-18 Thread Nathan Hjelm
On Thu, Jul 18, 2013 at 08:33:37AM -0700, Ralph Castain wrote:
> 
> On Jul 18, 2013, at 8:17 AM, "David Goodell (dgoodell)"  
> wrote:
> 
> > On Jul 18, 2013, at 9:53 AM, Ralph Castain  wrote:
> > 
> >> On Jul 18, 2013, at 7:05 AM, David Goodell (dgoodell)  
> >> wrote:
> >> 
> >>> On Jul 18, 2013, at 8:06 AM, Ralph Castain  wrote:
> >>> 
>  That's a good point, and a bad behavior. IIRC, it results from the MPI 
>  Forum's adoption of the MPI-T requirement that stipulates we must allow 
>  access to all control and performance variables at startup so they can 
>  be externally seen/manipulated.
> >>> 
> >>> Minor nit: MPI_T does not require this.  However, it does recommend that 
> >>> you offer users access to as many variables as possible as early as 
> >>> reasonably possible for the convenience and control of the user.
> >>> 
> >>> If an implementation chooses to offer 5% of the possible 
> >>> control/performance variables to the user just before MPI_Finalize, 
> >>> that's still a valid MPI_T implementation.  But it may not be a very 
> >>> useful one...
> >> 
> >> The problem here is one of use vs startup performance. George is quite 
> >> correct with his concerns - this behavior would have been a serious 
> >> problem for RoadRunner, for example, where we had a small IO channel 
> >> feeding a lot of nodes. It will definitely become an issue at exascale 
> >> where IO bandwidth and memory will be at a premium.
> > 
> > My point was not that the performance concerns were unfounded.  Rather, I 
> > wanted to point out that the "load everything" behavior is not a hard 
> > requirement from the MPI standard, so we have room for different 
> > implementation choices/tradeoffs.
> 
> I understood - I was more just pointing out the potential performance issue 
> of load everything. However, Nathan has addressed it by pointing out that the 
> problem is my aged, fading memory.

So, I think what I can take from this discussion is to make the following 
changes to ompi_info:

 - Make --all without a --level option imply --level 9.

 - Allow the user to modify this behavior by specifying a level: ex --all 
--level 5 would print every variable up to level 5.

I will make these changes today and CMR them into 1.7.3 unless there are any 
objections.
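In other words, something along these lines (a sketch of the intended logic only, not 
the actual ompi_info source):

    /* Sketch of the proposed --all / --level interaction; names are made up. */
    static int effective_level(int have_all, int have_level, int requested_level)
    {
        if (have_level)
            return requested_level;   /* an explicit --level always wins      */
        if (have_all)
            return 9;                 /* --all alone now implies --level 9    */
        return 1;                     /* default: only level-1 (basic) params */
    }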

-Nathan


Re: [OMPI devel] ompi_info

2013-07-18 Thread George Bosilca
On Jul 18, 2013, at 17:07 , Nathan Hjelm  wrote:

> This was discussed in depth before the MCA rewrite came into the trunk. There 
> are only two cases where we load and register all the available components: 
> ompi_info, and MPI_T_init_thread(). The normal MPI case does not have this 
> behavior and instead loads only the requested components.

How is this part of the code validated? It might capitalize on some type of 
"trust". Unfortunately … I have no such notion.

I would rather take the path of "least astonishment": a __consistent__ 
behavior where we always abide by the configuration files (user level as well 
as system level). If you want to see every single parameter possibly available 
to you (based on your rights of course), temporarily remove the configuration 
file. Or we can provide a specific ompi_info option to ignore the configuration 
files, but not make this the default.

  George.




Re: [OMPI devel] KNEM + user-space hybrid for sm BTL

2013-07-18 Thread Aurélien Bouteiller

Le 18 juil. 2013 à 11:12, "Iliev, Hristo"  a écrit :

> Hello,
>  
> Could someone, who is more familiar with the architecture of the sm BTL, 
> comment on the technical feasibility of the following: is it possible to 
> easily extend the BTL (i.e. without having to rewrite it completely from 
> scratch) so as to be able to perform transfers using both KNEM (or other 
> kernel-assisted copying mechanism) for messages over a given size and the 
> normal user-space mechanism for smaller messages with the switch-over point 
> being a user-tunable parameter?
>  
> From what I’ve seen, both implementations have something in common, e.g. both 
> use FIFOs to communicate controlling information.
> The motivation behind this are our efforts to become greener by extracting 
> the best possible out of the box performance on our systems without having to 
> profile each and every user application that runs on them. We’ve already 
> determined that activating KNEM really benefits some collective operations on 
> big shared-memory systems, but the increased latency significantly slows down 
> small message transfers, which also hits the pipelined implementations.
>  


Hristo, 

The knem BTL currently available in the trunk does just this :) You can use 
either Knem or Linux CMA to accelerate interprocess transfers. You can use the 
following mca parameters to turn on knem mode: 

-mca btl_sm_use_knem 1

If my memory serves me well, anything under the eager limit is sent by regular 
double copy: 

-mca btl_sm_eager_limit 4096 (4096 is the default, so anything below one page is 
copy-in, copy-out). If I remember correctly, anything below 16k decreased 
performance. 



We also have a collective component leveraging knem capabilities. If you 
want more info about the details,
you can look at the following paper we published at IPDPS last year. It covers 
what we found to be the best cutoff values for using (or not) knem in several 
collective. 

Teng Ma, George Bosilca, Aurelien Bouteiller, Jack Dongarra, "HierKNEM: An 
Adaptive Framework for Kernel-Assisted and Topology-Aware Collective 
Communications on Many-core Clusters," Parallel and Distributed Processing 
Symposium, International, pp. 970-982, 2012 IEEE 26th International Parallel 
and Distributed Processing Symposium, 2012 

http://www.computer.org/csdl/proceedings/ipdps/2012/4675/00/4675a970-abs.html


Enjoy, 
Aurelien 



> sm’s code doesn’t seem to be very complex but still I’ve decided to ask first 
> before diving any deeper.
>  
> Kind regards,
> Hristo
> --
> Hristo Iliev, PhD – High Performance Computing Team
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










Re: [OMPI devel] ompi_info

2013-07-18 Thread Nathan Hjelm
On Thu, Jul 18, 2013 at 05:50:40PM +0200, George Bosilca wrote:
> On Jul 18, 2013, at 17:07 , Nathan Hjelm  wrote:
> 
> > This was discussed in depth before the MCA rewrite came into the trunk. 
> > There are only two cases where we load and register all the available 
> > components: ompi_info, and MPI_T_init_thread(). The normal MPI case does 
> > not have this behavior and instead loads only the requested components.
> 
> How is this part of the code validated? It might capitalize on some type of 
> "trust". Unfortunately ? I have no such notion.

The fact that ompi_mpi_init never calls ompi_info_register_params(), which is the 
only path that sets MCA_BASE_REGISTER_ALL when registering framework 
parameters. The register-all behavior has to be explicitly asked for.
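Schematically (illustrative names only, not the actual MCA base code):

    /* Only the ompi_info / MPI_T paths ask for "register everything";
     * a normal MPI_Init() registers just the components named by the
     * selection variable (config file, environment, or command line). */
    #define REGISTER_ALL 0x1

    static void register_framework(const char *selection, int flags)
    {
        if (flags & REGISTER_ALL) {
            /* open and register every component that can be found */
        } else {
            /* open and register only the components listed in selection,
             * e.g. "self,sm" */
        }
        (void)selection;
    }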

> I would rather take the path of the "least astonishment", a __consistent__ 
> behavior where we always abide to the configuration files (user level as well 
> as system level). If you want to see every single parameter possibly 
> available to you (based on your rights of course), temporary remove the 
> configuration file. Or we can provide a specific ompi_info option to ignore 
> the configuration files, but not make this the default.

In some ways this was the default behavior (if no file values were set). The 
current behavior was chosen to be consistent and reflect what I thought was the 
original intent. The old behavior would ignore component selection variables 
set in the environment (ompi_info actually unset them). So, if you set one of 
these variables in the environment (or the ompi_info command line) you would 1) 
still get all components in the framework, and 2) not see the variable as set 
even though it is in an actual run.

So, if I did:

export OMPI_MCA_btl=self,sm

or added --mca btl self,sm to the ompi_info command line

I would still see all the btls + this:

 MCA btl: parameter "btl" (current value: "", data source: 
default, level: 2 user/detail, type: string)
  Default selection set of components for the btl 
framework ( means use all components that can be found)

instead of:

 MCA btl: parameter "btl" (current value: "self,sm", data 
source: environment, level: 2 user/detail, type: string)
  Default selection set of components for the btl 
framework ( means use all components that can be found)

Very annoying!

That said, the register-all behavior is easy to control. If there is a 
consensus that we need another ompi_info option I am more than happy to add it. 
But then again, --all should mean all components, all frameworks, all levels. 

-Nathan


Re: [OMPI devel] ompi_info

2013-07-18 Thread Jeff Squyres (jsquyres)
On Jul 18, 2013, at 11:50 AM, George Bosilca  wrote:

> How is this part of the code validated? It might capitalize on some type of 
> "trust". Unfortunately … I have no such notion.

Not sure what you're asking here.

> I would rather take the path of the "least astonishment", a __consistent__ 
> behavior where we always abide to the configuration files (user level as well 
> as system level). If you want to see every single parameter possibly 
> available to you (based on your rights of course), temporary remove the 
> configuration file. Or we can provide a specific ompi_info option to ignore 
> the configuration files, but not make this the default.


I think MPI applications and ompi_info are different cases.

1. We've definitely had cases of user (and OMPI developer!) confusion over the 
years where people would run ompi_info and not see their favorite MCA component 
listed.  After a while, they figured out it was because they had an env 
variable/file limiting which components were used (e.g., 
OMPI_MCA_btl=sm,tcp,self would silently disable all other BTLs in ompi_info 
output).  This actually seems to be fairly counter-intuitive behavior, if you 
ask me -- it was done this way as an artifact of the old implementation 
architecture.

Personally, I think changing ompi_info's behavior to always listing all 
components is a good idea.  Is there a reason to be concerned about the memory 
footprint and IO traffic of running ompi_info?

What might be a useful addition, however, is in the above example (user has 
OMPI_MCA_btl=sm,tcp,self in their environment) to somehow mark all other BTL 
params as "inactive because of OMPI_MCA_BTL env variable value", or something 
like that.

*** If someone wants this behavior, please propose a specific way to mark it in 
the prettyprint and parsable ompi_info output.

2. MPI application behavior has not changed -- if you call MPI_Init, we open 
exactly the same frameworks/components that were opened before.  But if you're 
using a tool (i.e., call the MPI_T init function), then you pay an extra price 
(potentially more dlopens, more memory usage, etc.).  This is the same as it 
has always been for tools: tools cost something (memory, performance, whatever).

That being said, if you want a different behavior, please propose something 
specific (e.g., specific new MCA param + value(s) for specific behavior(s)).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] RFC: Change ompi_proc_t endpoint data lookup

2013-07-18 Thread Barrett, Brian W
What: Change the ompi_proc_t endpoint data lookup to be more flexible

Why: As collectives and one-sided components are using transports
directly, an old problem of endpoint tracking is resurfacing.  We need a
fix that doesn't suck.

When: Assuming there are no major objections, I'll start writing the code
next week...

More Info: 

Today, endpoint information is stored in one of two places on the
ompi_proc_t: proc_pml and proc_bml.  The proc_pml pointer is an opaque
structure having meaning only to the PML and the proc_bml pointer is an
opaque structure having meaning only to the BML.  CM, OB1, and BFO don't
use proc_pml, although the MTLs store their endpoint data on the proc_pml.
 R2 uses the proc_bml to hold an opaque data structure which holds all the
btl endpoint data.

The specific problem is the Portals 4 collective and one-sided components.
 They both need endpoint information for communication (obviously).
Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
knew what it looked like, and were ok.  Now the data they need is possibly
in the proc_pml or in the (opaque) proc_bml, which poses a problem.

Jeff and I talked about this and had a number of restrictions that seemed
to make sense for a solution:

  * Don't make ompi_proc_t bigger than absolutely necessary
  * Avoid adding extra indirection into the endpoint resolution path
  * Allow enough flexibility that IB or friends could use the same
mechanism
  * Don't break the BML / BTL interface (too much work)

What we came up with was a two pronged approach, depending on run-time
needs.

First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
would have a proc_endpoint[] array of fixed size.  The size of the array
would be determined at compile time based on compile-time registering of
endpoint slots.  At compile time, a #define with a component's slot would
be set, removing any extra indexing overhead over today's mechanism.  So
R2 would have a call in its configure.m4 like:

  OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)

And would then find its endpoint data with a call like:

  r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];

which (assuming modest compiler optimization) is instruction equivalent to:

  r2_endpoint = proc->proc_bml;
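Put together, the proposal amounts to something like this (a sketch; the real tag macros 
would be generated from each component's OMPI_REQUIRE_ENDPOINT_TAG call at configure time):

    /* Compile-time endpoint slots: one per registered tag, plus a trailing
     * slot reserved for dynamic lookups. */
    #define OMPI_ENDPOINT_TAG_BML_R2    0
    #define OMPI_ENDPOINT_DYNAMIC_SLOT  1
    #define OMPI_ENDPOINT_ARRAY_SIZE    2

    struct ompi_proc_t {
        /* ... existing fields (name, arch, hostname, convertor, ...) ... */
        void *proc_endpoint[OMPI_ENDPOINT_ARRAY_SIZE];
    };

    /* A constant index compiles down to the same load as proc->proc_bml did. */
    #define R2_ENDPOINT(proc) ((proc)->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2])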

To allow for dynamic indexing (something we haven't had to date), the last
entry in the array would be a pointer to an object like an
opal_pointer_array, but without the locking, and some allocation calls
during init.  Since the indexes never need to be used by a remote process,
there's no synchronization required in registering.  The dynamic indexing
could be turned off at configure time for space-conscious builds.  For
example, on our big systems, I disable dlopen support, so static
allocation of endpoint slots is good enough.
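The dynamic side, building on the sketch above (again illustrative; the table stands in 
for the lock-free opal_pointer_array-like object):

    #include <stddef.h>

    /* Dynamic tags are purely process-local, so a plain counter is enough and
     * no synchronization is needed when a component registers one. */
    static int next_dynamic_tag = 0;

    static int endpoint_register_dynamic(void)
    {
        return next_dynamic_tag++;
    }

    static void *endpoint_get_dynamic(struct ompi_proc_t *proc, int tag)
    {
        void **table = (void **)proc->proc_endpoint[OMPI_ENDPOINT_DYNAMIC_SLOT];
        return (NULL == table) ? NULL : table[tag];
    }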

In the average build, the only tag registered would be BML_R2.  If we lazy
allocate the pointer array element, that's two entries in the
proc_endpoint array, so the same size as today.  I was going to have the
CM stop using the endpoint and push that handling on the MTL.  Assuming
all MTLs but Portals shared the same tag (easy to do), there'd be an
8*nprocs increase in space used per process if an MTL was built, but if
you disabled R2, that disappears.

How does this solve my problem?  Rather than having Portals 4 use the MTL
tag, it would have its own tag, shared between the MTL, BTL, OSC, and
COLL components.  Since the chances of Portals 4 being built on a platform
with support for another MTL is almost zero, in most cases, the size of
the ompi_proc_t only increases by 8 bytes over today's setup.  Since most
Portals 4 builds will be on more static platforms, I can disable dynamic
indexing and be back at today's size, but with an easy way to deal with
endpoint data sharing between components of different frameworks.

So, to review our original goals:

  * ompi_proc_t will remain the same size on most platforms, increase by
8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
static systems (by disabling dynamic indexing and building only one of
either the MTLs or BMLs).
  * If you're using a pre-allocated tag, there's no extra indirection or
math, assuming basic compiler optimization.  There is a higher cost for
dynamic tags, but that's probably ok for us.
  * I think that IB could start registering a tag if it needed for sharing
QP information between frameworks, at the cost of an extra tag.  Probably
makes the most sense for the MXM case (assuming someone writes an MXM osc
component).
  * The PML interface would change slightly (remove about 5 lines of code
/ pml).  The MTL would have to change a bit to look at their own tag
instead of the proc_pml (fairly easy).  The R2 BML would need to change to
use proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2] instead of proc_bml, but that
shouldn't be hard.  The consumers of the BML (OB1, BFO, RDMA OSC, etc.)
would not have to change.

I know RFCs are usually sent after the code is written, bu

Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup

2013-07-18 Thread Jeff Squyres (jsquyres)
+1, but I helped come up with the idea.  :-)


On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W"  wrote:

> What: Change the ompi_proc_t endpoint data lookup to be more flexible
> 
> Why: As collectives and one-sided components are using transports
> directly, an old problem of endpoint tracking is resurfacing.  We need a
> fix that doesn't suck.
> 
> When: Assuming there are no major objections, I'll start writing the code
> next week...
> 
> More Info: 
> 
> Today, endpoint information is stored in one of two places on the
> ompi_proc_t: proc_pml and proc_bml.  The proc_pml pointer is an opaque
> structure having meaning only to the PML and the proc_bml pointer is an
> opaque structure having meaning only to the BML.  CM, OB1, and BFO don't
> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
> R2 uses the proc_bml to hold an opaque data structure which holds all the
> btl endpoint data.
> 
> The specific problem is the Portals 4 collective and one-sided components.
> They both need endpoint information for communication (obviously).
> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
> knew what it looked like, and were ok.  Now the data they need is possibly
> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
> 
> Jeff and I talked about this and had a number of restrictions that seemed
> to make sense for a solution:
> 
>  * Don't make ompi_proc_t bigger than absolutely necessary
>  * Avoid adding extra indirection into the endpoint resolution path
>  * Allow enough flexibility that IB or friends could use the same
> mechanism
>  * Don't break the BML / BTL interface (too much work)
> 
> What we came up with was a two pronged approach, depending on run-time
> needs.
> 
> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
> would have a proc_endpoint[] array of fixed size.  The size of the array
> would be determined at compile time based on compile-time registering of
> endpoint slots.  At compile time, a #define with a component's slot would
> be set, removing any extra indexing overhead over today's mechanism.  So
> R2 would have a call in it's configure.m4 like:
> 
>  OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
> 
> And would then find it's endpoint data with a call like:
> 
>  r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
> 
> which (assuming modest compiler optimization) is instruction equivalent to:
> 
>  r2_endpoint = proc->proc_bml;
> 
> To allow for dynamic indexing (something we haven't had to date), the last
> entry in the array would be a pointer to an object like an
> opal_pointer_array, but without the locking, and some allocation calls
> during init.  Since the indexes never need to be used by a remote process,
> there's no synchronization required in registering.  The dynamic indexing
> could be turned off at configure time for space-concious builds.  For
> example, on our big systems, I disable dlopen support, so static
> allocation of endpoint slots is good enough.
> 
> In the average build, the only tag registered would be BML_R2.  If we lazy
> allocate the pointer array element, that's two entries in the
> proc_endpoint array, so the same size as today.  I was going to have the
> CM stop using the endpoint and push that handling on the MTL.  Assuming
> all MTLs but Portals shared the same tag (easy to do), there'd be an
> 8*nprocs increase in space used per process if an MTL was built, but if
> you disabled R2, that disappears.
> 
> How does this solve my problem?  Rather than having Portals 4 use the MTL
> tag, it would have it's own tag, shared between the MTL, BTL, OSC, and
> COLL components.  Since the chances of Portals 4 being built on a platform
> with support for another MTL is almost zero, in most cases, the size of
> the ompi_proc_t only increases by 8 bytes over today's setup.  Since most
> Portals 4 builds will be on more static platforms, I can disable dynamic
> indexing and be back at today's size, but with an easy way to deal with
> endpoint data sharing between components of different frameworks.
> 
> So, to review our original goals:
> 
>  * ompi_proc_t will remain the same size on most platforms, increase by
> 8*nprocs bytes if an MTL is built, but can shrink by 8*nprocs bytes on
> static systems (by disabling dynamic indexing and building only one of
> either the MTLs or BMLs).
>  * If you're using a pre-allocated tag, there's no extra indirection or
> math, assuming basic compiler optimization.  There is a higher cost for
> dynamic tags, but that's probably ok for us.
>  * I think that IB could start registering a tag if it needed for sharing
> QP information between frameworks, at the cost of an extra tag.  Probably
> makes the most sense for the MXM case (assuming someone writes an MXM osc
> component).
>  * The PML interface would change slightly (remove about 5 lines of code
> / pml).  The MTL would have to change a bit to look at their own tag
> instead of the proc_pml

Re: [OMPI devel] RFC: Change ompi_proc_t endpoint data lookup

2013-07-18 Thread Ralph Castain
+1, though I do have a question.

We are looking at exascale requirements, and one of the big issues is memory 
footprint. We currently retrieve the endpoint info for every process in the 
job, plus all the procs in any communicator with which we do a connect/accept - 
even though we probably will only communicate with a small number of them. This 
wastes a lot of memory at scale.

As long as we are re-working the endpoint stuff, would it be a thought to go 
ahead and change how we handle the above? I'm looking to switch to a lazy 
definition approach where we compute endpoints for procs on first-message 
instead of during mpi_init, retrieving the endpoint info for that proc only at 
that time. So instead of storing all the endpoint info for every proc in each 
proc, each proc only would contain the info it requires for that application.
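Sketched against the proc_endpoint[] layout from Brian's RFC (illustrative; 
resolve_endpoint_from_modex is a hypothetical stand-in for whatever fetches the peer's 
info on demand):

    #include <stddef.h>

    /* Hypothetical helper: fetch/compute the peer's endpoint info on demand
     * (in practice this would consult the modex or the runtime). */
    static void *resolve_endpoint_from_modex(struct ompi_proc_t *proc, int tag)
    {
        (void)proc; (void)tag; return NULL;   /* stub */
    }

    static void *lazy_get_endpoint(struct ompi_proc_t *proc, int tag)
    {
        if (NULL == proc->proc_endpoint[tag]) {
            /* first message to this peer: resolve its endpoint only now */
            proc->proc_endpoint[tag] = resolve_endpoint_from_modex(proc, tag);
        }
        return proc->proc_endpoint[tag];
    }

The price is exactly the extra NULL check in the send path that Brian pushes back on in 
his reply.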

Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe 
changing it to a sparse array/list of some type, so we only create that storage 
for procs we actually communicate to.

If you'd prefer to discuss this as a separate issue, that's fine - just 
something we need to work on at some point in the next year or two.


On Jul 18, 2013, at 6:26 PM, "Jeff Squyres (jsquyres)"  
wrote:

> +1, but I helped come up with the idea.  :-)
> 
> 
> On Jul 18, 2013, at 5:32 PM, "Barrett, Brian W"  wrote:
> 
>> What: Change the ompi_proc_t endpoint data lookup to be more flexible
>> 
>> Why: As collectives and one-sided components are using transports
>> directly, an old problem of endpoint tracking is resurfacing.  We need a
>> fix that doesn't suck.
>> 
>> When: Assuming there are no major objections, I'll start writing the code
>> next week...
>> 
>> More Info: 
>> 
>> Today, endpoint information is stored in one of two places on the
>> ompi_proc_t: proc_pml and proc_bml.  The proc_pml pointer is an opaque
>> structure having meaning only to the PML and the proc_bml pointer is an
>> opaque structure having meaning only to the BML.  CM, OB1, and BFO don't
>> use proc_pml, although the MTLs store their endpoint data on the proc_pml.
>> R2 uses the proc_bml to hold an opaque data structure which holds all the
>> btl endpoint data.
>> 
>> The specific problem is the Portals 4 collective and one-sided components.
>> They both need endpoint information for communication (obviously).
>> Before there was a Portals 4 BTL, they peeked at the proc_pml pointer,
>> knew what it looked like, and were ok.  Now the data they need is possibly
>> in the proc_pml or in the (opaque) proc_bml, which poses a problem.
>> 
>> Jeff and I talked about this and had a number of restrictions that seemed
>> to make sense for a solution:
>> 
>> * Don't make ompi_proc_t bigger than absolutely necessary
>> * Avoid adding extra indirection into the endpoint resolution path
>> * Allow enough flexibility that IB or friends could use the same
>> mechanism
>> * Don't break the BML / BTL interface (too much work)
>> 
>> What we came up with was a two pronged approach, depending on run-time
>> needs.
>> 
>> First, rather than having the proc_pml and proc_bml on the ompi_proc_t, we
>> would have a proc_endpoint[] array of fixed size.  The size of the array
>> would be determined at compile time based on compile-time registering of
>> endpoint slots.  At compile time, a #define with a component's slot would
>> be set, removing any extra indexing overhead over today's mechanism.  So
>> R2 would have a call in it's configure.m4 like:
>> 
>> OMPI_REQUIRE_ENDPOINT_TAG(BML_R2)
>> 
>> And would then find it's endpoint data with a call like:
>> 
>> r2_endpoint = proc->proc_endpoint[OMPI_ENDPOINT_TAG_BML_R2];
>> 
>> which (assuming modest compiler optimization) is instruction equivalent to:
>> 
>> r2_endpoint = proc->proc_bml;
>> 
>> To allow for dynamic indexing (something we haven't had to date), the last
>> entry in the array would be a pointer to an object like an
>> opal_pointer_array, but without the locking, and some allocation calls
>> during init.  Since the indexes never need to be used by a remote process,
>> there's no synchronization required in registering.  The dynamic indexing
>> could be turned off at configure time for space-concious builds.  For
>> example, on our big systems, I disable dlopen support, so static
>> allocation of endpoint slots is good enough.
>> 
>> In the average build, the only tag registered would be BML_R2.  If we lazy
>> allocate the pointer array element, that's two entries in the
>> proc_endpoint array, so the same size as today.  I was going to have the
>> CM stop using the endpoint and push that handling on the MTL.  Assuming
>> all MTLs but Portals shared the same tag (easy to do), there'd be an
>> 8*nprocs increase in space used per process if an MTL was built, but if
>> you disabled R2, that disappears.
>> 
>> How does this solve my problem?  Rather than having Portals 4 use the MTL
>> tag, it would have it's own tag, shared between the MTL, BTL, OSC, and
>> COLL

Re: [OMPI devel] [EXTERNAL] Re: RFC: Change ompi_proc_t endpoint data lookup

2013-07-18 Thread Barrett, Brian W
On 7/18/13 7:39 PM, "Ralph Castain"  wrote:

We are looking at exascale requirements, and one of the big issues is memory 
footprint. We currently retrieve the endpoint info for every process in the 
job, plus all the procs in any communicator with which we do a connect/accept - 
even though we probably will only communicate with a small number of them. This 
wastes a lot of memory at scale.

As long as we are re-working the endpoint stuff, would it be a thought to go 
ahead and change how we handle the above? I'm looking to switch to a lazy 
definition approach where we compute endpoints for procs on first-message 
instead of during mpi_init, retrieving the endpoint info for that proc only at 
that time. So instead of storing all the endpoint info for every proc in each 
proc, each proc only would contain the info it requires for that application.

It depends on what you mean by endpoint information.  If you mean what I call 
endpoint information (the stuff the PML/MTL/BML stores on an ompi_proc_t), then 
I really don't care.  For Portals, the endpoint information is quite small 
(8-16 bytes, depending on addressing mode), so I'd rather pre-populate the 
array and not slow down the send path with yet another conditional than have to 
check for endpoint data.  Of course, given the Portals usage model, I'd really 
like to jam the endpoint data into shared memory at some point (not this 
patch).  If others want to figure out how to do lazy endpoint data setup for 
their network, I think that's reasonable.

Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe 
changing it to a sparse array/list of some type, so we only create that storage 
for procs we actually communicate to.

This would actually break a whole lot of things in OMPI and is a huge change.  
However, I still have plans to add a --enable-minimal-memory type option some 
day which will make the ompi_proc_t significantly smaller by assuming 
homogeneous convertors and that you can programmatically get a remote host name 
when needed.  Again, unless we need to get micro-small (and I don't think we 
do), the sparseness requires conditionals in the critical path that worry me.

If you'd prefer to discuss this as a separate issue, that's fine - just 
something we need to work on at some point in the next year or two.

I agree some work is needed, but I think it's orthogonal to this issue and is 
something we're going to need to study in detail.  There are a number of 
space/time tradeoffs in that path.  Which isn't a problem, but there's a whole 
lot of low hanging fruit before we get to the hard stuff.  Now if you want the 
OFED interfaces to run at exascale, well, buy lots of memory.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories