Re: [OMPI devel] Fix for XLC + libtool issue

2008-04-25 Thread Jeff Squyres

Good to hear that upgrading fixes this problem.

We actually already have an outstanding ticket to upgrade to 2.2.2 (https://svn.open-mpi.org/trac/ompi/ticket/1265). We were following the Libtool development process closely and waiting for at least 2.2.2 (to get past 2.2.0).


This will definitely happen before OMPI v1.3 ships.

Additionally, Ralf W. recommends that we also upgrade Autoconf to 2.62 or
later.  I've been loosely watching that process; 2.62 requires a newer
GNU m4, and we haven't yet decided whether we want to require that.


I don't know if this will be addressed before v1.3 or not.


On Apr 22, 2008, at 1:09 PM, Sérgio Durigan Júnior wrote:


Hi everybody,

Taking a look at your website, I could see an issue about compiling Open
MPI shared libs using IBM's XLC compiler + libtool:

http://www.open-mpi.org/faq/?category=building#build-ibm-compilers

Well, I think we have the solution for this: upgrading libtool to the
latest version seems to work well in this case. So, I'd like to ask you
to upgrade the libtool used by Open MPI to the latest version (2.2.2 or
later), and to change this FAQ entry to explain how users can manually
update libtool in older versions of Open MPI.

The instructions for upgrading libtool manually are:

1) Get the latest libtool package from
http://www.gnu.org/software/libtool

2) Untar, compile and install it

3) Have a fresh and clean Open MPI source tree

4) In that tree, run 'libtoolize --force'

5) Then run 'aclocal'

6) Finally, run 'autoreconf --force'

After that, you'll be able to build Open MPI without problems :-).

Thanks in advance,

--
Sérgio Durigan Júnior
Linux on Power Toolchain - Software Engineer
Linux Technology Center - LTC
IBM Brazil




--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] Fix for XLC + libtool issue

2008-04-25 Thread Ralf Wildenhues
Hi Jeff, all,

* Jeff Squyres wrote on Fri, Apr 25, 2008 at 12:35:12PM CEST:
> Good to hear that upgrading fixes this problem.
> 
> We actually already have an outstanding ticket to upgrade to 2.2.2
> (https://svn.open-mpi.org/trac/ompi/ticket/1265 ).  We were following
> the Libtool development process closely and  
> waiting for at least 2.2.2 (get past 2.2.0).

2.2.2 has been out since April 1st, and it has seen a number of fixes since.
We hope to do 2.2.4 soon, but of course if you try it out before then, any
remaining issues may be fixed before that release.

FWIW I've been building OMPI with development versions of autotools all
the time (but only tested on GNU/Linux/x86).

> Additionally, Ralf W. recommends that we also upgrade Autoconf to 2.62
> or later.  I've been loosely watching that process; 2.62 requires a
> newer GNU m4, and we haven't yet decided whether we want to require that.

Yes, you need GNU m4 1.4.5 or newer.  m4 1.4.11 and Autoconf 2.62 speed
up auto* and config.status run time, respectively.  For the latter, we
used OMPI as a test bed application; see the first set of timings in:


Cheers,
Ralf


Re: [OMPI devel] Fix for XLC + libtool issue

2008-04-25 Thread Jeff Squyres

On Apr 25, 2008, at 7:40 AM, Ralf Wildenhues wrote:


We actually already have an outstanding ticket to upgrade to 2.2.2
(https://svn.open-mpi.org/trac/ompi/ticket/1265 ).  We were following
the Libtool development process closely and
waiting for at least 2.2.2 (get past 2.2.0).


2.2.2 has been out since April 1st, and it has seen a number of fixes since.
We hope to do 2.2.4 soon, but of course if you try it out before then, any
remaining issues may be fixed before that release.


Sorry -- I didn't mean to imply that we hadn't noticed that 2.2.2 had  
been released.  I was trying to say that LT 2.2.2 was our gating  
factor and that has now been met.  We have an outstanding ticket to  
upgrade the automated process that builds the official OMPI tarballs  
(we have a strictly controlled process that runs in a specific  
environment to make official OMPI tarballs -- that's where the upgrade  
needs to occur); it just hasn't happened yet.  It's marked as a  
blocker for the OMPI v1.3 release.


I have been using 2.2.2 on my development cluster since shortly after  
it was released (I think other developers are, too).



Additionally, Ralf W. recommends that we also upgrade Autoconf to 2.62
or later.  I've been loosely watching that process; 2.62 requires a
newer GNU m4, and we haven't yet decided whether we want to require that.


Yes, you need GNU m4 1.4.5 or newer.  m4 1.4.11 and Autoconf 2.62 speed
up auto* and config.status run time, respectively.  For the latter, we
used OMPI as a test bed application; see the first set of timings in:




Wow -- those timings are impressive!  Quoting that URL (OMPI is [1]):

-
For example[1], in a large package with 871 substituted variables, of which 2*136 are produced by AM_CONDITIONAL, and roughly 210 Makefiles. './config.status' execution for those Makefiles (no headers, no depfiles):
- with Automake-1.9.6: 78.54user 9.32system 1:38.60elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+2551217minor)pagefaults 0swaps
- with Automake 1.10 (no superfluous $(*_TRUE)/$(*_FALSE) settings): 56.11user 8.31system 1:16.51elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+2284709minor)pagefaults 0swaps
- additionally with the Autoconf patch below: 11.24user 3.62system 0:21.89elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+935332minor)pagefaults 0swaps
-

Is the "with the Autoconf patch below" equivalent to AM 1.10 + AC 2.62?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Fix for XLC + libtool issue

2008-04-25 Thread Ralf Wildenhues
* Jeff Squyres wrote on Fri, Apr 25, 2008 at 01:54:39PM CEST:
> On Apr 25, 2008, at 7:40 AM, Ralf Wildenhues wrote:
> 
> > 
> Wow -- those timings are impressive!  Quoting that URL (OMPI is [1]):
> 
> -
> For example[1], in a large package with 871 substituted variables, of which 2*136 are produced by AM_CONDITIONAL, and roughly 210 Makefiles. './config.status' execution for those Makefiles (no headers, no depfiles):
> - with Automake-1.9.6: 78.54user 9.32system 1:38.60elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+2551217minor)pagefaults 0swaps
> - with Automake 1.10 (no superfluous $(*_TRUE)/$(*_FALSE) settings): 56.11user 8.31system 1:16.51elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+2284709minor)pagefaults 0swaps
> - additionally with the Autoconf patch below: 11.24user 3.62system 0:21.89elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+935332minor)pagefaults 0swaps
> -
> 
> Is the "with the Autoconf patch below" equivalent to AM 1.10 + AC 2.62?

Yes.  The patch from the message made it into Autoconf 2.62.  OMPI is a
poster child for hitting the quadratic overhead with Autoconf 2.59.

Cheers,
Ralf


Re: [OMPI devel] Loadbalancing

2008-04-25 Thread Jeff Squyres

Kewl!

I added ticket 1277 so that we are sure to document this for v1.3.


On Apr 23, 2008, at 11:09 AM, Ralph H Castain wrote:


I added a new "loadbalance" feature to OMPI today in r18252.

Brief summary: adding --loadbalance to the mpirun cmd line will cause the
round-robin mapper to balance your specified #procs across the available
nodes.

More detail:
Several users had noted that mapping byslot always caused us to
preferentially load the first nodes in an allocation, potentially leaving
other nodes unused. If they mapped bynode, of course, this wouldn't happen -
but then they were forced to a specific rank-to-node relationship.

What they wanted was to have the ranks numbered byslot, but to have the ppn
balanced across the entire allocation.

This is now supported via the --loadbalance cmd line option. Here is an
example of its effect (again, remember that loadbalance only impacts mapping
byslot):

         no-lb     lb      bynode
node0:   0,1,2,3   0,1,2   0,3,6
node1:   4,5,6     3,4     1,4
node2:             5,6     2,5


As you can see, the effect of --loadbalance is to balance the ppn across all
the available nodes while retaining byslot rank associations. In this case,
instead of leaving one node unused, we take advantage of all available
resources.

Hope this proves helpful
Ralph





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI svn] svn:open-mpi r18252

2008-04-25 Thread Tim Prins
This commit causes mpirun to segfault when running the IBM spawn tests 
on our slurm platforms (it may affect others as well). The failures only 
happen when mpirun is run in a batch script.


The backtrace I get is:
Program terminated with signal 11, Segmentation fault.
#0  0x002a969b9dbe in daemon_leader (jobid=2643591169, num_local_contributors=1,
    type=1 '\001', data=0x588c40, flag=1 '\001', participants=0x566e80)
    at grpcomm_basic_module.c:1196
1196        OBJ_RELEASE(collection);
(gdb) bt
#0  0x002a969b9dbe in daemon_leader (jobid=2643591169, num_local_contributors=1,
    type=1 '\001', data=0x588c40, flag=1 '\001', participants=0x566e80)
    at grpcomm_basic_module.c:1196
#1  0x002a969ba316 in daemon_collective (jobid=2643591169, num_local_contributors=1,
    type=1 '\001', data=0x588c40, flag=1 '\001', participants=0x566e80)
    at grpcomm_basic_module.c:1279
#2  0x002a956a94a9 in orte_odls_base_default_collect_data (proc=0x588eb8, buf=0x588ef0)
    at base/odls_base_default_fns.c:2183
#3  0x002a95692990 in process_commands (sender=0x588eb8, buffer=0x588ef0, tag=1)
    at orted/orted_comm.c:485
#4  0x002a956920a0 in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x588e90)
    at orted/orted_comm.c:271
#5  0x002a957fe4ca in event_process_active (base=0x50d940) at event.c:647
#6  0x002a957fea8b in opal_event_base_loop (base=0x50d940, flags=0) at event.c:819
#7  0x002a957fe6c5 in opal_event_loop (flags=0) at event.c:726
#8  0x002a957fe57e in opal_event_dispatch () at event.c:662
#9  0x0040335d in orterun (argc=5, argv=0x7fb008) at orterun.c:551
#10 0x00402bb3 in main (argc=5, argv=0x7fb008) at main.c:13
(gdb)


I ran with:
srun -N 3 -b mpirun -mca mpi_yield_when_idle 1 ~/ompi-tests/ibm/dynamic/spawn_multiple

Thanks,

Tim
r...@osl.iu.edu wrote:

Author: rhc
Date: 2008-04-23 10:52:09 EDT (Wed, 23 Apr 2008)
New Revision: 18252
URL: https://svn.open-mpi.org/trac/ompi/changeset/18252

Log:
Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list

Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

Text files modified:
   trunk/orte/mca/ras/base/ras_base_node.c              |  6 +++
   trunk/orte/mca/rmaps/base/base.h                      |  4 ++
   trunk/orte/mca/rmaps/base/rmaps_base_open.c           | 10 +++
   trunk/orte/mca/rmaps/base/rmaps_base_support_fns.c    | 55 +++
   trunk/orte/mca/rmaps/round_robin/rmaps_rr.c           | 50
   trunk/orte/tools/orterun/orterun.c                    |  3 ++
   6 files changed, 92 insertions(+), 36 deletions(-)


Modified: trunk/orte/mca/ras/base/ras_base_node.c
==============================================================================
--- trunk/orte/mca/ras/base/ras_base_node.c    (original)
+++ trunk/orte/mca/ras/base/ras_base_node.c    2008-04-23 10:52:09 EDT (Wed, 23 Apr 2008)
@@ -23,6 +23,7 @@
 
 #include "opal/util/output.h"
 #include "opal/util/argv.h"
+#include "opal/util/if.h"
 
 #include "orte/mca/errmgr/errmgr.h"
 #include "orte/util/name_fns.h"
@@ -111,7 +112,7 @@
      * first position since it is the first one entered. We need to check to see
      * if this node is the same as the HNP's node so we don't double-enter it
      */
-    if (0 == strcmp(node->name, hnp_node->name)) {
+    if (0 == strcmp(node->name, hnp_node->name) || opal_ifislocal(node->name)) {
         OPAL_OUTPUT_VERBOSE((5, orte_ras_base.ras_output,
                              "%s ras:base:node_insert updating HNP info to %ld slots",
                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
@@ -124,6 +125,9 @@
         hnp_node->slots_alloc = node->slots_alloc;
         hnp_node->slots_max = node->slots_max;
         hnp_node->launch_id = node->launch_id;
+        /* use the RM's name for the node */
+        free(hnp_node->name);
+        hnp_node->name = strdup(node->name);
         /* set the node to available for use */
         hnp_node->allocate = true;
         /* update the total slots in the job */

Modified: trunk/orte/mca/rmaps/base/base.h
==============================================================================
--- trunk/orte/mca/rmaps/base/base.h    (original)
+++ trunk/orte/mca/rmaps/base/base.h    2008-04-23 10:52:09 EDT (Wed, 23 Apr 2008)
@@ -57,10 +57,12 @@
     bool pernode;
     /** number of ppn for n_per_node mode */

Re: [OMPI devel] Fix for XLC + libtool issue

2008-04-25 Thread Sérgio Durigan Júnior
Hi Jeff,

On Fri, 2008-04-25 at 06:35 -0400, Jeff Squyres wrote:
> Good to hear that upgrading fixes this problem.
> 
> We actually already have an outstanding ticket to upgrade to 2.2.2 
> (https://svn.open-mpi.org/trac/ompi/ticket/1265 
> ).  We were following the Libtool development process closely and  
> waiting for at least 2.2.2 (get past 2.2.0).
> 
> This will definitely happen before OMPI v1.3 ships.

Thanks for the information. This upgrade will certainly bring a lot of
benefits.

If there are still any issues with the XLC compiler, please let me know :-).

Regards,

-- 
Sérgio Durigan Júnior
Linux on Power Toolchain - Software Engineer
Linux Technology Center - LTC
IBM Brazil



Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Ralph Castain
Hmmm...just to clarify, this wasn't a "bug". It was my understanding per the
MPI folks that a separate, unique port had to be created for every
invocation of Comm_accept. They didn't want a port hanging around open, and
their plan was to close the port immediately after the connection was
established.

So dpm_orte was written to that specification. When I reorganized the code,
I left the logic as it had been written - which was actually done by the MPI
side of the house, not me.

I have no problem with making the change. However, since the specification
was created on the MPI side, I just want to make sure that the MPI folks all
realize this has now been changed. Obviously, if this change in spec is
adopted, someone needs to make sure that the C and Fortran bindings -do not-
close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu"  wrote:

> Author: bouteill
> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
> New Revision: 18303
> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
> 
> Log:
> Fix a bug that prevented using the same port (as returned by Open_port) for
> several Comm_accept calls.
> 
> 
> Text files modified:
>trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
>1 files changed, 10 insertions(+), 9 deletions(-)
> 
> Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
> ==============================================================================
> --- trunk/ompi/mca/dpm/orte/dpm_orte.c    (original)
> +++ trunk/ompi/mca/dpm/orte/dpm_orte.c    2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
> @@ -848,8 +848,14 @@
>  {
>      char *tmp_string, *ptr;
>  
> +    /* copy the RML uri so we can return a malloc'd value
> +     * that can later be free'd
> +     */
> +    tmp_string = strdup(port_name);
> +
>      /* find the ':' demarking the RML tag we added to the end */
> -    if (NULL == (ptr = strrchr(port_name, ':'))) {
> +    if (NULL == (ptr = strrchr(tmp_string, ':'))) {
> +        free(tmp_string);
>          return NULL;
>      }
>  
> @@ -863,15 +869,10 @@
>      /* see if the length of the RML uri is too long - if so,
>       * truncate it
>       */
> -    if (strlen(port_name) > MPI_MAX_PORT_NAME) {
> -        port_name[MPI_MAX_PORT_NAME] = '\0';
> +    if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
> +        tmp_string[MPI_MAX_PORT_NAME] = '\0';
>      }
> -
> -    /* copy the RML uri so we can return a malloc'd value
> -     * that can later be free'd
> -     */
> -    tmp_string = strdup(port_name);
> -
> +
>      return tmp_string;
>  }
>  




Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Aurélien Bouteiller
Actually, the port was still left open forever before the change. The  
bug damaged the port string, and it was not usable anymore, not only  
in subsequent Comm_accept, but also in Close_port or Unpublish_name.


To answer your open-port concern more specifically: if the user does not
want to keep a port open, he should explicitly call MPI_Close_port and not
rely on MPI_Comm_accept to close it. Actually, the standard suggests the
exact opposite: section 5.4.2 states "it must call MPI_Open_port to
establish a port [...] it must call MPI_Comm_accept to accept connections
from clients". Because there are multiple clients AND multiple connections
in that sentence, I assume the port can be used in multiple accepts.
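
For illustration, here is a minimal sketch of the server-side pattern this
reading implies -- one Open_port, several accepts on the same port string,
and an explicit Close_port at the end (plain MPI C; the client count and
the missing error handling are purely illustrative, this is not OMPI code):

#include <stdio.h>
#include <mpi.h>

#define NCLIENTS 3   /* illustrative only */

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm clients[NCLIENTS];
    int i;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);     /* the port stays valid ...     */
    printf("server listening on: %s\n", port);

    for (i = 0; i < NCLIENTS; i++) {
        /* ... across several accepts, per the reading of section 5.4.2 */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients[i]);
    }

    for (i = 0; i < NCLIENTS; i++) {
        MPI_Comm_disconnect(&clients[i]);
    }
    MPI_Close_port(port);                   /* explicit close, not implicit */

    MPI_Finalize();
    return 0;
}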


Aurelien

On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:

Hmmm...just to clarify, this wasn't a "bug". It was my understanding  
per the

MPI folks that a separate, unique port had to be created for every
invocation of Comm_accept. They didn't want a port hanging around  
open, and

their plan was to close the port immediately after the connection was
established.

So dpm_orte was written to that specification. When I reorganized  
the code,
I left the logic as it had been written - which was actually done by  
the MPI

side of the house, not me.

I have no problem with making the change. However, since the  
specification
was created on the MPI side, I just want to make sure that the MPI  
folks all
realize this has now been changed. Obviously, if this change in spec  
is
adopted, someone needs to make sure that the C and Fortran bindings - 
do not-

close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu"  wrote:


Author: bouteill
Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
New Revision: 18303
URL: https://svn.open-mpi.org/trac/ompi/changeset/18303

Log:
Fix a bug that prevented using the same port (as returned by Open_port) for
several Comm_accept calls.


Text files modified:
  trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
  1 files changed, 10 insertions(+), 9 deletions(-)

Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
==============================================================================

--- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
+++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT  
(Fri, 25 Apr

2008)
@@ -848,8 +848,14 @@
{
char *tmp_string, *ptr;

+/* copy the RML uri so we can return a malloc'd value
+ * that can later be free'd
+ */
+tmp_string = strdup(port_name);
+
/* find the ':' demarking the RML tag we added to the end */
-if (NULL == (ptr = strrchr(port_name, ':'))) {
+if (NULL == (ptr = strrchr(tmp_string, ':'))) {
+free(tmp_string);
return NULL;
}

@@ -863,15 +869,10 @@
/* see if the length of the RML uri is too long - if so,
 * truncate it
 */
-if (strlen(port_name) > MPI_MAX_PORT_NAME) {
-port_name[MPI_MAX_PORT_NAME] = '\0';
+if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
+tmp_string[MPI_MAX_PORT_NAME] = '\0';
}
-
-/* copy the RML uri so we can return a malloc'd value
- * that can later be free'd
- */
-tmp_string = strdup(port_name);
-
+
return tmp_string;
}






Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Ralph Castain
As I said, it makes no difference to me. I just want to ensure that everyone
agrees on the interpretation of the MPI standard. We have had these
discussions in the past, with differing views. My guess here is that the port
was left open mostly because the person who wrote the C-binding forgot to
close it. ;-)

So, you MPI folks: do we allow multiple connections against a single port,
and leave the port open until explicitly closed? If so, then do we generate
an error if someone calls MPI_Finalize without first closing the port? Or do
we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is completed?

Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller"  wrote:

> Actually, the port was still left open forever before the change. The
> bug damaged the port string, and it was not usable anymore, not only
> in subsequent Comm_accept, but also in Close_port or Unpublish_name.
> 
> To more specifically answer to your open port concern, if the user
> does not want to have an open port anymore, he should specifically
> call MPI_Close_port and not rely on MPI_Comm_accept to close it.
> Actually the standard suggests the exact contrary: section 5.4.2
> states "it must call MPI_Open_port to establish a port [...] it must
> call MPI_Comm_accept to accept connections from clients". Because
> there is multiple clients AND multiple connections in that sentence, I
> assume the port can be used in multiple accepts.
> 
> Aurelien
> 
> Le 25 avr. 08 à 16:53, Ralph Castain a écrit :
> 
>> Hmmm...just to clarify, this wasn't a "bug". It was my understanding
>> per the
>> MPI folks that a separate, unique port had to be created for every
>> invocation of Comm_accept. They didn't want a port hanging around
>> open, and
>> their plan was to close the port immediately after the connection was
>> established.
>> 
>> So dpm_orte was written to that specification. When I reorganized
>> the code,
>> I left the logic as it had been written - which was actually done by
>> the MPI
>> side of the house, not me.
>> 
>> I have no problem with making the change. However, since the
>> specification
>> was created on the MPI side, I just want to make sure that the MPI
>> folks all
>> realize this has now been changed. Obviously, if this change in spec
>> is
>> adopted, someone needs to make sure that the C and Fortran bindings -
>> do not-
>> close that port any more!
>> 
>> Ralph
>> 
>> 
>> 
>> On 4/25/08 2:41 PM, "boute...@osl.iu.edu"  wrote:
>> 
>>> Author: bouteill
>>> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>> New Revision: 18303
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
>>> 
>>> Log:
>>> Fix a bug that prevented using the same port (as returned by
>>> Open_port) for several Comm_accept calls.
>>> 
>>> 
>>> Text files modified:
>>>   trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
>>>   1 files changed, 10 insertions(+), 9 deletions(-)
>>> 
>>> Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
>>> ==============================================================================
>>> --- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
>>> +++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT
>>> (Fri, 25 Apr
>>> 2008)
>>> @@ -848,8 +848,14 @@
>>> {
>>> char *tmp_string, *ptr;
>>> 
>>> +/* copy the RML uri so we can return a malloc'd value
>>> + * that can later be free'd
>>> + */
>>> +tmp_string = strdup(port_name);
>>> +
>>> /* find the ':' demarking the RML tag we added to the end */
>>> -if (NULL == (ptr = strrchr(port_name, ':'))) {
>>> +if (NULL == (ptr = strrchr(tmp_string, ':'))) {
>>> +free(tmp_string);
>>> return NULL;
>>> }
>>> 
>>> @@ -863,15 +869,10 @@
>>> /* see if the length of the RML uri is too long - if so,
>>>  * truncate it
>>>  */
>>> -if (strlen(port_name) > MPI_MAX_PORT_NAME) {
>>> -port_name[MPI_MAX_PORT_NAME] = '\0';
>>> +if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
>>> +tmp_string[MPI_MAX_PORT_NAME] = '\0';
>>> }
>>> -
>>> -/* copy the RML uri so we can return a malloc'd value
>>> - * that can later be free'd
>>> - */
>>> -tmp_string = strdup(port_name);
>>> -
>>> +
>>> return tmp_string;
>>> }
>>> 





Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread George Bosilca

Ralph,

Thanks for your concern regarding the level of compliance of our
implementation of the MPI standard. I don't know who the MPI gurus you
talked to about this issue were, but I can tell you that for once the MPI
standard is pretty clear about this.


As Aurelien stated in his last email, the use of the plural in several
sentences strongly suggests that the status of the port should not be
implicitly modified by MPI_Comm_accept or MPI_Comm_connect. Moreover, at
the beginning of the chapter, the MPI standard specifies that
connect/accept work exactly as in TCP. In other words, once the port is
opened it stays open until the user explicitly closes it.


However, not all corner cases are addressed by the MPI standard. What
happens on MPI_Finalize ... it's a good question. Personally, I think we
should stick with the TCP similarities. The port should be not only closed
but also unpublished. This will solve all issues with people trying to
look up a port once the originator is gone.


  george.

On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:

As I said, it makes no difference to me. I just want to ensure that  
everyone

agrees on the interpretation of the MPI standard. We have had these
discussion in the past, with differing views. My guess here is that  
the port
was left open mostly because the person who wrote the C-binding  
forgot to

close it. ;-)

So, you MPI folks: do we allow multiple connections against a single  
port,
and leave the port open until explicitly closed? If so, then do we  
generate
an error if someone calls MPI_Finalize without first closing the  
port? Or do

we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is  
completed?


Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller"   
wrote:



Actually, the port was still left open forever before the change. The
bug damaged the port string, and it was not usable anymore, not only
in subsequent Comm_accept, but also in Close_port or Unpublish_name.

To more specifically answer to your open port concern, if the user
does not want to have an open port anymore, he should specifically
call MPI_Close_port and not rely on MPI_Comm_accept to close it.
Actually the standard suggests the exact contrary: section 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". Because
there is multiple clients AND multiple connections in that  
sentence, I

assume the port can be used in multiple accepts.

Aurelien

On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:


Hmmm...just to clarify, this wasn't a "bug". It was my understanding
per the
MPI folks that a separate, unique port had to be created for every
invocation of Comm_accept. They didn't want a port hanging around
open, and
their plan was to close the port immediately after the connection  
was

established.

So dpm_orte was written to that specification. When I reorganized
the code,
I left the logic as it had been written - which was actually done by
the MPI
side of the house, not me.

I have no problem with making the change. However, since the
specification
was created on the MPI side, I just want to make sure that the MPI
folks all
realize this has now been changed. Obviously, if this change in spec
is
adopted, someone needs to make sure that the C and Fortran  
bindings -

do not-
close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu"   
wrote:



Author: bouteill
Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
New Revision: 18303
URL: https://svn.open-mpi.org/trac/ompi/changeset/18303

Log:
Fix a bug that prevented using the same port (as returned by
Open_port) for several Comm_accept calls.


Text files modified:
 trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
==============================================================================

--- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
+++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT
(Fri, 25 Apr
2008)
@@ -848,8 +848,14 @@
{
   char *tmp_string, *ptr;

+/* copy the RML uri so we can return a malloc'd value
+ * that can later be free'd
+ */
+tmp_string = strdup(port_name);
+
   /* find the ':' demarking the RML tag we added to the end */
-if (NULL == (ptr = strrchr(port_name, ':'))) {
+if (NULL == (ptr = strrchr(tmp_string, ':'))) {
+free(tmp_string);
   return NULL;
   }

@@ -863,15 +869,10 @@
   /* see if the length of the RML uri is too long - if so,
* truncate it
*/
-if (strlen(port_name) > MPI_MAX_PORT_NAME) {
-port_name[MPI_MAX_PORT_NAME] = '\0';
+if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
+tmp_string[MPI_MAX_PORT_NAME] = '\0';
   }
-
-/* copy the RML uri so we can return a malloc'd value
- * that can later be free'd
- */
-

Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Aurélien Bouteiller
To bounce on George's last remark: currently, when a job dies without
unsubscribing a port with Unpublish (due to poor user programming, failure,
or abort), ompi-server keeps the reference forever, and a new application
can therefore not publish under the same name again. So I guess this is a
good argument for cleaning up all published/opened ports correctly when the
application ends (for whatever reason).
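
For reference, a minimal sketch of the well-behaved pattern this asks for
(plain MPI C; the service name "my_service" is purely illustrative, and of
course this doesn't help when a job crashes before it reaches the cleanup):

#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("my_service", MPI_INFO_NULL, port);

    /* ... accept one or more connections on the same port ... */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    MPI_Comm_disconnect(&client);

    /* explicit cleanup: unpublish first, then close the port */
    MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
    MPI_Close_port(port);

    MPI_Finalize();
    return 0;
}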


Another cool feature could be to have mpirun behave as an ompi-server, and
publish a suitable URI if requested to do so (if the urifile does not exist
yet?). I know from the source code that mpirun already includes everything
needed to offer this feature, except the ability to provide a suitable URI.


  Aurelien

On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:


Ralph,

Thanks for your concern regarding the level of compliance of our  
implementation of the MPI standard. I don't know who were the MPI  
gurus you talked with about this issue, but I can tell that for once  
the MPI standard is pretty clear about this.


As stated by Aurelien in his last email, using the plural in several  
sentences, strongly suggest that the status of port should not be  
implicitly modified by MPI_Comm_accept or MPI_Comm_connect.  
Moreover, in the beginning of the chapter in the MPI standard, it is  
specified that comm/accept work exactly as in TCP. In other words,  
once the port is opened it stay open until the user explicitly close  
it.


However, not all corner cases are addressed by the MPI standard.  
What happens on MPI_Finalize ... it's a good question. Personally, I  
think we should stick with the TCP similarities. The port should be  
not only closed by unpublished. This will solve all issues with  
people trying to lookup a port once the originator is gone.


 george.

On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:

As I said, it makes no difference to me. I just want to ensure that  
everyone

agrees on the interpretation of the MPI standard. We have had these
discussion in the past, with differing views. My guess here is that  
the port
was left open mostly because the person who wrote the C-binding  
forgot to

close it. ;-)

So, you MPI folks: do we allow multiple connections against a  
single port,
and leave the port open until explicitly closed? If so, then do we  
generate
an error if someone calls MPI_Finalize without first closing the  
port? Or do

we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is  
completed?


Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller"   
wrote:


Actually, the port was still left open forever before the change.  
The

bug damaged the port string, and it was not usable anymore, not only
in subsequent Comm_accept, but also in Close_port or Unpublish_name.

To more specifically answer to your open port concern, if the user
does not want to have an open port anymore, he should specifically
call MPI_Close_port and not rely on MPI_Comm_accept to close it.
Actually the standard suggests the exact contrary: section 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". Because
there is multiple clients AND multiple connections in that  
sentence, I

assume the port can be used in multiple accepts.

Aurelien

On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:

Hmmm...just to clarify, this wasn't a "bug". It was my  
understanding

per the
MPI folks that a separate, unique port had to be created for every
invocation of Comm_accept. They didn't want a port hanging around
open, and
their plan was to close the port immediately after the connection  
was

established.

So dpm_orte was written to that specification. When I reorganized
the code,
I left the logic as it had been written - which was actually done  
by

the MPI
side of the house, not me.

I have no problem with making the change. However, since the
specification
was created on the MPI side, I just want to make sure that the MPI
folks all
realize this has now been changed. Obviously, if this change in  
spec

is
adopted, someone needs to make sure that the C and Fortran  
bindings -

do not-
close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu"   
wrote:



Author: bouteill
Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
New Revision: 18303
URL: https://svn.open-mpi.org/trac/ompi/changeset/18303

Log:
Fix a bug that prevented using the same port (as returned by
Open_port) for several Comm_accept calls.


Text files modified:
trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
1 files changed, 10 insertions(+), 9 deletions(-)

Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
==============================================================================

--- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
+++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT
(Fri, 25 Apr
2008)
@@ -848,8 +848,14 @@
{
 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Ralph Castain
That sounds fine with me, George.

Just to clarify: My comment about differing interpretations didn't pertain
to this specific question, but was more an observation of some discussions
we have had about such issues in other areas. I didn't talk to anyone about
this particular question, just noted that someone on the MPI side originally
wrote the code and probably had some interpretation of how it should work in
mind.

Or maybe not... :-)

Anyway, I can easily add some code in the DPM to ensure we close any still
open ports at finalize of that framework, if you feel that is the right
place to do it. Since people are -supposed- to call MPI_Close_port to close
the port, and since that code calls the DPM to execute that function, that
framework should have a clear picture of what is still open.
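
To make that idea concrete, here is a rough sketch of the kind of
bookkeeping involved (generic C only; these are not the real DPM symbols or
data structures, and dpm_open_port/dpm_close_port/dpm_finalize are
hypothetical names):

#include <stdlib.h>
#include <string.h>

/* remember every port we open, forget it on an explicit close, and close
 * whatever is left over at finalize */
struct port_entry {
    char *name;
    struct port_entry *next;
};

static struct port_entry *open_ports = NULL;

static void remember_port(const char *name)
{
    struct port_entry *e = malloc(sizeof(*e));
    e->name = strdup(name);
    e->next = open_ports;
    open_ports = e;
}

static void forget_port(const char *name)
{
    struct port_entry **p = &open_ports;
    while (NULL != *p) {
        if (0 == strcmp((*p)->name, name)) {
            struct port_entry *dead = *p;
            *p = dead->next;
            free(dead->name);
            free(dead);
            return;
        }
        p = &(*p)->next;
    }
}

/* called when a port is opened on the user's behalf */
void dpm_open_port(const char *name)
{
    /* ... actually open the port ... */
    remember_port(name);
}

/* called from MPI_Close_port */
void dpm_close_port(const char *name)
{
    /* ... actually close the port ... */
    forget_port(name);
}

/* called at finalize: close anything the user left open */
void dpm_finalize(void)
{
    while (NULL != open_ports) {
        dpm_close_port(open_ports->name);
    }
}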

Make sense?
Ralph



On 4/25/08 5:19 PM, "George Bosilca"  wrote:

> Ralph,
> 
> Thanks for your concern regarding the level of compliance of our
> implementation of the MPI standard. I don't know who were the MPI
> gurus you talked with about this issue, but I can tell that for once
> the MPI standard is pretty clear about this.
> 
> As stated by Aurelien in his last email, using the plural in several
> sentences, strongly suggest that the status of port should not be
> implicitly modified by MPI_Comm_accept or MPI_Comm_connect. Moreover,
> in the beginning of the chapter in the MPI standard, it is specified
> that comm/accept work exactly as in TCP. In other words, once the port
> is opened it stay open until the user explicitly close it.
> 
> However, not all corner cases are addressed by the MPI standard. What
> happens on MPI_Finalize ... it's a good question. Personally, I think
> we should stick with the TCP similarities. The port should be not only
> closed by unpublished. This will solve all issues with people trying
> to lookup a port once the originator is gone.
> 
>george.
> 
> On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
> 
>> As I said, it makes no difference to me. I just want to ensure that
>> everyone
>> agrees on the interpretation of the MPI standard. We have had these
>> discussion in the past, with differing views. My guess here is that
>> the port
>> was left open mostly because the person who wrote the C-binding
>> forgot to
>> close it. ;-)
>> 
>> So, you MPI folks: do we allow multiple connections against a single
>> port,
>> and leave the port open until explicitly closed? If so, then do we
>> generate
>> an error if someone calls MPI_Finalize without first closing the
>> port? Or do
>> we automatically close any open ports when finalize is called?
>> 
>> Or do we automatically close the port after the connect/accept is
>> completed?
>> 
>> Thanks
>> Ralph
>> 
>> 
>> 
>> On 4/25/08 3:13 PM, "Aurélien Bouteiller" 
>> wrote:
>> 
>>> Actually, the port was still left open forever before the change. The
>>> bug damaged the port string, and it was not usable anymore, not only
>>> in subsequent Comm_accept, but also in Close_port or Unpublish_name.
>>> 
>>> To more specifically answer to your open port concern, if the user
>>> does not want to have an open port anymore, he should specifically
>>> call MPI_Close_port and not rely on MPI_Comm_accept to close it.
>>> Actually the standard suggests the exact contrary: section 5.4.2
>>> states "it must call MPI_Open_port to establish a port [...] it must
>>> call MPI_Comm_accept to accept connections from clients". Because
>>> there is multiple clients AND multiple connections in that
>>> sentence, I
>>> assume the port can be used in multiple accepts.
>>> 
>>> Aurelien
>>> 
>>> On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:
>>> 
 Hmmm...just to clarify, this wasn't a "bug". It was my understanding
 per the
 MPI folks that a separate, unique port had to be created for every
 invocation of Comm_accept. They didn't want a port hanging around
 open, and
 their plan was to close the port immediately after the connection
 was
 established.
 
 So dpm_orte was written to that specification. When I reorganized
 the code,
 I left the logic as it had been written - which was actually done by
 the MPI
 side of the house, not me.
 
 I have no problem with making the change. However, since the
 specification
 was created on the MPI side, I just want to make sure that the MPI
 folks all
 realize this has now been changed. Obviously, if this change in spec
 is
 adopted, someone needs to make sure that the C and Fortran
 bindings -
 do not-
 close that port any more!
 
 Ralph
 
 
 
 On 4/25/08 2:41 PM, "boute...@osl.iu.edu" 
 wrote:
 
> Author: bouteill
> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
> New Revision: 18303
> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
> 
> Log:
> Fix a bug that rpevented to use the same port (as returned by
> Open_port) for
> several Comm_accept)
>>

Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Ralph Castain



On 4/25/08 5:38 PM, "Aurélien Bouteiller"  wrote:

> To bounce on last George remark, currently when a job dies without
> unsubscribing a port with Unpublish(due to poor user programming,
> failure or abort), ompi-server keeps the reference forever and a new
> application can therefore not publish under the same name again. So I
> guess this is a good point to cleanup correctly all published/opened
> ports, when the application is ended (for whatever reason).

That's a good point - in my other note, all I had addressed was closing my
local port. We should ensure that the pubsub framework does an unpublish of
anything we put out there. I'll have to create a command to do that since
pubsub doesn't actually track what it was asked to publish - we'll need
something that tells both local and global data servers to "unpublish
anything that came from me".

> 
> Another cool feature could be to have mpirun behave as an ompi-server,
> and publish a suitable URI if requested to do so (if the urifile does
> not exist yet ?). I know from the source code that mpirun is already
> including anything needed to offer this feature, exept the ability to
> provide a suitable URI.

Just to be sure I understand, since I think this is doable. Mpirun already
does serve as your "ompi-server" for any job it spawns - that is the purpose
of the MPI_Info flag "local" instead of "global" when you publish
information. You can always publish/lookup against your own mpirun.

What you are suggesting here is that we have each mpirun put its local data
server port info somewhere that another job can find it, either in the
already existing contact_info file, or perhaps in a separate "data server
uri" file?

The only reason for concern here is the obvious race condition. Since mpirun
only exists during the time a job is running, you could lookup its contact
info and attempt to publish/lookup to that mpirun, only to find it doesn't
respond because it either is already dead or on its way out. Hence the
notion of restricting inter-job operations to the system-level ompi-server.

If we can think of a way to deal with the race condition, I'm certainly
willing to publish the contact info. I'm just concerned that you may find
yourself "hung" if that mpirun goes away unexpectedly - say right in the
middle of a publish/lookup operation.

Ralph

> 
>Aurelien
> 
> On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:
> 
>> Ralph,
>> 
>> Thanks for your concern regarding the level of compliance of our
>> implementation of the MPI standard. I don't know who were the MPI
>> gurus you talked with about this issue, but I can tell that for once
>> the MPI standard is pretty clear about this.
>> 
>> As stated by Aurelien in his last email, using the plural in several
>> sentences, strongly suggest that the status of port should not be
>> implicitly modified by MPI_Comm_accept or MPI_Comm_connect.
>> Moreover, in the beginning of the chapter in the MPI standard, it is
>> specified that comm/accept work exactly as in TCP. In other words,
>> once the port is opened it stay open until the user explicitly close
>> it.
>> 
>> However, not all corner cases are addressed by the MPI standard.
>> What happens on MPI_Finalize ... it's a good question. Personally, I
>> think we should stick with the TCP similarities. The port should be
>> not only closed by unpublished. This will solve all issues with
>> people trying to lookup a port once the originator is gone.
>> 
>>  george.
>> 
>> On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
>> 
>>> As I said, it makes no difference to me. I just want to ensure that
>>> everyone
>>> agrees on the interpretation of the MPI standard. We have had these
>>> discussion in the past, with differing views. My guess here is that
>>> the port
>>> was left open mostly because the person who wrote the C-binding
>>> forgot to
>>> close it. ;-)
>>> 
>>> So, you MPI folks: do we allow multiple connections against a
>>> single port,
>>> and leave the port open until explicitly closed? If so, then do we
>>> generate
>>> an error if someone calls MPI_Finalize without first closing the
>>> port? Or do
>>> we automatically close any open ports when finalize is called?
>>> 
>>> Or do we automatically close the port after the connect/accept is
>>> completed?
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
>>> 
>>> On 4/25/08 3:13 PM, "Aurélien Bouteiller" 
>>> wrote:
>>> 
 Actually, the port was still left open forever before the change.
 The
 bug damaged the port string, and it was not usable anymore, not only
 in subsequent Comm_accept, but also in Close_port or Unpublish_name.
 
 To more specifically answer to your open port concern, if the user
 does not want to have an open port anymore, he should specifically
 call MPI_Close_port and not rely on MPI_Comm_accept to close it.
 Actually the standard suggests the exact contrary: section 5.4.2
 states "it must call MPI_Open_port to establish a port [...] it must
 call

Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread George Bosilca
We always have the possibility of failing the MPI_Comm_connect. There is a
specific error for this: MPI_ERR_PORT. We can detect that the port is not
available anymore (whatever the reason), simply by using the TCP timeout on
the connection. It's the best we can do, and it gives us a simplified way
of handling things ...
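
A minimal sketch of what that could look like from the client's point of
view (illustrative only; the errhandler setup and the argv-based port
string are assumptions for the example, not how this is handled
internally):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm server;
    int rc, eclass;

    MPI_Init(&argc, &argv);

    /* return errors to the caller instead of aborting */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    /* argv[1] is assumed to hold the port string for this sketch */
    rc = MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    if (MPI_SUCCESS != rc) {
        MPI_Error_class(rc, &eclass);
        if (MPI_ERR_PORT == eclass) {
            fprintf(stderr, "port is no longer available (originator gone?)\n");
        }
        MPI_Finalize();
        return 1;
    }

    /* ... talk to the server, then disconnect ... */
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}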


  george.

On Apr 25, 2008, at 7:52 PM, Ralph Castain wrote:

On 4/25/08 5:38 PM, "Aurélien Bouteiller"   
wrote:



To bounce on last George remark, currently when a job dies without
unsubscribing a port with Unpublish(due to poor user programming,
failure or abort), ompi-server keeps the reference forever and a new
application can therefore not publish under the same name again. So I
guess this is a good point to cleanup correctly all published/opened
ports, when the application is ended (for whatever reason).


That's a good point - in my other note, all I had addressed was  
closing my
local port. We should ensure that the pubsub framework does an  
unpublish of
anything we put out there. I'll have to create a command to do that  
since
pubsub doesn't actually track what it was asked to publish - we'll  
need

something that tells both local and global data servers to "unpublish
anything that came from me".



Another cool feature could be to have mpirun behave as an ompi- 
server,

and publish a suitable URI if requested to do so (if the urifile does
not exist yet ?). I know from the source code that mpirun is already
including anything needed to offer this feature, exept the ability to
provide a suitable URI.


Just to be sure I understand, since I think this is doable. Mpirun  
already
does serve as your "ompi-server" for any job it spawns - that is the  
purpose

of the MPI_Info flag "local" instead of "global" when you publish
information. You can always publish/lookup against your own mpirun.

What you are suggesting here is that we have each mpirun put its  
local data

server port info somewhere that another job can find it, either in the
already existing contact_info file, or perhaps in a separate "data  
server

uri" file?

The only reason for concern here is the obvious race condition.  
Since mpirun
only exists during the time a job is running, you could lookup its  
contact
info and attempt to publish/lookup to that mpirun, only to find it  
doesn't

respond because it either is already dead or on its way out. Hence the
notion of restricting inter-job operations to the system-level ompi- 
server.


If we can think of a way to deal with the race condition, I'm  
certainly
willing to publish the contact info. I'm just concerned that you may  
find
yourself "hung" if that mpirun goes away unexpectedly - say right in  
the

middle of a publish/lookup operation.

Ralph



  Aurelien

On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:


Ralph,

Thanks for your concern regarding the level of compliance of our
implementation of the MPI standard. I don't know who were the MPI
gurus you talked with about this issue, but I can tell that for once
the MPI standard is pretty clear about this.

As stated by Aurelien in his last email, using the plural in several
sentences, strongly suggest that the status of port should not be
implicitly modified by MPI_Comm_accept or MPI_Comm_connect.
Moreover, in the beginning of the chapter in the MPI standard, it is
specified that comm/accept work exactly as in TCP. In other words,
once the port is opened it stay open until the user explicitly close
it.

However, not all corner cases are addressed by the MPI standard.
What happens on MPI_Finalize ... it's a good question. Personally, I
think we should stick with the TCP similarities. The port should be
not only closed by unpublished. This will solve all issues with
people trying to lookup a port once the originator is gone.

george.

On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:


As I said, it makes no difference to me. I just want to ensure that
everyone
agrees on the interpretation of the MPI standard. We have had these
discussion in the past, with differing views. My guess here is that
the port
was left open mostly because the person who wrote the C-binding
forgot to
close it. ;-)

So, you MPI folks: do we allow multiple connections against a
single port,
and leave the port open until explicitly closed? If so, then do we
generate
an error if someone calls MPI_Finalize without first closing the
port? Or do
we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is
completed?

Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller" 
wrote:


Actually, the port was still left open forever before the change.
The
bug damaged the port string, and it was not usable anymore, not  
only
in subsequent Comm_accept, but also in Close_port or  
Unpublish_name.


To more specifically answer to your open port concern, if the user
does not want to have an open port anymore, he should specifically
call MPI_Close_port and not rely

Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Ralph Castain
True - and I'm all for simple!

Unless someone objects, let's just leave it that way for now.

I'll put it on my list to look at this later - maybe count how many publishes
we do vs unpublishes, and if there is a residual at finalize, then send the
"unpublish all" message. Still leaves a race condition, though, so the fail
on timeout is always going to have to be there anyway.

Will ponder and revisit later...

Thanks!
Ralph



On 4/25/08 7:26 PM, "George Bosilca"  wrote:

> We always have the possibility to fail the MPI_Comm_connect. There is
> a specific error for this MPI_ERR_PORT. We can detect that the port is
> not available anymore (whatever the reason is), by simply using the
> TCP timeout on the connection. It's the best we can, and this will
> give us a simplified way of handling things ...
> 
>george.
> 
> On Apr 25, 2008, at 7:52 PM, Ralph Castain wrote:
> 
>> On 4/25/08 5:38 PM, "Aurélien Bouteiller" 
>> wrote:
>> 
>>> To bounce on last George remark, currently when a job dies without
>>> unsubscribing a port with Unpublish(due to poor user programming,
>>> failure or abort), ompi-server keeps the reference forever and a new
>>> application can therefore not publish under the same name again. So I
>>> guess this is a good point to cleanup correctly all published/opened
>>> ports, when the application is ended (for whatever reason).
>> 
>> That's a good point - in my other note, all I had addressed was
>> closing my
>> local port. We should ensure that the pubsub framework does an
>> unpublish of
>> anything we put out there. I'll have to create a command to do that
>> since
>> pubsub doesn't actually track what it was asked to publish - we'll
>> need
>> something that tells both local and global data servers to "unpublish
>> anything that came from me".
>> 
>>> 
>>> Another cool feature could be to have mpirun behave as an ompi-
>>> server,
>>> and publish a suitable URI if requested to do so (if the urifile does
>>> not exist yet ?). I know from the source code that mpirun is already
>>> including anything needed to offer this feature, exept the ability to
>>> provide a suitable URI.
>> 
>> Just to be sure I understand, since I think this is doable. Mpirun
>> already
>> does serve as your "ompi-server" for any job it spawns - that is the
>> purpose
>> of the MPI_Info flag "local" instead of "global" when you publish
>> information. You can always publish/lookup against your own mpirun.
>> 
>> What you are suggesting here is that we have each mpirun put its
>> local data
>> server port info somewhere that another job can find it, either in the
>> already existing contact_info file, or perhaps in a separate "data
>> server
>> uri" file?
>> 
>> The only reason for concern here is the obvious race condition.
>> Since mpirun
>> only exists during the time a job is running, you could lookup its
>> contact
>> info and attempt to publish/lookup to that mpirun, only to find it
>> doesn't
>> respond because it either is already dead or on its way out. Hence the
>> notion of restricting inter-job operations to the system-level ompi-
>> server.
>> 
>> If we can think of a way to deal with the race condition, I'm
>> certainly
>> willing to publish the contact info. I'm just concerned that you may
>> find
>> yourself "hung" if that mpirun goes away unexpectedly - say right in
>> the
>> middle of a publish/lookup operation.
>> 
>> Ralph
>> 
>>> 
>>>   Aurelien
>>> 
>>> On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:
>>> 
 Ralph,
 
 Thanks for your concern regarding the level of compliance of our
 implementation of the MPI standard. I don't know who were the MPI
 gurus you talked with about this issue, but I can tell that for once
 the MPI standard is pretty clear about this.
 
 As stated by Aurelien in his last email, using the plural in several
 sentences, strongly suggest that the status of port should not be
 implicitly modified by MPI_Comm_accept or MPI_Comm_connect.
 Moreover, in the beginning of the chapter in the MPI standard, it is
 specified that comm/accept work exactly as in TCP. In other words,
 once the port is opened it stay open until the user explicitly close
 it.
 
 However, not all corner cases are addressed by the MPI standard.
 What happens on MPI_Finalize ... it's a good question. Personally, I
 think we should stick with the TCP similarities. The port should be
 not only closed by unpublished. This will solve all issues with
 people trying to lookup a port once the originator is gone.
 
 george.
 
 On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
 
> As I said, it makes no difference to me. I just want to ensure that
> everyone
> agrees on the interpretation of the MPI standard. We have had these
> discussion in the past, with differing views. My guess here is that
> the port
> was left open mostly because the person who wrote the C-binding
> forgot to
>