Re: [OMPI devel] init_thread + spawn error

2008-04-03 Thread Ralph Castain
I believe we have stated several times that we are not thread safe at this
time. You are welcome to try it, but you shouldn't be surprised when it fails.
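
For reference, a minimal standalone sketch (not from this thread) of checking
what thread level the library actually granted before relying on it; the
"provided" argument of MPI_Init_thread is what reports this:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int provided;

  /* Ask for MPI_THREAD_MULTIPLE, but use whatever level is reported. */
  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  if (provided < MPI_THREAD_MULTIPLE)
    {
      /* The library did not grant full thread support; restrict the
         application to the reported level instead of assuming it. */
      printf ("requested MPI_THREAD_MULTIPLE, provided level = %d\n",
              provided);
    }

  MPI_Finalize ();
  return 0;
}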

Ralph


On 4/3/08 4:18 PM, "Joao Vicente Lima"  wrote:

> Hi,
> I am getting an error when calling init_thread and comm_spawn in this code:
> 
> #include "mpi.h"
> #include <stdio.h>
> 
> int
> main (int argc, char *argv[])
> {
>   int provided;
>   MPI_Comm parentcomm, intercomm;
> 
>   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>   MPI_Comm_get_parent (&parentcomm);
> 
>   if (parentcomm == MPI_COMM_NULL)
> {
>   printf ("spawning ... \n");
>   MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
>  MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>   MPI_Comm_disconnect (&intercomm);
> }
>   else
>   {
> printf ("child!\n");
> MPI_Comm_disconnect (&parentcomm);
>   }
> 
>   MPI_Finalize ();
>   return 0;
> }
> 
> and the error is:
> 
> spawning ...
> opal_mutex_lock(): Resource deadlock avoided
> [localhost:18718] *** Process received signal ***
> [localhost:18718] Signal: Aborted (6)
> [localhost:18718] Signal code:  (-6)
> [localhost:18718] [ 0] /lib/libpthread.so.0 [0x2b6e5d9fced0]
> [localhost:18718] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b6e5dc3b3c5]
> [localhost:18718] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b6e5dc3c73e]
> [localhost:18718] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c9560ff]
> [localhost:18718] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c95601d]
> [localhost:18718] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c9560ac]
> [localhost:18718] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c956a93]
> [localhost:18718] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c9569dd]
> [localhost:18718] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b6e5c95797d]
> [localhost:18718] [ 9]
> /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
> [0x2b6e5c957dd9]
> [localhost:18718] [10]
> /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b6e607f05cf]
> [localhost:18718] [11]
> /usr/local/mpi/ompi-svn/lib/libmpi.so.0(MPI_Comm_spawn+0x459)
> [0x2b6e5c98ede9]
> [localhost:18718] [12] ./spawn1(main+0x7a) [0x400ae2]
> [localhost:18718] [13] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e5dc28b74]
> [localhost:18718] [14] ./spawn1 [0x4009d9]
> [localhost:18718] *** End of error message ***
> opal_mutex_lock(): Resource deadlock avoided
> [localhost:18719] *** Process received signal ***
> [localhost:18719] Signal: Aborted (6)
> [localhost:18719] Signal code:  (-6)
> [localhost:18719] [ 0] /lib/libpthread.so.0 [0x2b9317a17ed0]
> [localhost:18719] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b9317c563c5]
> [localhost:18719] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b9317c5773e]
> [localhost:18719] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b93169710ff]
> [localhost:18719] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b931697101d]
> [localhost:18719] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b93169710ac]
> [localhost:18719] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b9316971a93]
> [localhost:18719] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b93169719dd]
> [localhost:18719] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b931697297d]
> [localhost:18719] [ 9]
> /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
> [0x2b9316972dd9]
> [localhost:18719] [10]
> /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80b5cf]
> [localhost:18719] [11]
> /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80dad7]
> [localhost:18719] [12] /usr/local/mpi/ompi-svn/lib/libmpi.so.0
> [0x2b9316977207]
> [localhost:18719] [13]
> /usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Init_thread+0x166)
> [0x2b93169b8622]
> [localhost:18719] [14] ./spawn1(main+0x25) [0x400a8d]
> [localhost:18719] [15] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b9317c43b74]
> [localhost:18719] [16] ./spawn1 [0x4009d9]
> [localhost:18719] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 18719 on node localhost
> exited on signal 6 (Aborted).
> --
> 
> If I change MPI_Init_thread to MPI_Init, everything works.
> Any suggestions?
> The attachments contain my ompi_info (r18077) and config.log.
> 
> thanks in advance,
> Joao.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] init_thread + spawn error

2008-04-03 Thread Joao Vicente Lima
Hi,
I am getting an error when calling init_thread and comm_spawn in this code:

#include "mpi.h"
#include <stdio.h>

int
main (int argc, char *argv[])
{
  int provided;
  MPI_Comm parentcomm, intercomm;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_get_parent (&parentcomm);

  if (parentcomm == MPI_COMM_NULL)
{
  printf ("spawning ... \n");
  MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
  MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, 
MPI_ERRCODES_IGNORE);
  MPI_Comm_disconnect (&intercomm);
}
  else
  {
printf ("child!\n");
MPI_Comm_disconnect (&parentcomm);
  }

  MPI_Finalize ();
  return 0;
}

and the error is:

spawning ...
opal_mutex_lock(): Resource deadlock avoided
[localhost:18718] *** Process received signal ***
[localhost:18718] Signal: Aborted (6)
[localhost:18718] Signal code:  (-6)
[localhost:18718] [ 0] /lib/libpthread.so.0 [0x2b6e5d9fced0]
[localhost:18718] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b6e5dc3b3c5]
[localhost:18718] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b6e5dc3c73e]
[localhost:18718] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ff]
[localhost:18718] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95601d]
[localhost:18718] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ac]
[localhost:18718] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c956a93]
[localhost:18718] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9569dd]
[localhost:18718] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95797d]
[localhost:18718] [ 9]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
[0x2b6e5c957dd9]
[localhost:18718] [10]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b6e607f05cf]
[localhost:18718] [11]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(MPI_Comm_spawn+0x459)
[0x2b6e5c98ede9]
[localhost:18718] [12] ./spawn1(main+0x7a) [0x400ae2]
[localhost:18718] [13] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e5dc28b74]
[localhost:18718] [14] ./spawn1 [0x4009d9]
[localhost:18718] *** End of error message ***
opal_mutex_lock(): Resource deadlock avoided
[localhost:18719] *** Process received signal ***
[localhost:18719] Signal: Aborted (6)
[localhost:18719] Signal code:  (-6)
[localhost:18719] [ 0] /lib/libpthread.so.0 [0x2b9317a17ed0]
[localhost:18719] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b9317c563c5]
[localhost:18719] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b9317c5773e]
[localhost:18719] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ff]
[localhost:18719] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697101d]
[localhost:18719] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ac]
[localhost:18719] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316971a93]
[localhost:18719] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169719dd]
[localhost:18719] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697297d]
[localhost:18719] [ 9]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec)
[0x2b9316972dd9]
[localhost:18719] [10]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80b5cf]
[localhost:18719] [11]
/usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80dad7]
[localhost:18719] [12] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316977207]
[localhost:18719] [13]
/usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Init_thread+0x166)
[0x2b93169b8622]
[localhost:18719] [14] ./spawn1(main+0x25) [0x400a8d]
[localhost:18719] [15] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b9317c43b74]
[localhost:18719] [16] ./spawn1 [0x4009d9]
[localhost:18719] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 18719 on node localhost
exited on signal 6 (Aborted).
--

If I change MPI_Init_thread to MPI_Init, everything works.
Any suggestions?
The attachments contain my ompi_info (r18077) and config.log.

thanks in advance,
Joao.


config.log.gz
Description: GNU Zip compressed data


ompi_info.txt.gz
Description: GNU Zip compressed data


Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Ralph Castain
Take a gander at ompi/tools/ompi-server - I believe I put a man page in
there. You might just try "man ompi-server" and see if it shows up.

Holler if you have a question - not sure I documented it very thoroughly at
the time.


On 4/3/08 3:10 PM, "Aurélien Bouteiller"  wrote:

> Ralph,
> 
> 
> I am using trunk. Is there a documentation for ompi-server ? Sounds
> exactly like what I need to fix point 1.
> 
> Aurelien
> 
>> On Apr 3, 2008, at 5:06 PM, Ralph Castain wrote:
>> I guess I'll have to ask the basic question: what version are you
>> using?
>> 
>> If you are talking about the trunk, there no longer is a "universe"
>> concept
>> anywhere in the code. Two mpiruns can connect/accept to each other
>> as long
>> as they can make contact. To facilitate that, we created an "ompi-
>> server"
>> tool that is supposed to be run by the sys-admin (or a user, doesn't
>> matter
>> which) on the head node - there are various ways to tell mpirun how to
>> contact the server, or it can self-discover it.
>> 
>> I have tested publish/lookup pretty thoroughly and it seems to work. I
>> haven't spent much time testing connect/accept except via
>> comm_spawn, which
>> seems to be working. Since that uses the same mechanism, I would have
>> expected connect/accept to work as well.
>> 
>> If you are talking about 1.2.x, then the story is totally different.
>> 
>> Ralph
>> 
>> 
>> 
>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" 
>> wrote:
>> 
>>> Hi everyone,
>>> 
>>> I'm trying to figure out how complete the implementation of
>>> Comm_connect/Accept is. I found two problematic cases.
>>> 
>>> 1) Two different programs are started by two different mpiruns. One
>>> calls accept, the second one calls connect. I would not expect
>>> MPI_Publish_name/Lookup_name to work because they do not share the
>>> HNP. Still, I would expect to be able to connect by copying (with
>>> printf-scanf) the port_name string generated by Open_port, especially
>>> considering that in Open MPI the port_name is a string containing the
>>> tcp address and port of rank 0 in the server communicator.
>>> However, doing so results in "no route to host" and the connecting
>>> application aborts. Is the problem related to an explicit check of the
>>> universes on the accept HNP? Do I expect too much from the MPI
>>> standard? Is it because my two applications do not share the same
>>> universe? Should we (re)add the ability to use the same universe for
>>> several mpiruns?
>>> 
>>> 2) The second issue is when the program sets up a port and then accepts
>>> multiple clients on this port. Everything works fine for the first
>>> client, and then accept stalls forever when waiting for the second
>>> one. My understanding of the standard is that it should work: 5.4.2
>>> states "it must call MPI_Open_port to establish a port [...] it must
>>> call MPI_Comm_accept to accept connections from clients". I understand
>>> that with one MPI_Open_port I should be able to manage several MPI
>>> clients. Am I understanding the standard correctly here, and should we
>>> fix this?
>>> 
>>> Here is a copy of the non-working code for reference.
>>> 
>>> /*
>>>  * Copyright (c) 2004-2007 The Trustees of the University of
>>> Tennessee.
>>>  * All rights reserved.
>>>  * $COPYRIGHT$
>>>  *
>>>  * Additional copyrights may follow
>>>  *
>>>  * $HEADER$
>>>  */
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>> 
>>> int main(int argc, char *argv[])
>>> {
>>> char port[MPI_MAX_PORT_NAME];
>>> int rank;
>>> int np;
>>> 
>>> 
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &np);
>>> 
>>> if(rank)
>>> {
>>> MPI_Comm comm;
>>> /* client */
>>> MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>> MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>> printf("Read port: %s\n", port);
>>> MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>> &comm);
>>> 
>>> MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>> MPI_Comm_disconnect(&comm);
>>> }
>>> else
>>> {
>>> int nc = np - 1;
>>> MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc,
>>> sizeof(MPI_Comm));
>>> MPI_Request *reqs = (MPI_Request *) calloc(nc,
>>> sizeof(MPI_Request));
>>> int *event = (int *) calloc(nc, sizeof(int));
>>> int i;
>>> 
>>> MPI_Open_port(MPI_INFO_NULL, port);
>>> /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>>> printf("Port name: %s\n", port);
>>> for(i = 1; i < np; i++)
>>> MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>> MPI_COMM_WORLD);
>>> 
>>> for(i = 0; i < nc; i++)
>>> {
>>> MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>> &comm_nodes[i]);
>>> printf("Accept %d\n", i);
>>> MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>> &reqs[i]);
>>> printf("IRecv %d\n", 

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Aurélien Bouteiller

Ralph,


I am using trunk. Is there a documentation for ompi-server ? Sounds  
exactly like what I need to fix point 1.


Aurelien

On Apr 3, 2008, at 5:06 PM, Ralph Castain wrote:
I guess I'll have to ask the basic question: what version are you
using?

If you are talking about the trunk, there no longer is a "universe"
concept anywhere in the code. Two mpiruns can connect/accept to each
other as long as they can make contact. To facilitate that, we created
an "ompi-server" tool that is supposed to be run by the sys-admin (or a
user, doesn't matter which) on the head node - there are various ways
to tell mpirun how to contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to work. I
haven't spent much time testing connect/accept except via comm_spawn,
which seems to be working. Since that uses the same mechanism, I would
have expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph



On 4/3/08 2:29 PM, "Aurélien Bouteiller"   
wrote:



Hi everyone,

I'm trying to figure out how complete the implementation of
Comm_connect/Accept is. I found two problematic cases.

1) Two different programs are started by two different mpiruns. One
calls accept, the second one calls connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still, I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port, especially
considering that in Open MPI the port_name is a string containing the
tcp address and port of rank 0 in the server communicator.
However, doing so results in "no route to host" and the connecting
application aborts. Is the problem related to an explicit check of the
universes on the accept HNP? Do I expect too much from the MPI
standard? Is it because my two applications do not share the same
universe? Should we (re)add the ability to use the same universe for
several mpiruns?

2) The second issue is when the program sets up a port and then accepts
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever when waiting for the second
one. My understanding of the standard is that it should work: 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". I understand
that with one MPI_Open_port I should be able to manage several MPI
clients. Am I understanding the standard correctly here, and should we
fix this?

Here is a copy of the non-working code for reference.

/*
 * Copyright (c) 2004-2007 The Trustees of the University of  
Tennessee.

 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
char port[MPI_MAX_PORT_NAME];
int rank;
int np;


MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &np);

if(rank)
{
MPI_Comm comm;
/* client */
MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Read port: %s\n", port);
MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,  
&comm);


MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
MPI_Comm_disconnect(&comm);
}
else
{
int nc = np - 1;
MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc,
sizeof(MPI_Comm));
MPI_Request *reqs = (MPI_Request *) calloc(nc,
sizeof(MPI_Request));
int *event = (int *) calloc(nc, sizeof(int));
int i;

MPI_Open_port(MPI_INFO_NULL, port);
/*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
printf("Port name: %s\n", port);
for(i = 1; i < np; i++)
MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
MPI_COMM_WORLD);

for(i = 0; i < nc; i++)
{
MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
&comm_nodes[i]);
printf("Accept %d\n", i);
MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
&reqs[i]);
printf("IRecv %d\n", i);
}
MPI_Close_port(port);
MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
for(i = 0; i < nc; i++)
{
printf("event[%d] = %d\n", i, event[i]);
MPI_Comm_disconnect(&comm_nodes[i]);
printf("Disconnect %d\n", i);
}
}

MPI_Finalize();
return EXIT_SUCCESS;
}




--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Ralph Castain
I guess I'll have to ask the basic question: what version are you using?

If you are talking about the trunk, there no longer is a "universe" concept
anywhere in the code. Two mpiruns can connect/accept to each other as long
as they can make contact. To facilitate that, we created an "ompi-server"
tool that is supposed to be run by the sys-admin (or a user, doesn't matter
which) on the head node - there are various ways to tell mpirun how to
contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to work. I
haven't spent much time testing connect/accept except via comm_spawn, which
seems to be working. Since that uses the same mechanism, I would have
expected connect/accept to work as well.
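
For illustration, here is a minimal sketch (not from this thread; the service
name "test_service_el" is borrowed from Aurelien's example and the
server/client role split via argv is made up) of the publish/lookup path
mentioned above. Note that with two separate mpiruns the lookup can only
succeed if both jobs can reach a common name server, such as the ompi-server
described earlier:

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        /* Server side: open a port and publish it under a service name. */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("test_service_el", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        /* Client side: resolve the service name instead of copying the
           port string by hand. */
        MPI_Lookup_name("test_service_el", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}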

If you are talking about 1.2.x, then the story is totally different.

Ralph



On 4/3/08 2:29 PM, "Aurélien Bouteiller"  wrote:

> Hi everyone,
> 
> I'm trying to figure out how complete the implementation of
> Comm_connect/Accept is. I found two problematic cases.
> 
> 1) Two different programs are started by two different mpiruns. One
> calls accept, the second one calls connect. I would not expect
> MPI_Publish_name/Lookup_name to work because they do not share the
> HNP. Still, I would expect to be able to connect by copying (with
> printf-scanf) the port_name string generated by Open_port, especially
> considering that in Open MPI the port_name is a string containing the
> tcp address and port of rank 0 in the server communicator.
> However, doing so results in "no route to host" and the connecting
> application aborts. Is the problem related to an explicit check of the
> universes on the accept HNP? Do I expect too much from the MPI
> standard? Is it because my two applications do not share the same
> universe? Should we (re)add the ability to use the same universe for
> several mpiruns?
> 
> 2) The second issue is when the program sets up a port and then accepts
> multiple clients on this port. Everything works fine for the first
> client, and then accept stalls forever when waiting for the second
> one. My understanding of the standard is that it should work: 5.4.2
> states "it must call MPI_Open_port to establish a port [...] it must
> call MPI_Comm_accept to accept connections from clients". I understand
> that with one MPI_Open_port I should be able to manage several MPI
> clients. Am I understanding the standard correctly here, and should we
> fix this?
> 
> Here is a copy of the non-working code for reference.
> 
> /*
>   * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>   * All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
>   *
>   * $HEADER$
>   */
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[])
> {
>  char port[MPI_MAX_PORT_NAME];
>  int rank;
>  int np;
> 
> 
>  MPI_Init(&argc, &argv);
>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>  MPI_Comm_size(MPI_COMM_WORLD, &np);
> 
>  if(rank)
>  {
>  MPI_Comm comm;
>  /* client */
>  MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
> MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>  printf("Read port: %s\n", port);
>  MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
> 
>  MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>  MPI_Comm_disconnect(&comm);
>  }
>  else
>  {
>  int nc = np - 1;
>  MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc,
> sizeof(MPI_Comm));
>  MPI_Request *reqs = (MPI_Request *) calloc(nc,
> sizeof(MPI_Request));
>  int *event = (int *) calloc(nc, sizeof(int));
>  int i;
> 
>  MPI_Open_port(MPI_INFO_NULL, port);
> /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>  printf("Port name: %s\n", port);
>  for(i = 1; i < np; i++)
>  MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
> MPI_COMM_WORLD);
> 
>  for(i = 0; i < nc; i++)
>  {
>  MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
> &comm_nodes[i]);
>  printf("Accept %d\n", i);
>  MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
> &reqs[i]);
>  printf("IRecv %d\n", i);
>  }
>  MPI_Close_port(port);
>  MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>  for(i = 0; i < nc; i++)
>  {
>  printf("event[%d] = %d\n", i, event[i]);
>  MPI_Comm_disconnect(&comm_nodes[i]);
>  printf("Disconnect %d\n", i);
>  }
>  }
> 
>  MPI_Finalize();
>  return EXIT_SUCCESS;
> }
> 
> 
> 
> 
> --
> * Dr. Aurélien Bouteiller
> * Sr. Research Associate at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321
> 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-

[OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Aurélien Bouteiller

Hi everyone,

I'm trying to figure out how complete the implementation of
Comm_connect/Accept is. I found two problematic cases.


1) Two different programs are started by two different mpiruns. One
calls accept, the second one calls connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still, I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port, especially
considering that in Open MPI the port_name is a string containing the
tcp address and port of rank 0 in the server communicator.
However, doing so results in "no route to host" and the connecting
application aborts. Is the problem related to an explicit check of the
universes on the accept HNP? Do I expect too much from the MPI
standard? Is it because my two applications do not share the same
universe? Should we (re)add the ability to use the same universe for
several mpiruns?


2) The second issue is when the program sets up a port and then accepts
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever when waiting for the second
one. My understanding of the standard is that it should work: 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". I understand
that with one MPI_Open_port I should be able to manage several MPI
clients. Am I understanding the standard correctly here, and should we
fix this?


Here is a copy of the non-working code for reference.

/*
 * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
char port[MPI_MAX_PORT_NAME];
int rank;
int np;


MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &np);

if(rank)
{
MPI_Comm comm;
/* client */
MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,  
MPI_COMM_WORLD, MPI_STATUS_IGNORE);

printf("Read port: %s\n", port);
MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);

MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
MPI_Comm_disconnect(&comm);
}
else
{
int nc = np - 1;
MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc,  
sizeof(MPI_Comm));
MPI_Request *reqs = (MPI_Request *) calloc(nc,  
sizeof(MPI_Request));

int *event = (int *) calloc(nc, sizeof(int));
int i;

MPI_Open_port(MPI_INFO_NULL, port);
/*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
printf("Port name: %s\n", port);
for(i = 1; i < np; i++)
MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,  
MPI_COMM_WORLD);


for(i = 0; i < nc; i++)
{
MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,  
&comm_nodes[i]);

printf("Accept %d\n", i);
MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],  
&reqs[i]);

printf("IRecv %d\n", i);
}
MPI_Close_port(port);
MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
for(i = 0; i < nc; i++)
{
printf("event[%d] = %d\n", i, event[i]);
MPI_Comm_disconnect(&comm_nodes[i]);
printf("Disconnect %d\n", i);
}
}

MPI_Finalize();
return EXIT_SUCCESS;
}




--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321







Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 11:16 AM, Jeff Squyres wrote:


The size of the openib modex is explained in btl_openib_component.c in
the branch.  It's a packed message now; we don't just blindly copy an
entire struct.  Here's the comment:

/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the
message
 * 2. for each module:
 *a. the common module data
 *b. a uint8_t indicating how many CPCs follow
 *c. for each CPC:
 *   a. a uint8_t indicating the index of the CPC in the all[]
 *  array in btl_openib_connect_base.c
 *   b. a uint8_t indicating the priority of this CPC
 *   c. a uint8_t indicating the length of the blob to follow
 *   d. a blob that is only meaningful to that CPC
 */

The common module data is what I sent in the other message.
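
As a standalone sketch only (this is not the btl_openib packing code; the
"common module data" here is a made-up 13-byte record of subnet_id, lid,
apm_lid, and an encoded MTU, packed in host byte order for simplicity), the
layout described in that comment could be built like this:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static size_t put_u8(uint8_t *buf, size_t off, uint8_t v)
{ buf[off] = v; return off + 1; }

static size_t put_bytes(uint8_t *buf, size_t off, const void *p, size_t len)
{ memcpy(buf + off, p, len); return off + len; }

int main(void)
{
    uint8_t msg[256];
    size_t off = 0;

    /* 1. number of modules (ports) in the message */
    off = put_u8(msg, off, 1);

    /* 2a. common module data (assumed fields for this sketch) */
    uint64_t subnet_id = 0xfe80000000000000ULL;
    uint16_t lid = 7, apm_lid = 0;
    uint8_t  mtu_enum = 3;                 /* encoded MTU value */
    off = put_bytes(msg, off, &subnet_id, sizeof(subnet_id));
    off = put_bytes(msg, off, &lid, sizeof(lid));
    off = put_bytes(msg, off, &apm_lid, sizeof(apm_lid));
    off = put_u8(msg, off, mtu_enum);

    /* 2b. number of CPCs that follow for this module */
    off = put_u8(msg, off, 1);

    /* 2c. one CPC entry: index in all[], priority, blob length, blob */
    const uint8_t blob[] = { 0xde, 0xad };
    off = put_u8(msg, off, 2);             /* index of the CPC */
    off = put_u8(msg, off, 50);            /* priority of this CPC */
    off = put_u8(msg, off, sizeof(blob));  /* length of the blob */
    off = put_bytes(msg, off, blob, sizeof(blob));

    printf("packed modex-style message: %zu bytes\n", off);
    return 0;
}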



Gaa.. I forgot to finish explaining the spreadsheet before I sent  
this; sorry...


The 4 lines of oob/xoob/ibcm/rdmacm cpc sizes are how many bytes those  
cpc's contribute (on a per-port basis) to the modex.  "size 1" is what  
they currently contribute.  "size 2" is if Jon and I are able to shave  
off a few more bytes (not entirely sure that's possible yet).


The machine 1 and machine 2 entries are three configurations each of two
sample machines.


The first block of numbers is how big the openib part of the modex is  
when only using the ibcm cpc, when only using the rdmacm cpc, and when  
using both the ibcm and rdmacm cpc's (i.e., both are sent in the  
modex; one will "win" and be used at run-time).  The overall number is  
a result of plugging in the numbers from the machine parameters  
(nodes, ppn, num ports) and the ibcm/rdmacm cpc sizes to the formula  
at the top of the spreadsheet.


The second block of numbers comes from modifying the formula at the top
of the spreadsheet to basically calculate sending the per-port
information only once (this modified formula did not include sending a
per-port bitmap, as came up later in the thread).  The green numbers in
that block are the differences between these numbers and the first block.


The third block of numbers is the same as the second block, but using  
the "size 2" cpc sizes.  The green numbers are the differences between  
these numbers and the first block; the blue numbers are the  
differences between these numbers and the second block.


-

Note: based on what came up later in the thread (e.g., not taking into  
account carto and whatnot), the 2nd and 3rd blocks of numbers are not  
entirely accurate.  But they're likely still in the right ballpark.   
My point was that the size differences from the 1st block and the 2nd/ 
3rd blocks seemed to be significant enough to warrant moving ahead  
with a "reduce replication in the modex" scheme.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:

It'll increase it compared to the optimization that we're about to
make.  But it will certainly be a large decrease compared to what
we're doing today


Maybe I don't understand something in what you propose then. Currently
when I run two procs on the same node and each proc uses a different
HCA, each one of them sends a message that describes the HCA in use by
the proc. The message is of the form .
Each proc sends one of those, so there are two messages total on the
wire.

You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of
the HCA that it is going to use. And this is more data on the wire
after the proposed optimization than we have now.


I guess what I'm trying to address is optimizing the common case.   
What I perceive the common case to be is:


- high PPN values (4, 8, 16, ...)
- PPN is larger than the number of verbs-capable ports
- homogeneous openfabrics network

Yes, you will definitely find other cases.  But I'd guess that this  
is, by far, the most common case (especially at scale).  I don't want  
to penalize the common case for the sake of some one-off installations.


I'm basing this optimization on the assumption that PPNs will be
larger than the number of available ports, so there is guaranteed to
be duplication in the modex message.  Removing that duplication is the
main goal of this optimization.



 (see the spreadsheet that I sent last week).

I've looked at it, but I could not decipher it :( I don't understand
where all these numbers come from.


Why didn't you ask?  :-)

The size of the openib modex is explained in btl_openib_component.c in  
the branch.  It's a packed message now; we don't just blindly copy an  
entire struct.  Here's the comment:


/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the  
message

 * 2. for each module:
 *a. the common module data
 *b. a uint8_t indicating how many CPCs follow
 *c. for each CPC:
 *   a. a uint8_t indicating the index of the CPC in the all[]
 *  array in btl_openib_connect_base.c
 *   b. a uint8_t indicating the priority of this CPC
 *   c. a uint8_t indicating the length of the blob to follow
 *   d. a blob that is only meaningful to that CPC
 */

The common module data is what I sent in the other message.


I guess I don't see the problem...?

I like things to be simple. KISS principle I guess.


I can see your point that this is getting fairly complicated.  :-\   
See below.



And I do care about
heterogeneous include/exclude too.


How much?  I still think we can support it just fine; I just want to  
make [what I perceive to be] the common case better.


I looked at what kind of data we send during the openib modex and I
created a file with 1 openib modex messages. The mtu, subnet id, and
cpc list were the same in each message, but lid/apm_lid were different;
this is a pretty close approximation of the data that is sent from the
HN to each process. The uncompressed file size is 489K; the compressed
file size is 43K. More than 10 times smaller.



Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to  
compress/uncompress this data in memory?  That also must be factored  
into the overall time.


How about a revised and combined proposal:

- openib: Use a simplified "send all ACTIVE ports" per-host message  
(i.e., before include/exclude and carto is applied)
- openib: Send a small bitmap for each proc indicating which ports  
each btl module will use
- modex: Compress the result (probably only if it's larger than some
threshold size?) when sending, decompress upon receive


This keeps it simple -- no special cases for heterogeneous include/ 
exclude, etc.  And if compression is cheap (can you do some  
experiments to find out?), perhaps we can link against libz (I see the  
libz in at least RHEL4 is BSD licensed, so there's no issue there) and  
de/compress in memory.
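
For what it's worth, here is a self-contained sketch (not Open MPI code; the
payload is a synthetic stand-in for a modex blob with repeated per-port data)
of the in-memory zlib round trip being proposed, using compress2() and
uncompress(); link with -lz:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Synthetic payload with lots of repetition, like per-port modex data. */
    const char pattern[] = "mtu=3 subnet=0xfe80000000000000 cpc=oob,rdmacm ";
    uLong src_len = 100000;
    Bytef *src = malloc(src_len);
    for (uLong i = 0; i < src_len; i++)
        src[i] = (Bytef) pattern[i % (sizeof(pattern) - 1)];

    /* Compress into a buffer sized by compressBound(). */
    uLongf comp_len = compressBound(src_len);
    Bytef *comp = malloc(comp_len);
    if (compress2(comp, &comp_len, src, src_len, Z_BEST_SPEED) != Z_OK) {
        fprintf(stderr, "compress2 failed\n");
        return 1;
    }

    /* Decompress and verify the round trip. */
    uLongf out_len = src_len;
    Bytef *out = malloc(out_len);
    if (uncompress(out, &out_len, comp, comp_len) != Z_OK ||
        out_len != src_len || memcmp(out, src, src_len) != 0) {
        fprintf(stderr, "round trip failed\n");
        return 1;
    }

    printf("original %lu bytes, compressed %lu bytes\n",
           (unsigned long) src_len, (unsigned long) comp_len);
    free(src); free(comp); free(out);
    return 0;
}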


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Ssh tunnelling broken in trunk?

2008-04-03 Thread Jon Mason
On Wednesday 02 April 2008 08:04:10 pm Ralph Castain wrote:
> Hmmm...something isn't making sense. Can I see the command line you used to
> generate this?

mpirun --n 2 --host vic12,vic20 -mca btl openib,self --mca 
btl_openib_receive_queues P,65536,256,128,128 -d xterm -e 
gdb /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1  

> I'll tell you why I'm puzzled. If orte_debug_flag is set, then the
> "--daemonize" should NOT be there, and you should see "--debug" on that
> command line. What I see is the reverse, which implies to me that
> orte_debug_flag is NOT being set to "true".
>
> When I tested here and on odin, though, I found that the -d option
> correctly set the flag and everything works just fine.
>
> So there is something in your environment or setup that is messing up that
> orte_debug_flag. I have no idea what it could be - the command line should
> override anything in your environment, but you could check. Otherwise, if
> this diagnostic output came from a command line that included -d or
> --debug-devel, or had OMPI_MCA_orte_debug=1 in the environment, then I am
> at a loss - everywhere I've tried it, it works fine.

I'll double check and do a completely fresh svn pull and install and see where 
that gets me.

Thanks for the help,
Jon


> Ralph
>
> On 4/2/08 5:41 PM, "Jon Mason"  wrote:
> > On Wednesday 02 April 2008 05:04:47 pm Ralph Castain wrote:
> >> Here's a real simple diagnostic you can do: set -mca plm_base_verbose 1
> >> and look at the cmd line being executed (send it here). It will look
> >> like:
> >>
> >> [[xxx,1],0] plm:rsh: executing: jjkljks;jldfsaj;
> >>
> >> If the cmd line has --daemonize on it, then the ssh will close and xterm
> >> won't work.
> >
> > [vic20:01863] [[40388,0],0] plm:rsh: executing: (//usr/bin/ssh)
> > [/usr/bin/ssh vic12 orted --daemonize -mca ess env -mca orte_ess_jobid
> > 2646867968 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
> > 2 --hnp-uri
> > "2646867968.0;tcp://192.168.70.150:39057;tcp://10.10.0.150:39057;tcp://86
> >.75.3 0.10:39057" --nodename
> > vic12 -mca btl openib,self --mca btl_openib_receive_queues
> > P,65536,256,128,128 -mca plm_base_verbose 1 -mca
> > mca_base_param_file_path
> > /usr/mpi/gcc/ompi-trunk/share/openmpi/amca-param-sets:/root -mca
> > mca_base_param_file_path_force /root]
> >
> >
> > It looks like what you say is happening.  Is this configured somewhere,
> > so that I can remove it?
> >
> > Thanks,
> > Jon
> >
> >> Ralph
> >>
> >> On 4/2/08 3:14 PM, "Jeff Squyres"  wrote:
> >>> Can you diagnose a little further:
> >>>
> >>> 1. in the case where it works, can you verify that the ssh to launch
> >>> the orteds is still running?
> >>>
> >>> 2. in the case where it doesn't work, can you verify that the ssh to
> >>> launch the orteds has actually died?
> >>>
> >>> On Apr 2, 2008, at 4:58 PM, Jon Mason wrote:
>  On Wednesday 02 April 2008 01:21:31 pm Jon Mason wrote:
> > On Wednesday 02 April 2008 11:54:50 am Ralph H Castain wrote:
> >> I remember that someone had found a bug that caused
> >> orte_debug_flag to not
> >> get properly set (local var covering over a global one) - could be
> >> that
> >> your tmp-public branch doesn't have that patch in it.
> >>
> >> You might try updating to the latest trunk
> >
> > I updated my ompi-trunk tree, did a clean build, and I still seem
> > the same
> > problem.  I regressed trunk to rev 17589 and everything works as I
> > expect.
> > So I think the problem is still there in the top of trunk.
> 
>  I stepped through the revs of trunk and found the first failing rev
>  to be
>  17632.  It's a big patch, so I'll defer to those more in the know to
>  determine
>  what is breaking in there.
> 
> > I don't discount user error, but I don't think I am doing anything
> > different.
> > Did some setting change that perhaps I did not modify?
> >
> > Thanks,
> > Jon
> >
> >> On 4/2/08 10:41 AM, "George Bosilca"  wrote:
> >>> I'm using this feature on the trunk with the version from
> >>> yesterday.
> >>> It works without problems ...
> >>>
> >>>   george.
> >>>
> >>> On Apr 2, 2008, at 12:14 PM, Jon Mason wrote:
>  On Wednesday 02 April 2008 11:07:18 am Jeff Squyres wrote:
> > Are these r numbers relevant on the /tmp-public branch, or the
> > trunk?
> 
>  I pulled it out of the command used to update the branch, which
>  was:
>  svn merge -r 17590:17917 https://svn.open-mpi.org/svn/ompi/trunk .
> 
>  In the cpc tmp branch, it happened at r17920.
> 
>  Thanks,
>  Jon
> 
> > On Apr 2, 2008, at 11:59 AM, Jon Mason wrote:
> >> I regressed my tree and it looks like it happened between
> >> 17590:17917
> >>
> >> On Wednesday 02 April 2008 10:22:52 am Jon Mason wrote:
> >>> I am noticing that ssh seems to be br

Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Jeff Squyres

On Apr 3, 2008, at 9:18 AM, Gleb Natapov wrote:

I am talking about the openib part of the modex. The "garbage" I am
referring to is this:


FWIW, on the openib-cpc2 branch, the base data that is sent in the  
modex is this:


uint64_t subnet_id;
/** LID of this port */
uint16_t lid;
/** APM LID for this port */
uint16_t apm_lid;
/** The MTU used by this port */
uint8_t mtu;

lid is used by both the xoob and ibcm cpc's.  We can skip packing the  
apm_lid if apm support is not used if you really want to.  The MTU has  
been changed to the 8 bit enum value.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Gleb Natapov
On Thu, Apr 03, 2008 at 07:05:28AM -0600, Ralph H Castain wrote:
> H...since I have no control nor involvement in what gets sent, perhaps I
> can be a disinterested third party. ;-)
> 
> Could you perhaps explain this comment:
> 
> > BTW I looked at how we do modex now on the trunk. For OOB case more
> > than half the data we send for each proc is garbage.
> 
> 
> What "garbage" are you referring to? I am working to remove the stuff
> inserted by proc.c - mostly hostname, hopefully arch, etc. If you are
> running a "debug" version, there will also be type descriptors for each
> entry, but those are eliminated for optimized builds.
> 
> So are you referring to other things?
I am talking about the openib part of the modex. The "garbage" I am
referring to is this:

This is the structure that is sent by the modex for each openib BTL. We send
the entire structure by copying it into the message.
struct mca_btl_openib_port_info {
uint32_t mtu;
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
uint8_t padding[4];
#endif
uint64_t subnet_id;
uint16_t lid; /* used only in xrc */
uint16_t apm_lid; /* the lid is used for APM to
 different port */
   char *cpclist;
};

The sizeof() of the struct is 32 bytes, but how much useful info does it
actually contain?
mtu  - should really be uint8 since this is an encoded value (1,2,3,4)
padding - is garbage.
subnet_id - is ok
lid - should be sent only for the XRC case
apm_lid - should be sent only if APM is enabled
cpclist - is pure garbage and should not be in this struct at all.

So we send 32 bytes with only 9 bytes of useful info (for the non-XRC
case without APM enabled).
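
To make the arithmetic concrete, here is a tiny sketch (illustration only, not
a proposed wire format, packed in host byte order) of just those ~9 useful
bytes for the non-XRC, non-APM case: the encoded MTU byte plus the 64-bit
subnet id.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint8_t  mtu_enum  = 3;                    /* encoded value 1..4 */
    uint64_t subnet_id = 0xfe80000000000000ULL;

    /* Pack without padding: 1 byte MTU + 8 bytes subnet id = 9 bytes. */
    uint8_t wire[1 + sizeof(subnet_id)];
    wire[0] = mtu_enum;
    memcpy(&wire[1], &subnet_id, sizeof(subnet_id));

    printf("packed per-port entry: %zu bytes (vs the 32-byte struct copy)\n",
           sizeof(wire));
    return 0;
}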

--
Gleb.


Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Ralph H Castain
H...since I have no control nor involvement in what gets sent, perhaps I
can be a disinterested third party. ;-)

Could you perhaps explain this comment:

> BTW I looked at how we do modex now on the trunk. For OOB case more
> than half the data we send for each proc is garbage.


What "garbage" are you referring to? I am working to remove the stuff
inserted by proc.c - mostly hostname, hopefully arch, etc. If you are
running a "debug" version, there will also be type descriptors for each
entry, but those are eliminated for optimized builds.

So are you referring to other things?

Thanks
Ralph


On 4/3/08 6:52 AM, "Gleb Natapov"  wrote:

> On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
 that it's the same for all procs on all hosts.  I guess there's a few
 cases:
 
 1. homogeneous include/exclude, no carto: send all in node info; no
 proc info
 2. homogeneous include/exclude, carto is used: send all ports in node
 info; send index in proc info for which node info port index it
 will use
>>> This may actually increase modex size. Think about two procs using two
>>> different hcas. We'll send all the data we send today + indexes.
>> 
>> It'll increase it compared to the optimization that we're about to
>> make.  But it will certainly be a large decrease compared to what
>> we're doing today
> 
> Maybe I don't understand something in what you propose then. Currently
> when I run two procs on the same node and each proc uses a different HCA,
> each one of them sends a message that describes the HCA in use by the
> proc. The message is of the form .
> Each proc sends one of those, so there are two messages total on the wire.
> You propose that one of them should send a description of both
> available ports (that is, one of them sends two messages of the form
> above) and then each proc sends an additional message with the index of the
> HCA that it is going to use. And this is more data on the wire after the
> proposed optimization than we have now.
> 
> 
>>   (see the spreadsheet that I sent last week).
> I've looked at it, but I could not decipher it :( I don't understand
> where all these numbers come from.
> 
>> 
>> Indeed, we can even put in the optimization that if there's only one
>> process on a host, it can only publish the ports that it will use (and
>> therefore there's no need for the proc data).
> More special cases :(
> 
>> 
 3. heterogeneous include/exclude, no cart: need user to tell us that
 this situation exists (e.g., use another MCA param), but then is same
 as #2
 4. heterogeneous include/exclude, cart is used, same as #3
 
 Right?
 
>>> Looks like it. FWIW I don't like the idea to code all those special
>>> cases. The way it works now I can be pretty sure that any crazy setup
>>> I'll come up with will work.
>> 
>> And so it will with the new scheme.  The only place it won't work is
>> if the user specifies a heterogeneous include/exclude (i.e., we'll
>> require that the user tells us when they do that), which nobody does.
>> 
>> I guess I don't see the problem...?
> I like things to be simple. KISS principle I guess. And I do care about
> heterogeneous include/exclude too.
> 
> BTW I looked at how we do modex now on the trunk. For OOB case more
> than half the data we send for each proc is garbage.
> 
>> 
>>> By the way how much data are moved during modex stage? What if modex
>>> will use compression?
>> 
>> 
>> The spreadsheet I listed was just the openib part of the modex, and it
>> was fairly hefty.  I have no idea how well (or not) it would compress.
>> 
> I looked at what kind of data we send during the openib modex and I created
> a file with 1 openib modex messages. The mtu, subnet id, and cpc list were
> the same in each message, but lid/apm_lid were different; this is a
> pretty close approximation of the data that is sent from the HN to each
> process. The uncompressed file size is 489K; the compressed file size is 43K.
> More than 10 times smaller.
> 
> --
> Gleb.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: changes to modex

2008-04-03 Thread Gleb Natapov
On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote:
> >> that it's the same for all procs on all hosts.  I guess there's a few
> >> cases:
> >>
> >> 1. homogeneous include/exclude, no carto: send all in node info; no
> >> proc info
> >> 2. homogeneous include/exclude, carto is used: send all ports in node
> >> info; send index in proc info for which node info port index it  
> >> will use
> > This may actually increase modex size. Think about two procs using two
> > different hcas. We'll send all the data we send today + indexes.
> 
> It'll increase it compared to the optimization that we're about to  
> make.  But it will certainly be a large decrease compared to what  
> we're doing today

Maybe I don't understand something in what you propose then. Currently
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by the
proc. The message is of the form .
Each proc sends one of those, so there are two messages total on the wire.
You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of the
HCA that it is going to use. And this is more data on the wire after the
proposed optimization than we have now.


>   (see the spreadsheet that I sent last week).
I've looked at it, but I could not decipher it :( I don't understand
where all these numbers come from.

> 
> Indeed, we can even put in the optimization that if there's only one  
> process on a host, it can only publish the ports that it will use (and  
> therefore there's no need for the proc data).
More special cases :(

> 
> >> 3. heterogeneous include/exclude, no cart: need user to tell us that
> >> this situation exists (e.g., use another MCA param), but then is same
> >> as #2
> >> 4. heterogeneous include/exclude, cart is used, same as #3
> >>
> >> Right?
> >>
> > Looks like it. FWIW I don't like the idea of coding all those special
> > cases. The way it works now, I can be pretty sure that any crazy setup
> > I'll come up with will work.
> 
> And so it will with the new scheme.  The only place it won't work is  
> if the user specifies a heterogeneous include/exclude (i.e., we'll  
> require that the user tells us when they do that), which nobody does.
> 
> I guess I don't see the problem...?
I like things to be simple. KISS principle I guess. And I do care about
heterogeneous include/exclude too.

BTW I looked at how we do modex now on the trunk. For OOB case more
than half the data we send for each proc is garbage.

> 
> > By the way how much data are moved during modex stage? What if modex
> > will use compression?
> 
> 
> The spreadsheet I listed was just the openib part of the modex, and it  
> was fairly hefty.  I have no idea how well (or not) it would compress.
> 
I looked at what kind of data we send during the openib modex and I created
a file with 1 openib modex messages. The mtu, subnet id, and cpc list were
the same in each message, but lid/apm_lid were different; this is a
pretty close approximation of the data that is sent from the HN to each
process. The uncompressed file size is 489K; the compressed file size is 43K.
More than 10 times smaller.

--
Gleb.