Okay, I have a partial fix in there now. You'll have to run with "-mca routed
unity", as I still need to fix it for the routed tree component.

Couple of things:

1. I fixed the --debug flag so it automatically turns on the debug output
from the data server code itself. Now ompi-server will tell you when it is
accessed.

2. Remember that we added an MPI_Info key that specifies whether you want the
data stored locally (on your own mpirun) or globally (on the ompi-server). If
you specify nothing, the precedence built into the code defaults to "local".
So you have to tell us that the data is to be published "global" if you want
to connect multiple mpiruns - see the sketch below.
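
Here's a rough sketch of what I mean (I'm assuming the global-scope key is
named "ompi_global_scope" - check pubsub_orte.c for the exact key name and
accepted values):

   /* sketch only - the "ompi_global_scope" key name is an assumption,
      verify it in ompi/mca/pubsub/orte/pubsub_orte.c */
   char port[MPI_MAX_PORT_NAME];
   MPI_Info info;

   MPI_Info_create(&info);
   MPI_Info_set(info, "ompi_global_scope", "true");

   /* publishing side: store the port on the ompi-server */
   MPI_Open_port(MPI_INFO_NULL, port);
   MPI_Publish_name("test_service_el", info, port);

   /* looking-up side (in the other mpirun): use the same key */
   MPI_Lookup_name("test_service_el", info, port);

   MPI_Info_free(&info);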

I believe Jeff wrote all that up somewhere - could be in an email thread,
though. It's been too long ago for me to remember... ;-) As a last resort,
you can look it up in the code - it is in ompi/mca/pubsub/orte/pubsub_orte.c.

Ralph



On 4/4/08 12:55 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

> Well, something got borked in here - will have to fix it, so this will
> probably not get done until next week.
> 
> 
> On 4/4/08 12:26 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
> 
>> Yeah, you didn't specify the file correctly...plus I found a bug in the
>> code when I looked (orterun was a little out-of-date).
>> 
>> I am updating orterun (commit soon) and will include a better help message
>> about the proper format of the orterun cmd-line option. The syntax is:
>> 
>> -ompi-server uri
>> 
>> or -ompi-server file:filename-where-uri-exists
>> 
>> Problem here is that you gave it a uri of "test", which means nothing. ;-)
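>> 
>> For example (the /tmp path below is just illustrative - any file that both
>> shells can read will do):
>> 
>>   xterm1$ ompi-server --report-uri /tmp/ompi-server.uri
>>   xterm2$ mpirun -ompi-server file:/tmp/ompi-server.uri -np 1 mpi_accept_test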
>> 
>> Should have it up-and-going soon.
>> Ralph
>> 
>> On 4/4/08 12:02 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>> 
>>> Ralph,
>>> 
>>> I've not been very successful at using ompi-server. I tried this:
>>> 
>>> xterm1$ ompi-server --debug-devel -d --report-uri test
>>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>>> daemon uri NULL
>>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
>>> 
>>> 
>>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>>> Port name:
>>> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
>>> 
>>> xterm3$ mpirun -ompi-server test  -np 1 simple_connect
>>> --------------------------------------------------------------------------
>>> Process rank 0 attempted to lookup from a global ompi_server that
>>> could not be contacted. This is typically caused by either not
>>> specifying the contact info for the server, or by the server not
>>> currently executing. If you did specify the contact info for a
>>> server, please check to see that the server is running and start
>>> it again (or have your sys admin start it) if it isn't.
>>> 
>>> --------------------------------------------------------------------------
>>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> The server code calls Open_port and then Publish_name. It looks like the
>>> Lookup_name call cannot reach the ompi-server. The ompi-server in debug
>>> mode does not show any output when a new event occurs (like when the
>>> server is launched). Is there something wrong in the way I use it?
>>> 
>>> Aurelien
>>> 
>>> On Apr 3, 2008, at 5:21 PM, Ralph Castain wrote:
>>>> Take a gander at ompi/tools/ompi-server - I believe I put a man page in
>>>> there. You might just try "man ompi-server" and see if it shows up.
>>>> 
>>>> Holler if you have a question - not sure I documented it very thoroughly
>>>> at the time.
>>>> 
>>>> 
>>>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>
>>>> wrote:
>>>> 
>>>>> Ralph,
>>>>> 
>>>>> 
>>>>> I am using trunk. Is there documentation for ompi-server? It sounds
>>>>> exactly like what I need to fix point 1.
>>>>> 
>>>>> Aurelien
>>>>> 
>>>>> On Apr 3, 2008, at 5:06 PM, Ralph Castain wrote:
>>>>>> I guess I'll have to ask the basic question: what version are you
>>>>>> using?
>>>>>> 
>>>>>> If you are talking about the trunk, there no longer is a "universe"
>>>>>> concept anywhere in the code. Two mpiruns can connect/accept to each
>>>>>> other as long as they can make contact. To facilitate that, we created
>>>>>> an "ompi-server" tool that is supposed to be run by the sys-admin (or a
>>>>>> user, doesn't matter which) on the head node - there are various ways
>>>>>> to tell mpirun how to contact the server, or it can self-discover it.
>>>>>> 
>>>>>> I have tested publish/lookup pretty thoroughly and it seems to work. I
>>>>>> haven't spent much time testing connect/accept except via comm_spawn,
>>>>>> which seems to be working. Since that uses the same mechanism, I would
>>>>>> have expected connect/accept to work as well.
>>>>>> 
>>>>>> If you are talking about 1.2.x, then the story is totally different.
>>>>>> 
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi everyone,
>>>>>>> 
>>>>>>> I'm trying to figure out how complete the implementation of
>>>>>>> Comm_connect/Accept is. I found two problematic cases.
>>>>>>> 
>>>>>>> 1) Two different programs are started by two different mpiruns. One
>>>>>>> calls accept, the other calls connect. I would not expect
>>>>>>> MPI_Publish_name/Lookup_name to work because they do not share the
>>>>>>> HNP. Still, I would expect to be able to connect by copying (with
>>>>>>> printf-scanf) the port_name string generated by Open_port, especially
>>>>>>> considering that in Open MPI the port_name is a string containing the
>>>>>>> tcp address and port of rank 0 in the server communicator. However,
>>>>>>> doing so results in "no route to host" and the connecting application
>>>>>>> aborts. Is the problem related to an explicit check of the universes
>>>>>>> on the accept HNP? Do I expect too much from the MPI standard? Is it
>>>>>>> because my two applications do not share the same universe? Should we
>>>>>>> (re)add the ability to use the same universe for several mpiruns?
>>>>>>> 
>>>>>>> 2) The second issue is when the program sets up a port and then
>>>>>>> accepts multiple clients on this port. Everything works fine for the
>>>>>>> first client, and then accept stalls forever while waiting for the
>>>>>>> second one. My understanding of the standard is that it should work:
>>>>>>> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
>>>>>>> must call MPI_Comm_accept to accept connections from clients". I
>>>>>>> understand that with one MPI_Open_port I should be able to manage
>>>>>>> several MPI clients. Am I understanding the standard correctly here,
>>>>>>> and should we fix this?
>>>>>>> 
>>>>>>> Here is a copy of the non-working code for reference.
>>>>>>> 
>>>>>>> /*
>>>>>>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>>>>>>  *                         All rights reserved.
>>>>>>>  * $COPYRIGHT$
>>>>>>>  *
>>>>>>>  * Additional copyrights may follow
>>>>>>>  *
>>>>>>>  * $HEADER$
>>>>>>>  */
>>>>>>> #include <stdlib.h>
>>>>>>> #include <stdio.h>
>>>>>>> #include <mpi.h>
>>>>>>> 
>>>>>>> int main(int argc, char *argv[])
>>>>>>> {
>>>>>>>    char port[MPI_MAX_PORT_NAME];
>>>>>>>    int rank;
>>>>>>>    int np;
>>>>>>> 
>>>>>>>    MPI_Init(&argc, &argv);
>>>>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>    MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>>>>> 
>>>>>>>    if(rank)
>>>>>>>    {
>>>>>>>        MPI_Comm comm;
>>>>>>>        /* client */
>>>>>>>        MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>>>>>>                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>>>        printf("Read port: %s\n", port);
>>>>>>>        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>>>>>> 
>>>>>>>        MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>>>>>>        MPI_Comm_disconnect(&comm);
>>>>>>>    }
>>>>>>>    else
>>>>>>>    {
>>>>>>>        int nc = np - 1;
>>>>>>>        MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>>>>>>        MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>>>>>>        int *event = (int *) calloc(nc, sizeof(int));
>>>>>>>        int i;
>>>>>>> 
>>>>>>>        MPI_Open_port(MPI_INFO_NULL, port);
>>>>>>> /*      MPI_Publish_name("test_service_el", MPI_INFO_NULL, port); */
>>>>>>>        printf("Port name: %s\n", port);
>>>>>>>        for(i = 1; i < np; i++)
>>>>>>>            MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>>>>>>                     MPI_COMM_WORLD);
>>>>>>> 
>>>>>>>        for(i = 0; i < nc; i++)
>>>>>>>        {
>>>>>>>            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>>>>                            &comm_nodes[i]);
>>>>>>>            printf("Accept %d\n", i);
>>>>>>>            MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>>>>>>                      &reqs[i]);
>>>>>>>            printf("IRecv %d\n", i);
>>>>>>>        }
>>>>>>>        MPI_Close_port(port);
>>>>>>>        MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>>>>>>>        for(i = 0; i < nc; i++)
>>>>>>>        {
>>>>>>>            printf("event[%d] = %d\n", i, event[i]);
>>>>>>>            MPI_Comm_disconnect(&comm_nodes[i]);
>>>>>>>            printf("Disconnect %d\n", i);
>>>>>>>        }
>>>>>>>    }
>>>>>>> 
>>>>>>>    MPI_Finalize();
>>>>>>>    return EXIT_SUCCESS;
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> * Dr. Aurélien Bouteiller
>>>>>>> * Sr. Research Associate at Innovative Computing Laboratory
>>>>>>> * University of Tennessee
>>>>>>> * 1122 Volunteer Boulevard, suite 350
>>>>>>> * Knoxville, TN 37996
>>>>>>> * 865 974 6321
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 