Okay, I have a partial fix in there now. You'll have to use "-mca routed unity" for now, as I still need to fix it for routed tree.
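For example, reusing the file name and the two test programs from your runs below (just a sketch - adjust names and paths for your setup), the whole sequence would look something like:

xterm1$ ompi-server --report-uri test
xterm2$ mpirun -mca routed unity -ompi-server file:test -np 1 mpi_accept_test
xterm3$ mpirun -mca routed unity -ompi-server file:test -np 1 simple_connect

Note the "file:" prefix on the -ompi-server argument - more on that below.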
Couple of things:

1. I fixed the --debug flag so it automatically turns on the debug output from the data server code itself. ompi-server will now tell you when it is accessed.

2. Remember, we added an MPI_Info key that specifies whether you want the data stored locally (on your own mpirun) or globally (on the ompi-server). If you specify nothing, the precedence built into the code defaults to "local". So you have to tell us that this data is to be published "global" if you want to connect multiple mpiruns. I believe Jeff wrote all of that up somewhere - it could be in an email thread, though; it has been too long ago for me to remember... ;-) As a last resort you can look it up in the code - it is in ompi/mca/pubsub/orte/pubsub_orte.c.
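For illustration, here is a minimal sketch of what a "global" publish would look like on the server side. I'm going from memory, so treat the "ompi_global_scope" key name as an assumption and verify it against pubsub_orte.c:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);

    /* Ask for the name to be stored on the ompi-server ("global")
     * instead of on our own mpirun ("local", the default).
     * NOTE: the "ompi_global_scope" key name is assumed here - check
     * ompi/mca/pubsub/orte/pubsub_orte.c for the exact key. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");
    MPI_Publish_name("test_service_el", info, port);
    printf("Published port: %s\n", port);

    /* ... MPI_Comm_accept(port, ...) would go here, with the client in
     * the other mpirun doing MPI_Lookup_name() and MPI_Comm_connect() ... */

    MPI_Unpublish_name("test_service_el", info, port);
    MPI_Info_free(&info);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

I believe the lookup side can pass the same key to MPI_Lookup_name to force the global server to be consulted, but again, check pubsub_orte.c for the exact lookup precedence.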
Ralph


On 4/4/08 12:55 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

> Well, something got borked in here - will have to fix it, so this will
> probably not get done until next week.
>
>
> On 4/4/08 12:26 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
>
>> Yeah, you didn't specify the file correctly... plus I found a bug in the
>> code when I looked (orterun was a little out of date).
>>
>> I am updating orterun (commit soon) and will include a better help message
>> about the proper format of the orterun cmd-line option. The syntax is:
>>
>> -ompi-server uri
>>
>> or
>>
>> -ompi-server file:filename-where-uri-exists
>>
>> The problem here is that you gave it a uri of "test", which means nothing. ;-)
>>
>> Should have it up-and-going soon.
>> Ralph
>>
>> On 4/4/08 12:02 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>
>>> Ralph,
>>>
>>> I've not been very successful at using ompi-server. I tried this:
>>>
>>> xterm1$ ompi-server --debug-devel -d --report-uri test
>>> [grosse-pomme.local:01097] proc_info: hnp_uri NULL
>>>     daemon uri NULL
>>> [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!
>>>
>>> xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
>>> Port name:
>>> 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300
>>>
>>> xterm3$ mpirun -ompi-server test -np 1 simple_connect
>>> --------------------------------------------------------------------------
>>> Process rank 0 attempted to lookup from a global ompi_server that
>>> could not be contacted. This is typically caused by either not
>>> specifying the contact info for the server, or by the server not
>>> currently executing. If you did specify the contact info for a
>>> server, please check to see that the server is running and start
>>> it again (or have your sys admin start it) if it isn't.
>>> --------------------------------------------------------------------------
>>> [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
>>> [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
>>> [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
>>> [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>> --------------------------------------------------------------------------
>>>
>>> The server code calls Open_port and then Publish_name. It looks like the
>>> Lookup_name call cannot reach the ompi-server. The ompi-server in debug
>>> mode does not show any output when a new event occurs (like when the
>>> server is launched). Is there something wrong in the way I use it?
>>>
>>> Aurelien
>>>
>>> On Apr 3, 2008, at 17:21, Ralph Castain wrote:
>>>
>>>> Take a gander at ompi/tools/ompi-server - I believe I put a man page
>>>> in there. You might just try "man ompi-server" and see if it shows up.
>>>>
>>>> Holler if you have a question - not sure I documented it very
>>>> thoroughly at the time.
>>>>
>>>>
>>>> On 4/3/08 3:10 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> I am using the trunk. Is there documentation for ompi-server? It sounds
>>>>> exactly like what I need to fix point 1.
>>>>>
>>>>> Aurelien
>>>>>
>>>>> On Apr 3, 2008, at 17:06, Ralph Castain wrote:
>>>>>
>>>>>> I guess I'll have to ask the basic question: what version are you
>>>>>> using?
>>>>>>
>>>>>> If you are talking about the trunk, there no longer is a "universe"
>>>>>> concept anywhere in the code. Two mpiruns can connect/accept to each
>>>>>> other as long as they can make contact. To facilitate that, we created
>>>>>> an "ompi-server" tool that is supposed to be run by the sys admin (or a
>>>>>> user, it doesn't matter which) on the head node - there are various
>>>>>> ways to tell mpirun how to contact the server, or it can self-discover
>>>>>> it.
>>>>>>
>>>>>> I have tested publish/lookup pretty thoroughly and it seems to work. I
>>>>>> haven't spent much time testing connect/accept except via comm_spawn,
>>>>>> which seems to be working. Since that uses the same mechanism, I would
>>>>>> have expected connect/accept to work as well.
>>>>>>
>>>>>> If you are talking about 1.2.x, then the story is totally different.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm trying to figure out how complete the implementation of
>>>>>>> Comm_connect/Accept is. I found two problematic cases.
>>>>>>>
>>>>>>> 1) Two different programs are started by two different mpiruns. One
>>>>>>> calls accept, the second one calls connect. I would not expect
>>>>>>> MPI_Publish_name/Lookup_name to work because they do not share the
>>>>>>> HNP. Still, I would expect to be able to connect by copying (with
>>>>>>> printf-scanf) the port_name string generated by Open_port, especially
>>>>>>> considering that in Open MPI the port_name is a string containing the
>>>>>>> tcp address and port of rank 0 in the server communicator. However,
>>>>>>> doing so results in "no route to host" and the connecting application
>>>>>>> aborts. Is the problem related to an explicit check of the universes
>>>>>>> on the accept HNP? Do I expect too much from the MPI standard? Is it
>>>>>>> because my two applications do not share the same universe? Should we
>>>>>>> (re)add the ability to use the same universe for several mpiruns?
>>>>>>>
>>>>>>> 2) The second issue is when the program sets up a port and then
>>>>>>> accepts multiple clients on this port. Everything works fine for the
>>>>>>> first client, and then accept stalls forever while waiting for the
>>>>>>> second one. My understanding of the standard is that it should work:
>>>>>>> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
>>>>>>> must call MPI_Comm_accept to accept connections from clients". I
>>>>>>> understand from this that with one MPI_Open_port I should be able to
>>>>>>> manage several MPI clients. Am I understanding the standard correctly
>>>>>>> here, and should we fix this?
>>>>>>>
>>>>>>> Here is a copy of the non-working code for reference.
>>>>>>>
>>>>>>> /*
>>>>>>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>>>>>>  *                         All rights reserved.
>>>>>>>  * $COPYRIGHT$
>>>>>>>  *
>>>>>>>  * Additional copyrights may follow
>>>>>>>  *
>>>>>>>  * $HEADER$
>>>>>>>  */
>>>>>>> #include <stdlib.h>
>>>>>>> #include <stdio.h>
>>>>>>> #include <mpi.h>
>>>>>>>
>>>>>>> int main(int argc, char *argv[])
>>>>>>> {
>>>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>>>     int rank;
>>>>>>>     int np;
>>>>>>>
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>>>>>
>>>>>>>     if(rank)
>>>>>>>     {
>>>>>>>         MPI_Comm comm;
>>>>>>>         /* client */
>>>>>>>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>>>>>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>>>         printf("Read port: %s\n", port);
>>>>>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>>>>>>
>>>>>>>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>>>>>>         MPI_Comm_disconnect(&comm);
>>>>>>>     }
>>>>>>>     else
>>>>>>>     {
>>>>>>>         int nc = np - 1;
>>>>>>>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>>>>>>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>>>>>>         int *event = (int *) calloc(nc, sizeof(int));
>>>>>>>         int i;
>>>>>>>
>>>>>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>>>>>         /* MPI_Publish_name("test_service_el", MPI_INFO_NULL, port); */
>>>>>>>         printf("Port name: %s\n", port);
>>>>>>>         for(i = 1; i < np; i++)
>>>>>>>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>>>>>>                      MPI_COMM_WORLD);
>>>>>>>
>>>>>>>         for(i = 0; i < nc; i++)
>>>>>>>         {
>>>>>>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>>>>                             &comm_nodes[i]);
>>>>>>>             printf("Accept %d\n", i);
>>>>>>>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i], &reqs[i]);
>>>>>>>             printf("IRecv %d\n", i);
>>>>>>>         }
>>>>>>>         MPI_Close_port(port);
>>>>>>>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>>>>>>>         for(i = 0; i < nc; i++)
>>>>>>>         {
>>>>>>>             printf("event[%d] = %d\n", i, event[i]);
>>>>>>>             MPI_Comm_disconnect(&comm_nodes[i]);
>>>>>>>             printf("Disconnect %d\n", i);
>>>>>>>         }
>>>>>>>     }
>>>>>>>
>>>>>>>     MPI_Finalize();
>>>>>>>     return EXIT_SUCCESS;
>>>>>>> }
>>>>>>>
>>>>>>> --
>>>>>>> * Dr. Aurélien Bouteiller
>>>>>>> * Sr. Research Associate at Innovative Computing Laboratory
>>>>>>> * University of Tennessee
>>>>>>> * 1122 Volunteer Boulevard, suite 350
>>>>>>> * Knoxville, TN 37996
>>>>>>> * 865 974 6321