I guess I'll have to ask the basic question: what version are you using? If you are talking about the trunk, there no longer is a "universe" concept anywhere in the code. Two mpiruns can connect/accept to each other as long as they can make contact. To facilitate that, we created an "ompi-server" tool that is supposed to be run by the sys-admin (or a user, doesn't matter which) on the head node - there are various ways to tell mpirun how to contact the server, or it can self-discover it.
I have tested publish/lookup pretty thoroughly and it seems to work. I haven't spent much time testing connect/accept except via comm_spawn, which seems to be working. Since that uses the same mechanism, I would have expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph

On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:

> Hi everyone,
>
> I'm trying to figure out how complete the implementation of
> Comm_connect/Accept is. I found two problematic cases.
>
> 1) Two different programs are started in two different mpiruns. One
> calls accept, the second one calls connect. I would not expect
> MPI_Publish_name/Lookup_name to work because they do not share the
> HNP. Still, I would expect to be able to connect by copying (with
> printf-scanf) the port_name string generated by Open_port, especially
> considering that in Open MPI the port_name is a string containing the
> TCP address and port of rank 0 in the server communicator.
> However, doing so results in "no route to host" and the connecting
> application aborts. Is the problem related to an explicit check of the
> universes on the accept HNP? Do I expect too much from the MPI
> standard? Is it because my two applications do not share the same
> universe? Should we (re)add the ability to use the same universe for
> several mpiruns?
>
> 2) The second issue is when the program sets up a port and then accepts
> multiple clients on this port. Everything works fine for the first
> client, and then accept stalls forever when waiting for the second
> one. My understanding of the standard is that it should work: 5.4.2
> states "it must call MPI_Open_port to establish a port [...] it must
> call MPI_Comm_accept to accept connections from clients". I understand
> that for one MPI_Open_port I should be able to manage several MPI
> clients. Am I understanding the standard correctly here, and should we
> fix this?
>
> Here is a copy of the non-working code for reference.
>
> /*
>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>  *                         All rights reserved.
>  * $COPYRIGHT$
>  *
>  * Additional copyrights may follow
>  *
>  * $HEADER$
>  */
> #include <stdlib.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     char port[MPI_MAX_PORT_NAME];
>     int rank;
>     int np;
>
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>
>     if(rank)
>     {
>         MPI_Comm comm;
>         /* client */
>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("Read port: %s\n", port);
>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>
>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>         MPI_Comm_disconnect(&comm);
>     }
>     else
>     {
>         int nc = np - 1;
>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc,
>                                                    sizeof(MPI_Comm));
>         MPI_Request *reqs = (MPI_Request *) calloc(nc,
>                                                    sizeof(MPI_Request));
>         int *event = (int *) calloc(nc, sizeof(int));
>         int i;
>
>         MPI_Open_port(MPI_INFO_NULL, port);
> /*        MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>         printf("Port name: %s\n", port);
>         for(i = 1; i < np; i++)
>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>                      MPI_COMM_WORLD);
>
>         for(i = 0; i < nc; i++)
>         {
>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>                             &comm_nodes[i]);
>             printf("Accept %d\n", i);
>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>                       &reqs[i]);
>             printf("IRecv %d\n", i);
>         }
>         MPI_Close_port(port);
>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>         for(i = 0; i < nc; i++)
>         {
>             printf("event[%d] = %d\n", i, event[i]);
>             MPI_Comm_disconnect(&comm_nodes[i]);
>             printf("Disconnect %d\n", i);
>         }
>     }
>
>     MPI_Finalize();
>     return EXIT_SUCCESS;
> }
>
> --
> * Dr. Aurélien Bouteiller
> * Sr. Research Associate at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel