I am referring to v1.6.

On Sep 12, 2012, at 5:27 PM, Ralph Castain wrote:
> What version of OMPI are you referring to?
>
> On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>
>> Dear all,
>>
>> I observed strange behavior with MPI_Comm_connect and MPI_Comm_disconnect.
>> In short, after two processes connect to each other through a port and merge
>> to create an intra-communicator (rank 0 and rank 1), only one of them (the
>> root) is thereafter able to reach a third new process through
>> MPI_Comm_connect.
>>
>> I can explain with an example:
>>
>> 1. Assume three MPI programs, each with a separate MPI_COMM_WORLD of
>>    size=1, rank=0: process1, process2, and process3.
>> 2. Process2 opens a port and waits in MPI_Comm_accept.
>> 3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and
>>    creates an inter-communicator.
>> 4. Process1 and process2 participate in MPI_Intercomm_merge and create an
>>    intra-communicator (say, newcomm).
>> 5. Process3 has also opened a port and is now waiting in MPI_Comm_accept.
>> 6. Process1 and process2 try to connect to process3 with
>>    MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).
>>
>> At this stage, only the root process of newcomm is able to connect to
>> process3; the other one is unable to find the route. If the root is
>> process1, then process2 fails, and vice versa.
>>
>> I have attached a tar file with a small example of this case. To reproduce
>> the above scenario, run the examples as follows:
>>
>> 1. Start two separate instances of "server":
>>      mpirun -np 1 ./server
>>      mpirun -np 1 ./server
>> 2. Each will print out its port name. Copy and paste the port names into
>>    client.c (in the strcpy calls).
>> 3. Compile client.c and start the client:
>>      mpirun -np 1 ./client
>> 4. You will see output from the first server (which is process2) during the
>>    final MPI_Comm_connect:
>>
>> [[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
>> [0] func:0 libopen-rte.2.dylib 0x0000000100055afb opal_backtrace_print + 43
>> [1] func:1 mca_rml_oob.so    0x000000010017aec3 rml_oob_recv_route_callback + 739
>> [2] func:2 mca_oob_tcp.so    0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
>> [3] func:3 mca_oob_tcp.so    0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
>> [4] func:4 libopen-rte.2.dylib 0x0000000100064a55 opal_event_base_loop + 837
>> [5] func:5 mpirun            0x00000001000018d1 orterun + 3428
>> [6] func:6 mpirun            0x0000000100000b6b main + 27
>> [7] func:7 mpirun            0x0000000100000b48 start + 52
>> [8] func:8 ???               0x0000000000000004 0x0 + 4
>>
>> Note that, just to keep this example simple, I am not using any
>> publish/lookup service; I am copying the port names manually.
>>
>> Can someone please look into this problem? We really want to use this for a
>> project but are blocked by this bug.
>>
>> Thanks!
>> Best,
>> Suraj
>>
>> <ac-test.tar>_______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
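For anyone reading along without the attached ac-test.tar, the failing client path (steps 3 through 6 above) can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual test case: the port strings are placeholders that must be replaced by hand with the output of the two server instances (each of which does MPI_Open_port, prints the port name, and blocks in MPI_Comm_accept over MPI_COMM_SELF), and the first server must also join the final collective MPI_Comm_connect over the merged communicator.

```c
/* client.c (sketch) -- steps 3-6 of the reported scenario.
 * Port names are placeholders; paste in the strings the two
 * server instances print, as in the original example. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port1[MPI_MAX_PORT_NAME];  /* printed by the first server  */
    char port2[MPI_MAX_PORT_NAME];  /* printed by the second server */
    MPI_Comm inter, newcomm, all3;

    MPI_Init(&argc, &argv);
    strcpy(port1, "PASTE FIRST SERVER PORT HERE");
    strcpy(port2, "PASTE SECOND SERVER PORT HERE");

    /* Step 3: connect to the first server -> inter-communicator. */
    MPI_Comm_connect(port1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);

    /* Step 4: merge into an intra-communicator. With high=0 here
     * (and high=1 on the server side) the client becomes rank 0,
     * i.e. the root of newcomm. */
    MPI_Intercomm_merge(inter, 0, &newcomm);

    /* Step 6: both members of newcomm collectively connect to the
     * second server. Per the report, the non-root member dies here
     * with "can't find route". The port argument is significant only
     * at the root, so the first server can pass an empty string. */
    MPI_Comm_connect(port2, MPI_INFO_NULL, 0, newcomm, &all3);

    MPI_Comm_disconnect(&all3);
    MPI_Finalize();
    return 0;
}
```

The point of the sketch is that the final MPI_Comm_connect is collective over newcomm, so per MPI-2 semantics both members should be joined to the new inter-communicator, yet only the root survives the routing step.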