I am referring to v1.6.

On Sep 12, 2012, at 5:27 PM, Ralph Castain wrote:

> What version of OMPI are you referring to?
> 
> On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
> wrote:
> 
>> Dear all,
>> 
>> I observed strange behavior with MPI_Comm_connect and MPI_Comm_disconnect.
>> In short, after two processes connect to each other through a port and merge
>> to create an intra-comm (rank 0 and rank 1), only one of them (the root) is
>> thereafter able to reach a third, new process through MPI_Comm_connect.
>> 
>> I can explain with an example:
>> 
>> 1. Assume three MPI programs, each with its own MPI_COMM_WORLD of size 1
>> (rank 0): process1, process2, and process3.
>> 2. Process2 opens a port and waits in MPI_Comm_accept.
>> 3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and creates
>> an inter-comm.
>> 4. Process1 and process2 participate in MPI_Intercomm_merge and create an
>> intra-comm (say, newcomm).
>> 5. Process3 has also opened a port and is now waiting in MPI_Comm_accept.
>> 6. Process1 and process2 try to connect to process3 with
>> MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).
>> 
>> At this stage, only the root process of newcomm is able to connect to
>> process3; the other one cannot find a route. If the root is process1, then
>> process2 fails, and vice versa.
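>> 
>> To make the sequence concrete, this is roughly what the client (process1)
>> does -- a condensed sketch rather than the attached client.c verbatim, with
>> placeholder port strings and illustrative root/cleanup choices:
>> 
>>     /* client.c (process1), condensed sketch */
>>     #include <string.h>
>>     #include <mpi.h>
>> 
>>     int main(int argc, char **argv)
>>     {
>>         char port2[MPI_MAX_PORT_NAME], port3[MPI_MAX_PORT_NAME];
>>         MPI_Comm inter12, newcomm, all3;
>> 
>>         MPI_Init(&argc, &argv);
>> 
>>         /* port names pasted in by hand (placeholders here) */
>>         strcpy(port2, "<port printed by the first server>");
>>         strcpy(port3, "<port printed by the second server>");
>> 
>>         /* step 3: connect to process2 -> inter-comm */
>>         MPI_Comm_connect(port2, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter12);
>> 
>>         /* step 4: merge into an intra-comm spanning process1 and process2 */
>>         MPI_Intercomm_merge(inter12, 0, &newcomm);
>> 
>>         /* step 6: both members of newcomm connect collectively to process3;
>>            this is where the non-root member cannot find a route */
>>         MPI_Comm_connect(port3, MPI_INFO_NULL, 0, newcomm, &all3);
>> 
>>         MPI_Comm_disconnect(&all3);
>>         MPI_Comm_disconnect(&inter12);
>>         MPI_Comm_free(&newcomm);
>>         MPI_Finalize();
>>         return 0;
>>     }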
>> 
>> I have attached a tar file with a small example of this case. To observe the
>> above scenario, run the examples as follows:
>> 
>> 1. Start two separate instances of "server":
>>      mpirun -np 1 ./server
>>      mpirun -np 1 ./server
>> 
>> 2. Each server will print out its port name. Copy and paste the port names
>> into client.c (in the strcpy calls); a condensed server-side sketch follows
>> the output in step 4.
>> 3. Compile client.c and start the client:
>>      mpirun -np 1 ./client
>> 
>> 4. During the final MPI_Comm_connect, the first server (which is process2)
>> prints the following output:
>> 
>> [[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
>> [0] func:0   libopen-rte.2.dylib                 0x0000000100055afb opal_backtrace_print + 43
>> [1] func:1   mca_rml_oob.so                      0x000000010017aec3 rml_oob_recv_route_callback + 739
>> [2] func:2   mca_oob_tcp.so                      0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
>> [3] func:3   mca_oob_tcp.so                      0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
>> [4] func:4   libopen-rte.2.dylib                 0x0000000100064a55 opal_event_base_loop + 837
>> [5] func:5   mpirun                              0x00000001000018d1 orterun + 3428
>> [6] func:6   mpirun                              0x0000000100000b6b main + 27
>> [7] func:7   mpirun                              0x0000000100000b48 start + 52
>> [8] func:8   ???                                 0x0000000000000004 0x0 + 4
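>> 
>> For reference, the accept side of each server boils down to roughly the
>> following (again a condensed sketch, not the attached server.c verbatim; the
>> printf format and cleanup are illustrative):
>> 
>>     /* server.c, condensed sketch of the accept side */
>>     #include <stdio.h>
>>     #include <mpi.h>
>> 
>>     int main(int argc, char **argv)
>>     {
>>         char port[MPI_MAX_PORT_NAME];
>>         MPI_Comm inter;
>> 
>>         MPI_Init(&argc, &argv);
>> 
>>         /* open a port and print it so it can be pasted into client.c */
>>         MPI_Open_port(MPI_INFO_NULL, port);
>>         printf("port: %s\n", port);
>>         fflush(stdout);
>> 
>>         /* wait for the connect (step 2 for the first server,
>>            step 5 for the second one) */
>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>> 
>>         /* the first server then merges (step 4) and takes part in the
>>            final MPI_Comm_connect over the merged communicator (step 6);
>>            that part is omitted here */
>> 
>>         MPI_Comm_disconnect(&inter);
>>         MPI_Close_port(port);
>>         MPI_Finalize();
>>         return 0;
>>     }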
>> 
>> Note that, to keep this example simple, I am not using publish/lookup; I am
>> copying the port names manually.
>> 
>> Can someone please look into this problem? We really want to use this for a
>> project but are blocked by this bug.
>> 
>> Thanks!
>> Best,
>> Suraj
>> 
>> <ac-test.tar>
> 
> 

