Ralph, thanks for investigating.
I've applied the two patches you mentioned earlier and ran with ompi-server. Although I was able to run our standalone test, when I integrated the changes into our code, the processes entered a tight loop and allocated all the available memory when calling MPI_Comm_connect. I was not able to identify why it works standalone but not integrated with our code. If I find out why, I'll let you know. Looking forward to your findings. We'll be happy to test any patches if you have some!

p.
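PS: in case it helps, here is a minimal sketch of the accept/connect/merge sequence described in the quoted message below. This is not the real ben12.c: NCLIENTS, the "port.txt" exchange, and the argv handling are simplifications (in our real code the processes find each other through a name server, and ben12 takes "3 1"/"3 0" arguments), but the call sequence is the same.

/* Sketch only -- not the real ben12.c. NCLIENTS, "port.txt" and the
 * argv handling are placeholders. Start the server first, then the
 * clients one at a time:
 *   server:  ./sketch 1        client:  ./sketch 0
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCLIENTS 3                      /* placeholder: total clients */

int main(int argc, char **argv)
{
    int is_server = (argc > 1) && atoi(argv[1]);
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, intra;
    FILE *f;
    int size;

    MPI_Init(&argc, &argv);

    if (is_server) {
        /* Open a port and publish it through a file on the shared FS. */
        MPI_Open_port(MPI_INFO_NULL, port);
        f = fopen("port.txt", "w");
        fprintf(f, "%s\n", port);
        fclose(f);
        intra = MPI_COMM_SELF;          /* intracomm so far: server only */
    } else {
        /* Read the port (assumes the server already wrote the file),
         * connect, and merge; high=1 puts the newcomer after the
         * ranks already present in the intracomm.                      */
        f = fopen("port.txt", "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Intercomm_merge(inter, 1, &intra);
        MPI_Comm_disconnect(&inter);
    }

    /* Everyone already merged takes part in accepting each remaining
     * client: MPI_Comm_accept is collective over 'intra', with the
     * server (always rank 0) acting as root.                          */
    MPI_Comm_size(intra, &size);
    while (size < NCLIENTS + 1) {
        MPI_Comm merged;
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 0, &merged);
        MPI_Comm_disconnect(&inter);
        if (intra != MPI_COMM_SELF)
            MPI_Comm_free(&intra);
        intra = merged;
        MPI_Comm_size(intra, &size);
    }

    /* Server + all clients now share one intracomm; rank 0 could
     * MPI_Recv from MPI_ANY_SOURCE here.                              */
    MPI_Finalize();
    return 0;
}

Using high=1 on the connecting side and high=0 on the accepting side keeps the server at rank 0 of every merged intracomm, so rank 0 can always act as the root of the next accept.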
On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, I can reproduce this problem. Frankly, I don't think this ever worked
> with OMPI, and I'm not sure how the choice of BTL makes a difference.
>
> The program is crashing in the communicator definition, which involves a
> communication over our internal out-of-band messaging system. That system
> has zero connection to any BTL, so it should crash either way.
>
> Regardless, I will play with this a little as time allows. Thanks for the
> reproducer!
>
>
> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>
>> Hi,
>>
>> I'm trying to run a test program which consists of a server creating a
>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>> connect to the server.
>>
>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>> clients, I get the following error message:
>>
>> [node003:32274] [[37084,0],0]:route_callback tried routing message
>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>
>> This only happens with the openib BTL. With the tcp BTL it works
>> perfectly fine (ofud also works, as a matter of fact). This has been
>> tested on two completely different clusters, with identical results.
>> In either case, the IB fabric works normally.
>>
>> Any help would be greatly appreciated! Several people in my team have
>> looked at the problem. Google and the mailing list archive did not
>> provide any clue. I believe that from an MPI standpoint my test
>> program is valid (and it works with TCP, which makes me feel better
>> about the sequence of MPI calls).
>>
>> Regards,
>> Philippe.
>>
>>
>>
>> Background:
>>
>> I intend to use Open MPI to transport data inside a much larger
>> application. Because of that, I cannot use mpiexec. Each process is
>> started by our own "job management" and uses a name server to find
>> out about the others. Once all the clients are connected, I would like
>> the server to do MPI_Recv to get the data from all the clients. I don't
>> care about the order or which client is sending data, as long as I can
>> receive it with one call. To do that, the clients and the server go
>> through a series of Comm_accept/Comm_connect/Intercomm_merge calls so
>> that at the end, all the clients and the server are inside the same
>> intracomm.
>>
>> Steps:
>>
>> I have a sample program that shows the issue. I tried to make it as
>> short as possible. It needs to be executed on a shared file system
>> like NFS because the server writes the port info to a file that the
>> clients will read. To reproduce the issue, the following steps should
>> be performed:
>>
>> 0. compile the test with "mpicc -o ben12 ben12.c"
>> 1. ssh to the machine that will be the server
>> 2. run ./ben12 3 1
>> 3. ssh to the machine that will be client #1
>> 4. run ./ben12 3 0
>> 5. repeat steps 3-4 for clients #2 and #3
>>
>> The server accepts the connection from client #1 and merges it into a
>> new intracomm. It then accepts the connection from client #2 and
>> merges it. When client #3 arrives, the server accepts the connection,
>> but that causes clients #1 and #2 to die with the error above (see
>> the complete trace in the tarball).
>>
>> The exact steps are:
>>
>> - server opens port
>> - server does accept
>> - client #1 does connect
>> - server and client #1 do merge
>> - server does accept
>> - client #2 does connect
>> - server, client #1 and client #2 do merge
>> - server does accept
>> - client #3 does connect
>> - server, client #1, client #2 and client #3 do merge
>>
>>
>> My InfiniBand network works normally with other test programs or
>> applications (MPI or others, like Verbs).
>>
>> Info about my setup:
>>
>> Open MPI version = 1.4.1 (I also tried 1.4.2, a nightly snapshot of
>> 1.4.3, and a nightly snapshot of 1.5 --- all show the same error)
>> config.log in the tarball
>> "ompi_info --all" in the tarball
>> OFED version = 1.3, installed from RHEL 5.3
>> Distro = Red Hat Enterprise Linux 5.3
>> Kernel = 2.6.18-128.4.1.el5 x86_64
>> subnet manager = built-in SM from the Cisco/Topspin switch
>> output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>> "ulimit -l" says "unlimited"
>>
>> The tarball contains:
>>
>> - ben12.c: my test program showing the behavior
>> - config.log / config.out / make.out / make-install.out /
>>   ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>> - trace-tcp.txt: output of the server and each client when it works
>>   with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>> - trace-ib.txt: output of the server and each client when it fails
>>   with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
>>
>> I hope I provided enough info for somebody to reproduce the problem...
>> <ompi-output.tar.bz2>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>