I'm running on a Fedora Core 9 Linux cluster with the MPI and home directories mounted on the compute nodes via NFS. Since the executables live on a remote server, I configured Open MPI with --disable-dlopen, and have even gone as far as enabling static and disabling shared libraries. While trying to work around this problem, I upgraded from Open MPI 1.3.3 to 1.4.1. The binaries were compiled with gcc 4.3.0, and the interconnect is ssh over Ethernet.
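For reference, the build was configured roughly like this, based on the flags described above (the install prefix is hypothetical, and I may be misremembering the exact order):

```shell
# Sketch of the configure line used for the 1.4.1 build;
# /opt/openmpi-1.4.1 is a hypothetical prefix.
./configure --prefix=/opt/openmpi-1.4.1 \
    --disable-dlopen --enable-static --disable-shared
make all install
```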

Running from the fileserver, which is practically identical to the compute nodes, I can run the C++ hello world (examples/hello_cxx.cc) on up to three machines if the fileserver is one of them, but only two if the fileserver is not in the host list. In other words, either this

   mpirun -H filesrv,node1,node2 cpphello

or

   mpirun -H node1,node2 cpphello

for any number of processes works correctly. Beyond that two-remote-node limit, however, the application just hangs: orted shows up on the remote systems, but nothing happens. If I attempt the same thing from any of the compute nodes, any run involving a remote node hangs in the same way. This behavior is not limited to hello world; it also occurs with non-MPI programs such as hostname.

When I run the C hello world (examples/hello_c.c), I get the same hanging behavior, plus mca_btl_tcp_endpoint_complete_connect "no route to host" errors, even though the processes appear to complete successfully; I still have to kill the overall mpirun process with ctrl-c. While testing this I also ran the boost::mpi tests and noticed that the all_gather_test process would eventually start remotely, but would peg the processor and never return. I have not seen that happen with the hello world programs.
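One thing I plan to check, given the "no route to host" errors: whether the compute nodes can open TCP connections directly to each other, since the tcp BTL connects ranks peer-to-peer and Fedora ships with iptables enabled by default. A rough sketch (the port number is arbitrary, and nc option syntax varies between netcat variants):

```shell
# On node1: listen on an arbitrary high TCP port.
# (Some netcat variants want "nc -l -p 5000" instead.)
nc -l 5000

# On node2: try to connect to node1 on that port.
nc node1 5000

# If this fails while ssh between the same nodes works, a
# per-node firewall is a likely culprit. As a temporary test
# only, the rules can be flushed on each node (as root):
#   iptables -F
```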

Since things run better from the fileserver itself, I suspect the NFS mount is involved, but I don't know how to test that or what to do about it.
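One test I could try to rule NFS in or out (the /tmp path is hypothetical): copy the binary onto node-local storage on each compute node and point mpirun at the local path instead of the NFS-mounted one.

```shell
# Stage the executable on local disk on each compute node
# (hostnames as above; /tmp/cpphello is a hypothetical path).
for h in node1 node2 node3; do
    scp cpphello "$h:/tmp/cpphello"
done

# Run from the local copies. If this works where the
# NFS-resident copy hangs, the mount is implicated.
mpirun -H node1,node2,node3 /tmp/cpphello
```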

Any help would be greatly appreciated.

Rob
