I'm running on a Fedora Core 9 Linux cluster with the MPI and home directories mounted on the compute nodes via NFS. Since the executables live on a remote server, I configured Open MPI with --disable-dlopen, and have even gone as far as enabling static and disabling shared libraries. While trying to work around this problem, I upgraded from Open MPI 1.3.3 to 1.4.1. The binaries were compiled with gcc 4.3.0, and the interconnect is ssh over Ethernet.
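For reference, the build was configured roughly like this, based on the flags described above (the install prefix is hypothetical, and I may be misremembering the exact order):

```shell
# Sketch of the configure line used for the 1.4.1 build;
# /opt/openmpi-1.4.1 is a hypothetical prefix.
./configure --prefix=/opt/openmpi-1.4.1 \
    --disable-dlopen --enable-static --disable-shared
make all install
```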

Running from the fileserver, which is practically identical to the compute nodes, I can run the C++ hello world (examples/hello_cxx.cc) on up to three machines if the fileserver is one of them, but only two if the fileserver is not in the host list. In other words, either this

   mpirun -H filesrv,node1,node2 cpphello

or

   mpirun -H node1,node2 cpphello

for any number of processes works correctly. Beyond that two-remote-node limit, however, the application just hangs: orted shows up on the remote systems, but nothing happens. If I attempt the same thing from any of the compute nodes, any run involving a remote node hangs in the same way. This behavior is not limited to hello world; it also occurs with non-MPI programs such as hostname.

When I run the C hello world (examples/hello_c.c), I get the same hanging behavior, plus mca_btl_tcp_endpoint_complete_connect "no route to host" errors, even though the processes appear to complete successfully; I still have to kill the overall mpirun process with ctrl-c. While testing this I also ran the boost::mpi tests and noticed that the all_gather_test process would eventually start remotely, but would peg the processor and never return. I have not seen that happen with the hello world programs.
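One thing I plan to check, given the "no route to host" errors: whether the compute nodes can open TCP connections directly to each other, since the tcp BTL connects ranks peer-to-peer and Fedora ships with iptables enabled by default. A rough sketch (the port number is arbitrary, and nc option syntax varies between netcat variants):

```shell
# On node1: listen on an arbitrary high TCP port.
# (Some netcat variants want "nc -l -p 5000" instead.)
nc -l 5000

# On node2: try to connect to node1 on that port.
nc node1 5000

# If this fails while ssh between the same nodes works, a
# per-node firewall is a likely culprit. As a temporary test
# only, the rules can be flushed on each node (as root):
#   iptables -F
```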

Since things run better from the fileserver itself, I suspect the NFS mount is involved, but I don't know how to test that or what to do about it.
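One test I could try to rule NFS in or out (the /tmp path is hypothetical): copy the binary onto node-local storage on each compute node and point mpirun at the local path instead of the NFS-mounted one.

```shell
# Stage the executable on local disk on each compute node
# (hostnames as above; /tmp/cpphello is a hypothetical path).
for h in node1 node2 node3; do
    scp cpphello "$h:/tmp/cpphello"
done

# Run from the local copies. If this works where the
# NFS-resident copy hangs, the mount is implicated.
mpirun -H node1,node2,node3 /tmp/cpphello
```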

Any help would be greatly appreciated.

Rob
