I'm running into a hang that is very easy to reproduce. Basically, something like this:

    % mpirun -H remote_node hostname
    remote_node
    ^C

That is, I run a program (it doesn't need to be an MPI program) on a remote node. The program runs and prints its output, but my local orterun never returns, so I have to interrupt it with ^C. The problem seems to be correlated with the OS version running on the remote node (some very recent builds of Solaris).

The problem would seem to be in the OS, though arguably it could be a long-standing OMPI problem that is only now being exposed by a change in the OS. Regardless, does anyone have suggestions on where I should be looking?

So far, it looks to me as though the HNP orterun forks a child that launches an ssh process to start the remote orted. The remote orted then daemonizes itself (it forks a child and the parent goes away, thereby detaching the daemon from the controlling terminal) and runs the user binary. It seems to me that this daemonization is related to the problem. Specifically, if I use "mpirun --debug-daemons", there is no daemonization and the hang does not occur. Perhaps, with some recent OS changes, the daemonized process is no longer alerting the HNP orterun when it's done.
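
To make the mechanism concrete, here is a minimal Python sketch of the generic fork-and-detach pattern as I understand it. This is not ORTE code: `launch_daemon`, `work`, and the pipe used to report back are names and choices I made up for illustration. It only assumes POSIX `fork`/`setsid` semantics. The point it shows is that the process the caller started exits right away, so the caller learns that the detached daemon is finished only through some separate channel (here a pipe).

```python
import os


def launch_daemon(work):
    """Run work() in a detached process; return (launcher_status, pid, sid, result).

    Hypothetical illustration only, not how orted is actually implemented.
    """
    read_fd, write_fd = os.pipe()
    launcher_pid = os.fork()
    if launcher_pid == 0:
        # Launcher: stands in for the process that was started remotely.
        try:
            os.close(read_fd)
            if os.fork() > 0:
                os._exit(0)  # the parent half exits immediately
            # Daemon: a new session means no controlling terminal.
            os.setsid()
            report = f"{os.getpid()} {os.getsid(0)} {work()}"
            os.write(write_fd, report.encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # never fall back into the caller's code
    # Caller: the launcher exiting says nothing about the daemon's progress.
    os.close(write_fd)
    _, status = os.waitpid(launcher_pid, 0)
    # The only completion signal is end-of-file on the pipe.
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    pid, sid, result = b"".join(chunks).decode().split(" ", 2)
    return status, int(pid), int(sid), result
```

If the daemon in a setup like this never closed its end of the reporting channel, or the caller never saw the close, the caller would block in the read loop forever, which is at least the same shape as the hang I'm seeing.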

Any suggestions on where I should focus my efforts? I'm working with v1.5.
