I'm running into a hang that is very easy to reproduce. Basically,
something like this:
% mpirun -H remote_node hostname
remote_node
^C
That is, I run a program (it doesn't need to be an MPI program) on a
remote node. The program runs and prints its output, but my local
orterun never returns; I have to interrupt it with ^C. The problem seems
to be correlated with the OS version running on the remote node (some
very recent builds of Solaris).
The problem would seem to be in the OS, though arguably it could be a
long-standing OMPI problem that a change in the OS has exposed.
Regardless, does anyone have suggestions where I should be looking?
So far, it looks to me like the HNP orterun forks a child that launches
an ssh process to start the remote orted. Then, the remote orted
daemonizes itself (it forks a child and the parent exits, thereby
detaching the daemon from the controlling terminal) and runs the user
binary. It seems to me that this daemonization is related to the problem.
Specifically, if I use "mpirun --debug-daemons", there is no
daemonization and the hang does not occur. Perhaps, with some recent OS
changes, the daemonized process is no longer alerting the HNP orterun
when it's done.
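
To make the sequence above concrete, here is a minimal C sketch of the
fork-and-detach pattern, with a pipe standing in for whatever channel the
daemon uses to report completion. This is not the orted code: the function
name launch_detached_and_wait and the one-byte 'D' message are hypothetical,
chosen only for illustration, and whether the real hang has anything to do
with such a channel is an open question.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not Open MPI code): start a detached "daemon"
 * and wait for it to report completion over a pipe.
 * Returns 0 if the completion byte arrives, -1 otherwise. */
int launch_detached_and_wait(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Intermediate process: plays the role of the pre-daemon orted. */
        close(fds[0]);
        pid_t daemon_pid = fork();
        if (daemon_pid < 0)
            _exit(1);
        if (daemon_pid > 0)
            _exit(0);            /* parent exits, orphaning the daemon */

        /* Daemon: start a new session, dropping the controlling terminal. */
        if (setsid() < 0)
            _exit(1);

        /* ... the user binary would run here ... */

        char done = 'D';         /* assumed "I'm finished" message */
        ssize_t n = write(fds[1], &done, 1);
        close(fds[1]);
        _exit(n == 1 ? 0 : 1);
    }

    /* Launcher: plays the role of the local orterun. */
    close(fds[1]);
    int status;
    if (waitpid(pid, &status, 0) < 0) {   /* reaps only the intermediate */
        close(fds[0]);
        return -1;
    }

    /* Blocks until the daemon writes, or until every copy of the write
     * end is closed. If the daemon kept the write end open without ever
     * writing, this read would never return. */
    char buf;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);
    return (n == 1 && buf == 'D') ? 0 : -1;
}
```

Calling launch_detached_and_wait() from a small main() and checking for a
return value of 0 is enough to try it. The point of the sketch is that the
launcher cannot use waitpid() on the daemon once the intermediate parent has
exited, so it depends entirely on the separate channel being written or
closed.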
Any suggestions where I should focus my efforts? I'm working with v1.5.