On 02/22/12 14:54, Ralph Castain wrote:
> That doesn't really address the issue, though. What I want to know is:
> what happens when you try to bind processes? What about
> -bind-to-socket, and -persocket options? Etc. Reason I'm concerned:
> I'm not sure what happens if the socket layer isn't present. The logic
> in 1.5 is pretty old, but I believe it relies heavily on sockets being
> present.
Okay. So,
*) "out of the box", basically nothing works. For example, "mpirun
hostname" segfaults.
*) With "--mca orte_num_sockets 1", stuff appears to work.
*) With "--mca orte_num_sockets 1" and adding either "--bysocket
--bind-to-socket" or "--npersocket <n>", I get:
--------------------------------------------------------------------------
Unable to bind to socket -13 on node burl-ct-v20z-10.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered
an error:
Error name: Fatal
Node: burl-ct-v20z-10
when attempting to start process rank 0.
--------------------------------------------------------------------------
2 total processes failed to start
So, I hear Brice's comment that this is an old kernel. And, I hear what
you're saying about a "real" fix being expensive. Nevertheless, to my
taste, automatically setting num_sockets==1 when num_sockets==0 is
detected makes a lot of sense. It makes things "basically" work,
turning a situation where everything including "mpirun hostname"
segfaults into a situation where default usage works just fine. What
remains broken is binding, which generates an error message that gives
the user a hope of making progress (turning off binding). That's in
contrast to expecting users to go from
% mpirun hostname
Segmentation fault
to knowing that they should set orte_num_sockets==1.
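
To make the proposal concrete, here is a minimal sketch in C of the
fallback I have in mind. The function name effective_num_sockets is made
up for illustration and is not an existing Open MPI symbol. The sketch
also assumes the detected count arrives as a plain int, and it says
nothing about where such a check would belong in the 1.5 code.

```c
/* Hypothetical helper, not actual Open MPI code: normalize a detected
 * socket count so that later logic never sees zero or a negative value. */
int effective_num_sockets(int detected)
{
    if (detected <= 0) {
        /* Detection failed (for example, on an old kernel that does not
         * expose socket information), so assume a single socket. */
        return 1;
    }
    /* Detection succeeded; use the reported count unchanged. */
    return detected;
}
```

With something like this, a detected value of 0 would behave like
"--mca orte_num_sockets 1" without the user having to ask for it.
Binding would still fail as shown above, but with an error message
rather than a segfault.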