On 02/22/12 14:54, Ralph Castain wrote:
> That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present.
Okay.  So,

*) "out of the box", basically nothing works. For example, "mpirun hostname" segfaults.

*) With "--mca orte_num_sockets 1", stuff appears to work.

*) With "--mca orte_num_sockets 1" and adding either "--bysocket --bind-to-socket" or "--npersocket <n>", I get:

--------------------------------------------------------------------------
Unable to bind to socket -13 on node burl-ct-v20z-10.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--------------------------------------------------------------------------
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel. And I hear what you're saying about a "real" fix being expensive. Nevertheless, to my taste, automatically setting num_sockets to 1 when num_sockets==0 is detected makes a lot of sense. It makes things "basically" work, turning a situation where everything, including "mpirun hostname", segfaults into one where default usage works just fine. What remains broken is binding, which generates an error message that at least gives the user a hope of making progress (by turning off binding). That's in contrast to expecting users to go from

% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
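
To make the proposal concrete, here is a rough sketch of the fallback I have in mind. This is not the actual 1.5 code: the function name effective_num_sockets() and the warning text are made up for illustration, and I am only assuming that the detected socket count is available as a plain int somewhere before the mapping/binding logic runs.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not an existing Open MPI function): sanitize the
 * socket count reported by topology detection. If detection found no
 * sockets (0, or a nonsensical negative value), fall back to 1 so that
 * default, unbound launches keep working.
 */
static int effective_num_sockets(int detected)
{
    if (detected <= 0) {
        /* Assumption: a one-line warning is enough to hint that
         * socket-level binding may still fail on this node. */
        fprintf(stderr,
                "warning: no sockets detected; assuming 1 "
                "(binding to sockets may not work)\n");
        return 1;
    }
    return detected;
}
```

The idea would be to apply something like this once, right where the socket count is first read, so the rest of the logic never sees a zero. Binding would still fail as shown above, but with an error message instead of a segfault.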
