That's what we needed to know, i.e., that with num_sockets=1 a binding request generates an error instead of segfaulting down the road. I can submit a CMR that sets num_sockets to 1 automatically when 0 is detected.
thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what
>> happens when you try to bind processes? What about -bind-to-socket, and
>> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if
>> the socket layer isn't present. The logic in 1.5 is pretty old, but I
>> believe it relies heavily on sockets being present.
> Okay. So,
>
> *) "out of the box", basically nothing works. For example, "mpirun
> hostname" segfaults.
>
> *) With "--mca orte_num_sockets 1", stuff appears to work.
>
> *) With "--mca orte_num_sockets 1" and adding either "--bysocket
> --bind-to-socket" or "--npersocket <n>", I get:
>
> --------------------------------------------------------------------------
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error name: Fatal
> Node: burl-ct-v20z-10
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 2 total processes failed to start
>
> So, I hear Brice's comment that this is an old kernel. And, I hear what
> you're saying about a "real" fix being expensive. Nevertheless, to my taste,
> automatically setting num_sockets==1 when num_sockets==0 is detected makes a
> lot of sense. It makes things "basically" work, turning a situation where
> everything including "mpirun hostname" segfaults into a situation where
> default usage works just fine. What remains broken is binding, which
> generates an error message that gives the user a hope of making progress
> (turning off binding). That's in contrast from expecting users to go from
>
> % mpirun hostname
> Segmentation fault
>
> to knowing that they should set orte_num_sockets==1.
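
For illustration only, the fallback discussed above could be sketched roughly as follows. This is a standalone sketch and not the actual ORTE code: normalize_num_sockets and check_socket_binding are hypothetical names, and the explicit up-front check on binding requests is an assumption made for the example (in the 1.5 runs quoted above, the error instead surfaced later as "Unable to bind to socket").

```c
#include <stdio.h>

/* Hypothetical helper, not an Open MPI/ORTE function.
 * If detection reports no sockets, fall back to a single socket so that
 * default (unbound) launches such as "mpirun hostname" can proceed. */
int normalize_num_sockets(int detected)
{
    if (detected <= 0) {
        fprintf(stderr, "warning: no sockets detected, assuming 1\n");
        return 1;
    }
    return detected;
}

/* Hypothetical check, assumed for this sketch only.
 * Returns 0 if the launch may proceed, or -1 if socket binding was
 * requested on a node where detection found no sockets. */
int check_socket_binding(int detected, int bind_requested)
{
    if (bind_requested && detected <= 0) {
        fprintf(stderr,
                "error: cannot bind to socket: no socket information "
                "on this node; retry without binding options\n");
        return -1;
    }
    return 0;
}
```

The point of the sketch is only the shape of the behavior: unbound runs get a usable default, while a binding request fails with a message the user can act on rather than a segfault.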