That's what we needed to know - i.e., that setting num_sockets=1 generates an 
error instead of segfaulting down the road. I can submit a CMR to do so.
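
For reference, a minimal sketch of the fallback being discussed, i.e.
treating a detected socket count of 0 as 1. This is not the actual ORTE
code: the function name effective_num_sockets and the warning text are
hypothetical, chosen only to illustrate the idea.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the Open MPI source tree.
 * If topology detection reports no sockets (e.g. on an old kernel),
 * fall back to a single socket so that default, unbound launches can
 * proceed. Binding requests may still fail later with an error.
 */
static int effective_num_sockets(int detected)
{
    if (detected <= 0) {
        fprintf(stderr,
                "warning: no sockets detected; assuming 1 socket\n");
        return 1;
    }
    return detected;
}
```

With such a fallback, the default case would behave as if the user had
passed "--mca orte_num_sockets 1" by hand.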

thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what 
>> happens when you try to bind processes? What about -bind-to-socket, and 
>> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if 
>> the socket layer isn't present. The logic in 1.5 is pretty old, but I 
>> believe it relies heavily on sockets being present.
> Okay.  So,
> 
> *)  "out of the box", basically nothing works.  For example, "mpirun 
> hostname" segfaults.
> 
> *)  With "--mca orte_num_sockets 1", stuff appears to work.
> 
> *)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
> --bind-to-socket" or "--npersocket <n>", I get:
> 
> --------------------------------------------------------------------------
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Fatal
> Node: burl-ct-v20z-10
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 2 total processes failed to start
> 
> So, I hear Brice's comment that this is an old kernel.  And, I hear what 
> you're saying about a "real" fix being expensive.  Nevertheless, to my taste, 
> automatically setting num_sockets==1 when num_sockets==0 is detected makes a 
> lot of sense.  It makes things "basically" work, turning a situation where 
> everything including "mpirun hostname" segfaults into a situation where 
> default usage works just fine.  What remains broken is binding, which 
> generates an error message that gives the user a hope of making progress 
> (turning off binding).  That's in contrast to expecting users to go from
> 
> % mpirun hostname
> Segmentation fault
> 
> to knowing that they should set orte_num_sockets==1.

