On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote:

> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>>>>> pukes on divide by zero.  OS info was listed in the original message 
>>>>>> (below).  Might we want to do something else?  E.g., assume 
>>>>>> num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, which 
>>>>>> one (or more) of the following should be fixed?
>>>>>> 
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay.  So, Brice's other e-mail indicates that the first two are "not 
>>>>> really uncommon":
>>>>> 
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case 
>>>>> rather than just dividing by num_sockets.  This is v1.5 
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
>>>> much - you'll get past that point in the code, but attempts to bind are 
>>>> likely to fail down the road. Fixing it will require some significant 
>>>> effort.
>>>> 
>>>> Given we haven't heard reports of this before, I'm not convinced it is a 
>>>> widespread problem.
> I assume we don't see the problem as widespread because it was only 
> introduced into v1.5 in r25914.  In my mind, the real question is how common 
> it is for hwloc to decide numsockets==0.  On that one, Brice asserts it 
> "isn't really uncommon."
>>>> For now, let's just use the mca param and see what happens.
>>> I am probably missing something, but why would setting num_sockets to 1
>>> work fine as an mca param, while artificially setting it as said above
>>> wouldn't?
>> Because the param means that it isn't hardwired into the code base. I want 
>> to first verify that artificially forcing num_sockets to 1 doesn't break the 
>> code down the road, so the less we have to change to find out, the better.
> That sounds a lot different to me than the earlier statement.  Thanks for 
> asking that question, Brice.  Anyhow, I tried using "--mca orte_num_sockets 
> 1" and that seems to allow basic programs to run.

That doesn't really address the issue, though. What I want to know is: what 
happens when you try to bind processes? What about -bind-to-socket, and 
-persocket options? Etc.

Reason I'm concerned: I'm not sure what happens if the socket layer isn't 
present. The logic in 1.5 is pretty old, but I believe it relies heavily on 
sockets being present.
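
For reference, the kind of guard Eugene suggested (treat num_sockets==0 as 
num_sockets==1) could be sketched roughly as below. This is only a minimal 
sketch, not the actual orte_odls_base_open() code: the function names 
effective_num_sockets() and cores_per_socket() are hypothetical, and it only 
avoids the divide by zero. It says nothing about whether binding works later 
on when no socket level exists.

```c
#include <assert.h>

/* Hypothetical helper, not OMPI code: if no socket level was detected
 * (num_sockets == 0), assume one implicit socket spanning the machine. */
int effective_num_sockets(int num_sockets)
{
    return (num_sockets > 0) ? num_sockets : 1;
}

/* Hypothetical helper: compute cores per socket without ever
 * dividing by zero when the socket count is reported as 0. */
int cores_per_socket(int num_cores, int num_sockets)
{
    assert(num_cores >= 0);  /* precondition for this sketch */
    return num_cores / effective_num_sockets(num_sockets);
}
```

With a fallback like this, a machine reporting 8 cores and 0 sockets would be 
treated as one 8-core socket, which matches what "--mca orte_num_sockets 1" 
forces by hand.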
