On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

> On 22/02/2012 17:48, Ralph Castain wrote:
>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>> 
>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>>> pukes on divide by zero.  OS info was listed in the original message 
>>>> (below).  Might we want to do something else?  E.g., assume num_sockets==1 
>>>> when num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
>>>> the following should be fixed?
>>>> 
>>>> *) on this platform, hwloc finds no socket level
>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>> *) OMPI divides by 0 and barfs on basically everything
>>> Okay.  So, Brice's other e-mail indicates that the first two are "not 
>>> really uncommon":
>>> 
>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>> reports nothing interesting (only one machine object with multiple PU
>>>> children). So numsockets==0 isn't really uncommon.
>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
>>> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() 
>>> since r25914.
>> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
>> much - you'll get past that point in the code, but attempts to bind are 
>> likely to fail down the road. Fixing it will require some significant effort.
>> 
>> Given we haven't heard reports of this before, I'm not convinced it is a 
>> widespread problem. For now, let's just use the mca param and see what 
>> happens.
> 
> I am probably missing something, but why would setting num_sockets to 1
> work fine as an MCA param, while artificially setting it as described above
> wouldn't?

Because the param means that it isn't hardwired into the code base. I want to 
first verify that artificially forcing num_sockets to 1 doesn't break the code 
down the road, so the less we have to change to find out, the better.
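
For illustration only, here is a minimal sketch of the kind of guard being 
discussed: prefer a user-supplied override, otherwise use the detected 
count, and fall back to 1 when no socket level is reported. This is not the 
actual orte_odls_base_open() code. The function names and the "override" 
argument (standing in for the value of an MCA param) are hypothetical, and 
the sketch only avoids the divide by zero; it says nothing about whether 
binding works afterwards.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of ORTE or hwloc.
 * detected: socket count reported by topology discovery (may be 0).
 * override: user-supplied value, e.g. from an MCA param; <= 0 means unset.
 */
static int effective_num_sockets(int detected, int override)
{
    if (override > 0) {
        return override;   /* the user forced a value */
    }
    if (detected > 0) {
        return detected;   /* topology reported a socket level */
    }
    return 1;              /* no socket level: treat the machine as one socket */
}

/* Hypothetical caller: the division can no longer see a zero divisor. */
static int cores_per_socket(int num_cores, int detected, int override)
{
    int sockets = effective_num_sockets(detected, override);
    assert(sockets > 0);   /* guaranteed by effective_num_sockets() */
    return num_cores / sockets;
}
```

With these assumptions, cores_per_socket(8, 0, 0) treats the machine as a 
single socket and returns 8 instead of dividing by zero, while a detected 
or overridden socket count is used unchanged.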


> 
> Brice
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

