On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

> On 22/02/2012 17:48, Ralph Castain wrote:
>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>
>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>>> pukes on divide by zero. OS info was listed in the original message
>>>> (below). Might we want to do something else? E.g., assume num_sockets==1
>>>> when num_sockets==0 (if you know what I mean)? So, which one (or more) of
>>>> the following should be fixed?
>>>>
>>>> *) on this platform, hwloc finds no socket level
>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>> *) OMPI divides by 0 and barfs on basically everything
>>> Okay. So, Brice's other e-mail indicates that the first two are "not
>>> really uncommon":
>>>
>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>> reports nothing interesting (only one machine object with multiple PU
>>>> children). So numsockets==0 isn't really uncommon.
>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather
>>> than just dividing by num_sockets. This is v1.5 orte_odls_base_open()
>>> since r25914.
>> Unfortunately, just artificially setting the num_sockets to 1 won't solve
>> much - you'll get past that point in the code, but attempts to bind are
>> likely to fail down the road. Fixing it will require some significant effort.
>>
>> Given we haven't heard reports of this before, I'm not convinced it is a
>> widespread problem. For now, let's just use the mca param and see what
>> happens.
>
> I am probably missing something but: Why would setting num_sockets to 1
> work fine as a mca param, while artificially setting it as said above
> wouldn't?

Because using the param means the value isn't hardwired into the code base. I first want to verify that artificially forcing num_sockets to 1 doesn't break the code down the road, so the less we have to change to find out, the better.

>
> Brice
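
For illustration only, here is a minimal sketch in C of the idea discussed above: let a user-supplied override take precedence over the detected socket count, and refuse to divide when the count is still zero. This is not the actual orte_odls_base_open() code. The function names, the `override` argument (standing in for a value that would come from an mca param), and the "cores per socket" quantity are all assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not Open MPI code.
 *
 * detected: socket count reported by topology discovery
 *           (0 when no socket level was found).
 * override: user-supplied value, e.g. read from a runtime parameter;
 *           any value <= 0 is treated as "not set".
 *
 * Returns the socket count to use, or 0 if it is still unknown.
 */
static int effective_num_sockets(int detected, int override)
{
    if (override > 0) {
        return override;            /* explicit user choice wins */
    }
    return detected > 0 ? detected : 0;
}

/*
 * Hypothetical consumer of the socket count. Instead of dividing blindly,
 * it reports an error when the count is unknown.
 *
 * Returns 0 on success and stores the result in *out;
 * returns -1 (leaving *out untouched) when the socket count is 0.
 */
static int cores_per_socket(int num_cores, int detected, int override, int *out)
{
    assert(out != NULL);

    int sockets = effective_num_sockets(detected, override);
    if (sockets == 0) {
        return -1;                  /* caller must handle "no socket level" */
    }
    *out = num_cores / sockets;
    return 0;
}
```

With this shape, a platform where no socket level is detected produces an error instead of a divide by zero unless the user sets the override to 1, which matches the "use the param and see what happens" approach without hardwiring a default into the code.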