On 22/02/2012 20:24, Eugene Loh wrote:
> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and
>>>>>> OMPI pukes on divide by zero. OS info was listed in the original
>>>>>> message (below). Might we want to do something else? E.g.,
>>>>>> assume num_sockets==1 when num_sockets==0 (if you know what I
>>>>>> mean)? So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay. So, Brice's other e-mail indicates that the first two are
>>>>> "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with
>>>>>> multiple PU children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0
>>>>> case rather than just dividing by num_sockets. This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't
>>>> solve much - you'll get past that point in the code, but attempts
>>>> to bind are likely to fail down the road. Fixing it will require
>>>> some significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it
>>>> is a widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into v1.5 in r25914. In my mind, the real question is how
> common it is for hwloc to decide numsockets==0. On that one, Brice
> asserts it "isn't really uncommon."
On Linux, it's uncommon: it only happens on some platforms with very old kernels (2.6.10 or so). Solaris, Darwin and Windows should get sockets in some/most cases. FreeBSD should get x86 sockets correctly because we use cpuid directly there. Unless I am missing something, the other OS drivers (AIX, HP-UX, OSF) have nothing related to sockets.

Brice
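
For illustration only, here is a minimal sketch of the kind of guard discussed in this thread. It is not the actual orte_odls_base_open() code: the type socket_layout_t and the function compute_socket_layout() are hypothetical names invented for this example. The assumption is that num_sockets comes from something like hwloc_get_nbobjs_by_type(), which can return 0 when the topology has no socket level. The sketch avoids the division by zero and also records that no socket level is known, since simply pretending num_sockets==1 would likely make later socket binding fail.

```c
/* Hypothetical result type (not from OMPI): how cores map onto sockets,
 * plus a flag telling callers whether a real socket level exists. */
typedef struct {
    int cores_per_socket;   /* cores per effective socket */
    int socket_level_known; /* 0 if the topology reported no sockets */
} socket_layout_t;

/* Hypothetical helper: computes the layout without ever dividing by a
 * non-positive socket count. */
socket_layout_t compute_socket_layout(int num_cores, int num_sockets)
{
    socket_layout_t layout;

    if (num_sockets <= 0) {
        /* No socket level found: treat the machine as one virtual
         * socket so the arithmetic is safe, and flag that binding to
         * sockets should not be attempted later. */
        layout.cores_per_socket = num_cores;
        layout.socket_level_known = 0;
    } else {
        layout.cores_per_socket = num_cores / num_sockets;
        layout.socket_level_known = 1;
    }
    return layout;
}
```

A caller would check socket_level_known before attempting any socket-level binding, and fall back to core-level binding or no binding when it is 0. Whether that fallback is acceptable for OMPI is exactly the open question above.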
