On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote:

> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>>>>> pukes on divide by zero. OS info was listed in the original message
>>>>>> (below). Might we want to do something else? E.g., assume
>>>>>> num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which
>>>>>> one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay. So, Brice's other e-mail indicates that the first two are "not
>>>>> really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case
>>>>> rather than just dividing by num_sockets. This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve
>>>> much - you'll get past that point in the code, but attempts to bind are
>>>> likely to fail down the road. Fixing it will require some significant
>>>> effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it is a
>>>> widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into v1.5 in r25914. In my mind, the real question is how common
> it is for hwloc to decide numsockets==0. On that one, Brice asserts it
> "isn't really uncommon."
>>>> For now, let's just use the mca param and see what happens.
>>> I am probably missing something but: Why would setting num_sockets to 1
>>> work fine as a mca param, while artificially setting it as said above
>>> wouldn't?
>> Because the param means that it isn't hardwired into the code base. I want
>> to first verify that artificially forcing num_sockets to 1 doesn't break the
>> code down the road, so the less change to find out, the better.
> That sounds a lot different to me than the earlier statement. Thanks for
> asking that question, Brice. Anyhow, I tried using "--mca orte_num_sockets
> 1" and that seems to allow basic programs to run.

That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc.

Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present.
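
For reference, here is a minimal sketch of the guard discussed in this thread: treat a reported socket count of 0 as 1 before dividing. The function names and the cores-per-socket division are assumptions made up for illustration; this is not the actual orte_odls_base_open() code. It only gets past the divide by zero and says nothing about whether binding works afterwards, which is the open question.

```c
/* Hypothetical helpers for illustration only; not OMPI or hwloc functions. */

/* Fall back to one socket when the topology reports no socket level. */
int effective_num_sockets(int detected_sockets)
{
    return detected_sockets > 0 ? detected_sockets : 1;
}

/* Assumed shape of the failing computation: some per-socket quantity
 * obtained by dividing by the socket count. */
int cores_per_socket(int num_cores, int detected_sockets)
{
    return num_cores / effective_num_sockets(detected_sockets);
}
```

With this guard, a machine that reports 8 cores and 0 sockets is treated as a single 8-core socket instead of triggering the divide by zero.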
