Much simpler solution - on that platform, you should add "orte_num_sockets=1" to your default mca param file. Problem solved. It's why that param exists, and we added it specifically at Terry's request for an earlier, similar problem.
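
For reference, a minimal sketch of what that entry could look like. The file location is an assumption about a typical install (the per-user file is usually $HOME/.openmpi/mca-params.conf, the system-wide one <prefix>/etc/openmpi-mca-params.conf); only the parameter name and value come from this thread.

```
# Assumed location: $HOME/.openmpi/mca-params.conf
# Tell ORTE to assume one socket on nodes where hwloc reports none
orte_num_sockets = 1
```

The same setting can normally be passed for a single run with "mpirun --mca orte_num_sockets 1 ...", which is handy for checking that it helps before making it the default.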
On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote:

> On 22/02/2012 07:36, Eugene Loh wrote:
>> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>>> Here are the first of the results of the testing I promised.
>>> I am not 100% sure how to reach the code that Eugene reported as
>>> problematic,
>> I don't think you're going to see it. Somehow, hwloc on the config in
>> question thinks there is no socket level and returns num_sockets==0.
>> If you can run something successfully, your platform won't show the
>> issue.
>
> (Eugene sent hwloc info offlist)
>
> This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
> had no sysfs topology info, but there was some "physical package" info
> in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
> single-core-processor based system. sysfs still has NUMA topology info
> (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
> with one core each but no socket at all. We could assume there is one
> socket per NUMA node but that's a risky hack.
>
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
> will likely work for your computations. Make sure the code isn't going
> to use the first hwloc socket object later; it would obviously get NULL.
>
> Brice
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
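
Brice's two cautions above (replace a socket count of 0 with 1, and never dereference a socket object that may not exist) can be sketched in plain C. This is only an illustration, not the actual ORTE code: struct socket_obj, effective_num_sockets() and socket_index_or() are hypothetical names, and the struct stands in for whatever object the topology library returns. In real code the count would come from something like hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_SOCKET).

```c
#include <stddef.h>

/* Hypothetical stand-in for a topology object such as an hwloc socket. */
struct socket_obj {
    int os_index;
};

/* Treat "no socket level detected" (a count of 0 or less) as one socket. */
static int effective_num_sockets(int reported)
{
    return reported > 0 ? reported : 1;
}

/* Return the socket's index, or the fallback when no socket object exists. */
static int socket_index_or(const struct socket_obj *sock, int fallback)
{
    return sock != NULL ? sock->os_index : fallback;
}
```

With these helpers, a platform that reports no sockets yields effective_num_sockets(0) == 1, and a lookup of the "first socket" that comes back NULL falls through to the caller's default instead of crashing.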