Much simpler solution: on that platform, add "orte_num_sockets=1"
to your default MCA param file. Problem solved. That's why that param exists;
we added it specifically at Terry's request for an earlier, similar problem.


On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote:

> On 22/02/2012 07:36, Eugene Loh wrote:
>> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>>> Here are the first of the results of the testing I promised.
>>> I am not 100% sure how to reach the code that Eugene reported as
>>> problematic,
>> I don't think you're going to see it.  Somehow, hwloc on the config in
>> question thinks there is no socket level and returns num_sockets==0. 
>> If you can run something successfully, your platform won't show the
>> issue.
> 
> (Eugene sent hwloc info offlist)
> 
> This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
> had no sysfs topology info, but there was some "physical package" info
> in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
> single-core-processor based system. sysfs still has NUMA topology info
> (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
> with one core each but no socket at all. We could assume there is one
> socket per NUMA node, but that's a risky hack.
> 
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
> will likely work for your computations. Just make sure the code doesn't
> use the first hwloc socket object later; it would obviously get NULL.
> 
> Brice
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

