Somehow Chris' mail didn't make it back to the list (perhaps it got rejected if he's not subscribed).

Begin forwarded message:

> From: Christopher Yeoh <cy...@au1.ibm.com>
> Date: November 3, 2011 2:59:34 AM EDT
> To: Jeff Squyres <jsquy...@cisco.com>
> Cc: Hardware locality development list <hwloc-de...@open-mpi.org>, Brad
> Benton <brad.ben...@us.ibm.com>
> Subject: Re: [hwloc-devel] hwloc problem
>
> Hi Jeff,
>
> The patch fixes the crash for me. Thanks Brice!
>
> Regards,
>
> Chris
>
> On Wed, 2 Nov 2011 10:23:32 -0400
> Jeff Squyres <jsquy...@cisco.com> wrote:
>
>> Chris --
>>
>> Can you verify the attached patch? If so, I'll commit it to the SVN
>> trunk and the pending OMPI v1.5 patch.
>>
>>
>> On Nov 2, 2011, at 10:05 AM, Brice Goglin wrote:
>>
>>> If we can't find any other way, filtering (during export) would be
>>> an easy solution.
>>>
>>> For the v1.2 branch, the attached patch seems to help. It just
>>> prevents the creation of internal matrices with invalid relative
>>> depth. No internal matrix means no XML export, which means you
>>> don't break your import.
>>>
>>> Brice
>>>
>>>
>>> On 02/11/2011 14:59, Jeff Squyres wrote:
>>>> Should we just filter out the "distance" attribute in the XML on
>>>> the v1.2ompi branch? We're not using it (yet) in OMPI.
>>>>
>>>> On Nov 2, 2011, at 9:32 AM, Brice Goglin wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The v1.2 branch has known problems with distance matrices when
>>>>> the topology is asymmetric (especially when Linux cpusets make
>>>>> some NUMA nodes CPU-less). This is what causes the wrong
>>>>> relative_depth here. It can even be negative in some cases, which
>>>>> is obviously wrong.
>>>>>
>>>>> This should be fixed in v1.3 but it's NOT easy to backport to
>>>>> v1.2. Can you check that you can export and reimport with v1.3
>>>>> properly? I will see if I can find a workaround for v1.2, but it
>>>>> will likely be something like ignoring distance matrices if
>>>>> reldepth is <= 0.
>>>>>
>>>>> In the meantime, you can remove "&& reldepth" from the "if" line
>>>>> below. It may help.
>>>>>
>>>>> Brice
>>>>>
>>>>>
>>>>> On 02/11/2011 13:42, Jeff Squyres (jsquyres) wrote:
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> Brad mentioned you might be able to help me with an OMPI hwloc
>>>>>>>> issue I'm having.
>>>>>>>>
>>>>>>>> It's occurring on a Power 5 RHEL 6.0 machine and is related to
>>>>>>>> the XML representation of the topology. I've attached the XML
>>>>>>>> to this email. The problem only occurs on the trunk code.
>>>>>>>>
>>>>>>>> The part which appears to be the problem is this:
>>>>>>>>
>>>>>>>> <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>>   <latency value="1.000000"/>
>>>>>>>> </distances>
>>>>>>>>
>>>>>>>> specifically with relative_depth having a value of 0, but
>>>>>>>> still having latency children. In
>>>>>>>> hwloc__xml_import_distances in topology-xml.c there's a check
>>>>>>>> that assumes there is no latency information in that case.
>>>>>>>>
>>>>>>>> Around line 634 in topology-xml.c:
>>>>>>>>
>>>>>>>> if (nbobjs && reldepth && latbase) {
>>>>>>>>     ... process latency xml nodes
>>>>>>>> }
>>>>>>>>
>>>>>>>> return hwloc__xml_import_close_tag(state);
>>>>>>>>
>>>>>>>> The hwloc__xml_import_close_tag function returns a failure
>>>>>>>> because the latency nodes have not been processed yet.
>>>>>>>>
>>>>>>>> I had a look in orted where the XML is created and it does
>>>>>>>> look like the XML is being assembled correctly as per the
>>>>>>>> topology information it has retrieved (though I don't know if
>>>>>>>> that itself is correct). The hwloc__xml_export_object function
>>>>>>>> will quite happily create distance information if the relative
>>>>>>>> depth is 0 even though hwloc__xml_import_distances will not be
>>>>>>>> able to parse it.
>>>>>>>>
>>>>>>>> So there is at least a problem that the topology code will
>>>>>>>> create XML that it can't parse, but I don't know enough about
>>>>>>>> the hwloc library to know if relative depth should always be
>>>>>>>> positive. I suspect it's the former which is the problem, not
>>>>>>>> the latter, but I don't know for sure...
>>>>>>>>
>>>>>>>> If it helps, this is the output of lstopo on the machine:
>>>>>>>>
>>>>>>>> cyeoh@p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
>>>>>>>> Machine (2048MB)
>>>>>>>>   NUMANode L#0 (P#0 512MB)
>>>>>>>>     Socket L#0 + L1 L#0 (32KB) + Core L#0
>>>>>>>>       PU L#0 (P#0)
>>>>>>>>       PU L#1 (P#1)
>>>>>>>>     Socket L#1 + L1 L#1 (32KB) + Core L#1
>>>>>>>>       PU L#2 (P#2)
>>>>>>>>       PU L#3 (P#3)
>>>>>>>>   NUMANode L#1 (P#1 640MB)
>>>>>>>>   NUMANode L#2 (P#2 512MB)
>>>>>>>>   NUMANode L#3 (P#3 384MB)
>>>>> _______________________________________________
>>>>> hwloc-devel mailing list
>>>>> hwloc-de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>
>>>
>>> <ignore_invalid_reldepth.patch>
>>
>>
>
>
> --
> cy...@au.ibm.com

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
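
For readers following the thread, the mismatch Chris describes can be modelled with a small sketch. This is not hwloc's code: the struct and the toy_* functions are hypothetical stand-ins, and only the shape of the "nbobjs && reldepth && latbase" check and the "ignore matrices if reldepth is <= 0" idea come from the messages above. The actual patch is not shown in the thread, so the export-side guard here is an assumption about what such a filter might look like.

```c
/* Hypothetical stand-in for a parsed <distances> element (not hwloc's types). */
struct toy_distances {
    unsigned nbobjs;   /* nbobjs attribute */
    int reldepth;      /* relative_depth attribute */
    float latbase;     /* latency_base attribute */
    unsigned pending;  /* <latency> children not yet consumed */
};

/* Models the close-tag step: fails (-1) if child nodes were left unconsumed. */
int toy_close_tag(const struct toy_distances *d)
{
    return d->pending ? -1 : 0;
}

/* Models the import check quoted in the thread. With require_reldepth set,
 * a relative_depth of 0 skips the latency children, so the close fails.
 * With it cleared (the "remove && reldepth" suggestion), they are consumed. */
int toy_import(struct toy_distances d, int require_reldepth)
{
    int ok = d.nbobjs != 0 && d.latbase != 0.0f;
    if (require_reldepth)
        ok = ok && d.reldepth != 0;
    if (ok)
        d.pending = 0; /* pretend every <latency> child was processed */
    return toy_close_tag(&d);
}

/* Assumed export-side filter: only emit a matrix whose relative depth is
 * strictly positive, so nothing unparseable is ever written. */
int toy_should_export(const struct toy_distances *d)
{
    return d->nbobjs > 0 && d->reldepth > 0;
}
```

With the values from the report (nbobjs 4, relative_depth 0, latency_base 10, 16 latency children), toy_import fails when the reldepth test is required and succeeds when it is dropped, while toy_should_export refuses to write the matrix in the first place.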