Thanks!

On Sep 24, 2011, at 2:18 PM, Brice Goglin wrote:

> I fixed one parsing bug in commit 3660 on the v1.2-ompi branch. Things
> should work better now.
>
> Parsing XML distance matrices was broken when the XML file came from the
> no-libxml exporter. That's why you had problems on your dual-amd machine
> (those have distance matrices) and not on your mac (single processor, no
> distances, no bug).
>
> The v1.2 branch doesn't report parsing failure well, so it just crashed.
> Trunk exits with an error instead of crashing.
>
> Brice
>
>
> On 24/09/2011 20:37, Ralph Castain wrote:
>> Yep, it fails. Runs on my Mac, but not under Linux.
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> (gdb) where
>> #0 0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #1 0x00002aaaaacdc060 in hwloc_bitmap_asprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #2 0x00002aaaaacd9b34 in hwloc__xml_export_object () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #3 0x00002aaaaacda325 in hwloc___nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #4 0x00002aaaaacda39c in hwloc__nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #5 0x00002aaaaacda4be in hwloc_topology_export_xmlbuffer () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
>> #6 0x00000000004009b8 in main () at xmlbuffer.c:31
>>
>> On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:
>>
>>> Indeed, this object contains invalid pointers.
>>>
>>> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
>>> export+import+export+compare on the same machine. It would be good to
>>> know if it fails on one of the machines you're using here.
>>>
>>> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837&format=txt
>>>
>>> thanks
>>> Brice
>>>
>>>
>>> On 24/09/2011 17:07, Ralph Castain wrote:
>>>> FWIW: I tried just printing out the contents of that root object
>>>> immediately after importing the xml, and it clearly has a problem:
>>>>
>>>> (gdb) print *obj
>>>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0, name = 0x101 <Address 0x101 out of bounds>, memory = {
>>>> total_memory = 46912502995240, local_memory = 46912502995240, page_types_len = 0, page_types = 0x0}, attr = 0x2,
>>>> depth = 6900112, logical_index = 0, os_level = 6571424, next_cousin = 0x0, prev_cousin = 0xffffffff, parent = 0x0,
>>>> sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145, children = 0x2aaaab139738,
>>>> first_child = 0x2aaaab139738, last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0,
>>>> online_cpuset = 0x644700, allowed_cpuset = 0x691970, nodeset = 0x6919e0, complete_nodeset = 0x644c90,
>>>> allowed_nodeset = 0x644cb0, distances = 0x6948b0, distances_count = 6900000, infos = 0x0, infos_count = 0}
>>>>
>>>>
>>>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>>>>
>>>>> Here's the trace:
>>>>>
>>>>> #0 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28) at topology-xml.c:1094
>>>>> #1 0x00002aaaaae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10, xmlbuffer=0x698a70 "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n <object type=\"Unknown\" os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"..., buflen=16384) at topology-xml.c:1193
>>>>> #2 0x00002aaaaae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10, bufferp=0x7fffffffd988, buflenp=0x7fffffffd97c) at topology-xml.c:1207
>>>>> #3 0x00002aaaaae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10, xmlbuffer=0x7fffffffd988, buflen=0x7fffffffd97c) at topology-xml.c:1281
>>>>> #4 0x00002aaaaae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0, type=22 '\026') at base/hwloc_base_dt.c:183
>>>>> #5 0x00002aaaaadf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0, type=22 '\026') at dss/dss_compare.c:39
>>>>> #6 0x00002aaaaad9b5f7 in process_orted_launch_report (fd=-1, event=1, data=0x6444d0) at base/plm_base_launch_support.c:564
>>>>> #7 0x00002aaaaae3881f in event_process_active_single_queue (base=0x60dd60, activeq=0x6111e0) at event.c:1329
>>>>> #8 0x00002aaaaae38c71 in event_process_active (base=0x60dd60) at event.c:1396
>>>>> #9 0x00002aaaaae3902b in opal_libevent2012_event_base_loop (base=0x60dd60, flags=1) at event.c:1598
>>>>> #10 0x00002aaaaadf080d in opal_progress () at runtime/opal_progress.c:189
>>>>> #11 0x00002aaaaad9bbfa in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:666
>>>>> #12 0x00002aaaaada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
>>>>> #13 0x0000000000403822 in orterun (argc=4, argv=0x7fffffffe1d8) at orterun.c:817
>>>>> #14 0x0000000000402aa3 in main (argc=4, argv=0x7fffffffe1d8) at main.c:13
>>>>>
>>>>> And the error report
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890, topology=0x695f10, obj=0x2aaaab139b28) at topology-xml.c:1094
>>>>> 1094 sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
>>>>> (gdb) print obj
>>>>> $1 = (opal_hwloc122_hwloc_obj_t) 0x2aaaab139b28
>>>>> (gdb) print *obj
>>>>> $2 = {type = 2870188824, os_index = 10922, name = 0x2aaaab139b18 "\b\233\023\253\252*", memory = {total_memory = 6579376,
>>>>> local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}, attr = 0x2aaaab139b48,
>>>>> depth = 2870188872, logical_index = 10922, os_level = -1424778408, next_cousin = 0x2aaaab139b58,
>>>>> prev_cousin = 0x2aaaab139b68, parent = 0x2aaaab139b68, sibling_rank = 2870188920, next_sibling = 0x2aaaab139b78,
>>>>> prev_sibling = 0x2aaaab139b88, arity = 2870188936, children = 0x2aaaab139b98, first_child = 0x2aaaab139b98,
>>>>> last_child = 0x2aaaab139ba8, userdata = 0x2aaaab139ba8, cpuset = 0x2aaaab139bb8, complete_cpuset = 0x2aaaab139bb8,
>>>>> online_cpuset = 0x2aaaab139bc8, allowed_cpuset = 0x2aaaab139bc8, nodeset = 0x2aaaab139bd8,
>>>>> complete_nodeset = 0x2aaaab139bd8, allowed_nodeset = 0x2aaaab139be8, distances = 0x2aaaab139be8,
>>>>> distances_count = 2870189048, infos = 0x2aaaab139bf8, infos_count = 2870189064}
>>>>> (gdb) print obj->memory
>>>>> $3 = {total_memory = 6579376, local_memory = 6579376, page_types_len = 2870188856, page_types = 0x2aaaab139b38}
>>>>> (gdb) print obj->memory.page_types
>>>>> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2aaaab139b38
>>>>> (gdb) print i
>>>>> $5 = 1612
>>>>> (gdb) print obj->memory.page_types[1600]
>>>>> $6 = {size = 0, count = 0}
>>>>> (gdb) print obj->memory.page_types[1612]
>>>>> Cannot access memory at address 0x2aaaab13fff8
>>>>> (gdb) print obj->memory.page_types[1611]
>>>>> $7 = {size = 0, count = 0}
>>>>> (gdb)
>>>>>
>>>>>
>>>>> The whole obj looks like trash to me. I looked a little more - the object
>>>>> referenced is the root object:
>>>>>
>>>>> 1193 hwloc__xml_export_object (&output, topology, hwloc_get_root_obj(topology));
>>>>>
>>>>> I'm continuing to look in case I'm doing something stupid, but the code
>>>>> is pretty linear here - unpack, import, export for compare.
>>>>>
>>>>>
>>>>> On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:
>>>>>
>>>>>> Here's some feedback from Ralph -- any idea what's going wrong here?
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> 1. I export a topology into xml using
>>>>>>
>>>>>>     hwloc_topology_export_xmlbuffer(t, &xmlbuffer, &len);
>>>>>>
>>>>>> I then pack and send the string.
>>>>>>
>>>>>> 2. I unpack the string on the other end and import it into a topology
>>>>>>
>>>>>>     hwloc_topology_init(&t);
>>>>>>     if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer)))) {
>>>>>>         hwloc_topology_destroy(t);
>>>>>>         goto cleanup;
>>>>>>     }
>>>>>>     hwloc_topology_load(t);
>>>>>>
>>>>>> 3. I then need to compare two topologies, so I export the topology I
>>>>>> received into another xml string
>>>>>>
>>>>>>     hwloc_topology_export_xmlbuffer(t1, &x1, &l1);
>>>>>>
>>>>>> It is this export that fails, which implies to me that somehow the
>>>>>> import didn't work right. Note that this code worked fine with libxml2,
>>>>>> so this is a regression.
>>>>>>
>>>>>>
>>>>>> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
>>>>>>
>>>>>>> Yes, I can get some testing of the ompi branch pretty quickly. I can
>>>>>>> bring in a new copy of this later today and see what we can see.
>>>>>>>
>>>>>>> Many thanks!
>>>>>>>
>>>>>>>
>>>>>>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>>>>>>>
>>>>>>>> I pushed the new minimalistic XML import/export implementation without
>>>>>>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>>>>>>> by default. --disable-libxml2 or some env variables can be used to
>>>>>>>> force the minimalistic implementation if needed.
>>>>>>>> The minimalistic implementation
>>>>>>>> is only guaranteed to import XML files that were generated by hwloc
>>>>>>>> (even if libxml was enabled there).
>>>>>>>>
>>>>>>>> I also backported most of this to the new v1.2-ompi branch (required to
>>>>>>>> backport some other XML cleanups from trunk). This branch will now serve
>>>>>>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>>>>>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>>>>>>> MPI does not embed enough to run hwloc's make check).
>>>>>>>>
>>>>>>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>>>>>>> Should I wait for their feedback before merging the nolibxml branch into
>>>>>>>> the trunk? I'd like to merge this in v1.3 too (and basically release rc2
>>>>>>>> as the actual first feature-complete RC), so getting feedback early
>>>>>>>> might be appreciated.
>>>>>>>>
>>>>>>>> Brice
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> hwloc-devel mailing list
>>>>>>>> [email protected]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> [email protected]
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
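
The round trip discussed in this thread (export, import, export, compare) ends by comparing two exported XML buffers. Below is a minimal sketch of that last step in C, assuming a plain byte-wise comparison of the two strings is sufficient. The helper name xml_buffers_equal is hypothetical; it is not part of hwloc or Open MPI, and the real opal_hwloc_compare may well do something different.

```c
#include <string.h>

/*
 * Hypothetical helper (not an hwloc or Open MPI function).
 * Returns 1 when the two XML buffers have the same length and the
 * same bytes, 0 otherwise. The lengths are the values reported by
 * the exporter, passed here as plain ints.
 */
int xml_buffers_equal(const char *x1, int l1, const char *x2, int l2)
{
    /* A missing buffer or a negative length never compares equal. */
    if (x1 == NULL || x2 == NULL || l1 < 0 || l2 < 0)
        return 0;

    /* Different lengths means the exports differ. */
    if (l1 != l2)
        return 0;

    /* Same length: compare the bytes. */
    return memcmp(x1, x2, (size_t)l1) == 0;
}
```

In the scenario from the thread, x1/l1 and x2/l2 would be the buffers and lengths filled in by two hwloc_topology_export_xmlbuffer calls. A comparison like this only makes sense once both exports succeed, which is exactly the step that crashed here.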
