Yep, it fails. Runs on my Mac, but not under Linux.

Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
(gdb) where
#0  0x00002aaaaacdbedd in hwloc_bitmap_snprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#1  0x00002aaaaacdc060 in hwloc_bitmap_asprintf () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#2  0x00002aaaaacd9b34 in hwloc__xml_export_object () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#3  0x00002aaaaacda325 in hwloc___nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#4  0x00002aaaaacda39c in hwloc__nolibxml_prepare_export () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#5  0x00002aaaaacda4be in hwloc_topology_export_xmlbuffer () from /nfs/rinfs/san/homedirs/rhc/lib/libhwloc.so.3
#6  0x00000000004009b8 in main () at xmlbuffer.c:31
On Sep 24, 2011, at 9:45 AM, Brice Goglin wrote:

> Indeed, this object contains invalid pointers.
>
> Can you try to run tests/xmlbuffer.c from hwloc's tree? It does
> export+import+export+compare on the same machine. It would be good to
> know if it fails on one of the machines you're using here.
>
> https://svn.open-mpi.org/trac/hwloc/browser/branches/v1.2-ompi/tests/xmlbuffer.c?rev=3837&format=txt
>
> thanks
> Brice
>
> On 24/09/2011 17:07, Ralph Castain wrote:
>> FWIW: I tried just printing out the contents of that root object
>> immediately after importing the xml, and it clearly has a problem:
>>
>> (gdb) print *obj
>> $2 = {type = OPAL_HWLOC122_hwloc_OBJ_SYSTEM, os_index = 0,
>>      name = 0x101 <Address 0x101 out of bounds>,
>>      memory = {total_memory = 46912502995240, local_memory = 46912502995240,
>>        page_types_len = 0, page_types = 0x0},
>>      attr = 0x2, depth = 6900112, logical_index = 0, os_level = 6571424,
>>      next_cousin = 0x0, prev_cousin = 0xffffffff, parent = 0x0,
>>      sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 145,
>>      children = 0x2aaaab139738, first_child = 0x2aaaab139738,
>>      last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0,
>>      online_cpuset = 0x644700, allowed_cpuset = 0x691970,
>>      nodeset = 0x6919e0, complete_nodeset = 0x644c90,
>>      allowed_nodeset = 0x644cb0, distances = 0x6948b0,
>>      distances_count = 6900000, infos = 0x0, infos_count = 0}
>>
>> On Sep 24, 2011, at 9:02 AM, Ralph Castain wrote:
>>
>>> Here's the trace:
>>>
>>> #0  0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890,
>>>     topology=0x695f10, obj=0x2aaaab139b28) at topology-xml.c:1094
>>> #1  0x00002aaaaae61b69 in hwloc___nolibxml_prepare_export (topology=0x695f10,
>>>     xmlbuffer=0x698a70 "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n <object type=\"Unknown\" os_level=\"-1424778408\" os_index=\"10922\" cpuset=\"0xf...f\" complete_cpuset=\"0xf...f\" onl"...,
>>>     buflen=16384) at topology-xml.c:1193
>>> #2  0x00002aaaaae61be0 in hwloc__nolibxml_prepare_export (topology=0x695f10,
>>>     bufferp=0x7fffffffd988, buflenp=0x7fffffffd97c) at topology-xml.c:1207
>>> #3  0x00002aaaaae61d02 in opal_hwloc122_hwloc_topology_export_xmlbuffer (topology=0x695f10,
>>>     xmlbuffer=0x7fffffffd988, buflen=0x7fffffffd97c) at topology-xml.c:1281
>>> #4  0x00002aaaaae529f4 in opal_hwloc_compare (topo1=0x695f10, topo2=0x6915c0,
>>>     type=22 '\026') at base/hwloc_base_dt.c:183
>>> #5  0x00002aaaaadf348c in opal_dss_compare (value1=0x695f10, value2=0x6915c0,
>>>     type=22 '\026') at dss/dss_compare.c:39
>>> #6  0x00002aaaaad9b5f7 in process_orted_launch_report (fd=-1, event=1,
>>>     data=0x6444d0) at base/plm_base_launch_support.c:564
>>> #7  0x00002aaaaae3881f in event_process_active_single_queue (base=0x60dd60,
>>>     activeq=0x6111e0) at event.c:1329
>>> #8  0x00002aaaaae38c71 in event_process_active (base=0x60dd60) at event.c:1396
>>> #9  0x00002aaaaae3902b in opal_libevent2012_event_base_loop (base=0x60dd60,
>>>     flags=1) at event.c:1598
>>> #10 0x00002aaaaadf080d in opal_progress () at runtime/opal_progress.c:189
>>> #11 0x00002aaaaad9bbfa in orte_plm_base_daemon_callback (num_daemons=2)
>>>     at base/plm_base_launch_support.c:666
>>> #12 0x00002aaaaada49e1 in plm_slurm_launch_job (jdata=0x67a500) at plm_slurm_module.c:404
>>> #13 0x0000000000403822 in orterun (argc=4, argv=0x7fffffffe1d8) at orterun.c:817
>>> #14 0x0000000000402aa3 in main (argc=4, argv=0x7fffffffe1d8) at main.c:13
>>>
>>> And the error report:
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00002aaaaae61737 in hwloc__xml_export_object (output=0x7fffffffd890,
>>>     topology=0x695f10, obj=0x2aaaab139b28) at topology-xml.c:1094
>>> 1094        sprintf(tmp, "%llu", (unsigned long long) obj->memory.page_types[i].count);
>>> (gdb) print obj
>>> $1 = (opal_hwloc122_hwloc_obj_t) 0x2aaaab139b28
>>> (gdb) print *obj
>>> $2 = {type = 2870188824, os_index = 10922,
>>>      name = 0x2aaaab139b18 "\b\233\023\253\252*",
>>>      memory = {total_memory = 6579376, local_memory = 6579376,
>>>        page_types_len = 2870188856, page_types = 0x2aaaab139b38},
>>>      attr = 0x2aaaab139b48, depth = 2870188872, logical_index = 10922,
>>>      os_level = -1424778408, next_cousin = 0x2aaaab139b58,
>>>      prev_cousin = 0x2aaaab139b68, parent = 0x2aaaab139b68,
>>>      sibling_rank = 2870188920, next_sibling = 0x2aaaab139b78,
>>>      prev_sibling = 0x2aaaab139b88, arity = 2870188936,
>>>      children = 0x2aaaab139b98, first_child = 0x2aaaab139b98,
>>>      last_child = 0x2aaaab139ba8, userdata = 0x2aaaab139ba8,
>>>      cpuset = 0x2aaaab139bb8, complete_cpuset = 0x2aaaab139bb8,
>>>      online_cpuset = 0x2aaaab139bc8, allowed_cpuset = 0x2aaaab139bc8,
>>>      nodeset = 0x2aaaab139bd8, complete_nodeset = 0x2aaaab139bd8,
>>>      allowed_nodeset = 0x2aaaab139be8, distances = 0x2aaaab139be8,
>>>      distances_count = 2870189048, infos = 0x2aaaab139bf8,
>>>      infos_count = 2870189064}
>>> (gdb) print obj->memory
>>> $3 = {total_memory = 6579376, local_memory = 6579376,
>>>      page_types_len = 2870188856, page_types = 0x2aaaab139b38}
>>> (gdb) print obj->memory.page_types
>>> $4 = (struct opal_hwloc122_hwloc_obj_memory_page_type_s *) 0x2aaaab139b38
>>> (gdb) print i
>>> $5 = 1612
>>> (gdb) print obj->memory.page_types[1600]
>>> $6 = {size = 0, count = 0}
>>> (gdb) print obj->memory.page_types[1612]
>>> Cannot access memory at address 0x2aaaab13fff8
>>> (gdb) print obj->memory.page_types[1611]
>>> $7 = {size = 0, count = 0}
>>> (gdb)
>>>
>>> The whole obj looks like trash to me.
>>> I looked a little more - the object referenced is the root object:
>>>
>>> 1193        hwloc__xml_export_object (&output, topology, hwloc_get_root_obj(topology));
>>>
>>> I'm continuing to look in case I'm doing something stupid, but the code is
>>> pretty linear here - unpack, import, export for compare.
>>>
>>> On Sep 24, 2011, at 8:59 AM, Jeff Squyres wrote:
>>>
>>>> Here's some feedback from Ralph -- any idea what's going wrong here?
>>>>
>>>> -----
>>>>
>>>> 1. I export a topology into xml using
>>>>
>>>>    hwloc_topology_export_xmlbuffer(t, &xmlbuffer, &len);
>>>>
>>>> I then pack and send the string.
>>>>
>>>> 2. I unpack the string on the other end and import it into a topology:
>>>>
>>>>    hwloc_topology_init(&t);
>>>>    if (0 != (rc = hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer)))) {
>>>>        hwloc_topology_destroy(t);
>>>>        goto cleanup;
>>>>    }
>>>>    hwloc_topology_load(t);
>>>>
>>>> 3. I then need to compare two topologies, so I export the topology I
>>>> received into another xml string:
>>>>
>>>>    hwloc_topology_export_xmlbuffer(t1, &x1, &l1);
>>>>
>>>> It is this export that fails, which implies to me that somehow the import
>>>> didn't work right. Note that this code worked fine with libxml2, so this
>>>> is a regression.
>>>>
>>>> On Sep 22, 2011, at 9:39 AM, Jeff Squyres wrote:
>>>>
>>>>> Yes, I can get some testing of the ompi branch pretty quickly. I can
>>>>> bring in a new copy of this later today and see what we can see.
>>>>>
>>>>> Many thanks!
>>>>>
>>>>> On Sep 19, 2011, at 9:05 AM, Brice Goglin wrote:
>>>>>
>>>>>> I pushed the new minimalistic XML import/export implementation without
>>>>>> libxml2 to the nolibxml branch. If libxml2 is available, it's still used
>>>>>> by default. --disable-libxml2 or some env variables can be used to
>>>>>> force the minimalistic implementation if needed.
>>>>>> The minimalistic implementation
>>>>>> is only guaranteed to import XML files that were generated by hwloc
>>>>>> (even if libxml was enabled there).
>>>>>>
>>>>>> I also backported most of this to the new v1.2-ompi branch (required to
>>>>>> backport some other XML cleanups from trunk). This branch will now serve
>>>>>> as a base for Open MPI's embedded hwloc. The idea is to have a complete
>>>>>> v1.2 + nolibxml somewhere so that we can at least run make check (Open
>>>>>> MPI does not embed enough to run hwloc's make check).
>>>>>>
>>>>>> How do we proceed now? Can we have the OMPI guys test the new code soon?
>>>>>> Should I wait for their feedback before merging the nolibxml branch into
>>>>>> the trunk? I'd like to merge this into v1.3 too (and basically release rc2
>>>>>> as the actual first feature-complete RC), so getting feedback early
>>>>>> would be appreciated.
>>>>>>
>>>>>> Brice
>>>>>>
>>>>>> _______________________________________________
>>>>>> hwloc-devel mailing list
>>>>>> [email protected]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> [email protected]
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
