Hello,

When you assemble multiple nodes' topologies into a single one, the resulting topology cannot be used for binding. Binding is only possible with objects/cpusets that correspond to the current node, and the assembled topology contains many objects that cannot be used for binding: objects that span multiple nodes, and objects that come from nodes other than the one where the current process is running. Open MPI does not support these cases, hence the crash.
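If the goal is just to let each node load its own repaired topology, you don't need the assembly at all: point HWLOC_XMLFILE at the per-node file on each host instead. A minimal sketch of a wrapper script, assuming your fixed XMLs keep the <hostname>_fixed.xml naming from your mail and sit at the same path on every node (the wrapper name is made up for illustration):

  #!/bin/sh
  # with-local-xml.sh -- hypothetical per-host wrapper.
  # Each host loads the fixed XML matching its own hostname, so hwloc
  # only ever sees objects the local process can actually bind to.
  export HWLOC_XMLFILE="$HOME/MPI/$(hostname)_fixed.xml"
  exec "$@"

You would then launch through it, e.g. mpirun -np 44 ./with-local-xml.sh python testmpi.py. Note that mpirun reads HWLOC_XMLFILE too (that is where your backtrace comes from), so make sure it sees the local file rather than combo.xml.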
I see that the individual XML files worked fine, so why did you try assembling them? By the way, the ability to assemble multiple topologies will be removed in hwloc 2.0.

Brice


On 30/10/2015 02:13, Andrej Prsa wrote:
> Hi all,
>
> I have a 6-node cluster with the buggy L3 H8QG6 AMD boards. Brice
> Goglin recently provided a fix to Fabian Wein and I applied the same
> fix (by diffing Fabian's original and Brice's fixed XML and then
> incorporating the equivalent changes to our XML). It did the trick
> perfectly, using openmpi-1.10.0 and hwloc 1.11.1. I then proceeded to
> produce a patched XML for each node in our cluster.
>
> The problem arises when I try to combine the XMLs. To test the assembly
> of just two XMLs, I ran:
>
> hwloc-assembler combo.xml \
>   --name clusty clusty_fixed.xml \
>   --name node1 node1_fixed.xml
>
> I then set HWLOC_XMLFILE to combo.xml and, when trying to mpirun a test
> program on either of the two nodes, I get a segfault:
>
> andrej@clusty:~/MPI$ mpirun -np 44 python testmpi.py
> [clusty:19136] *** Process received signal ***
> [clusty:19136] Signal: Segmentation fault (11)
> [clusty:19136] Signal code: Address not mapped (1)
> [clusty:19136] Failing at address: (nil)
> [clusty:19136] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fdf37f38340]
> [clusty:19136] [ 1] /usr/local/hwloc/lib/libhwloc.so.5(hwloc_bitmap_and+0x17)[0x7fdf37934e77]
> [clusty:19136] [ 2] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_filter_cpus+0x37c)[0x7fdf381b239c]
> [clusty:19136] [ 3] /opt/openmpi-1.10.0/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xcb)[0x7fdf381b412b]
> [clusty:19136] [ 4] /opt/openmpi-1.10.0/lib/openmpi/mca_ess_hnp.so(+0x47ea)[0x7fdf35c1c7ea]
> [clusty:19136] [ 5] /opt/openmpi-1.10.0/lib/libopen-rte.so.12(orte_init+0x168)[0x7fdf384062b8]
> [clusty:19136] [ 6] mpirun[0x404497]
> [clusty:19136] [ 7] mpirun[0x40363d]
> [clusty:19136] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fdf37b81ec5]
> [clusty:19136] [ 9] mpirun[0x403559]
> [clusty:19136] *** End of error message ***
> Segmentation fault (core dumped)
>
> Each individual XML file works (i.e. no hwloc complaints and mpirun
> works perfectly), but the assembled one does not. I'm attaching all
> three XMLs: clusty.xml, node1.xml and combo.xml. Any ideas?
>
> Thanks,
> Andrej
>
>
> _______________________________________________
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-users/2015/10/1214.php