You might also want to check the BIOS rev level on node14, Gus - as Brice suggested, it could be that the board came with the wrong firmware.
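For what it's worth, a quick way to do that comparison (a sketch, assuming dmidecode is installed on your CentOS nodes and you run it as root) is:

    # Print the firmware revision and build date; run on node14 and on a
    # known-good node and compare the output.
    dmidecode -s bios-version
    dmidecode -s bios-release-date

If node14 reports a different revision than its twin nodes, that would support the wrong-firmware theory.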
On Feb 28, 2014, at 11:55 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Brice and Ralph
>
> Many thanks for helping out with this!
>
> Yes, you are right about node15 being OK.
> Node15 was a red herring; along with node14, it was part of
> the same job that failed.
> However, after a closer look, I noticed that the failure reported
> by hwloc was indeed on node14.
>
> I attach both diagnostic files generated by hwloc-gather-topology on
> node14.
>
> I will open the node and see if there is anything unusual with the
> hardware, and perhaps reinstall the OS, as Ralph suggested.
> It is odd that the other node that had its motherboard replaced
> passes the hwloc-gather-topology test.
> After the motherboard replacement I reinstalled the OS on both,
> but it doesn't hurt to do it again.
>
> Gus Correa
>
> On 02/28/2014 03:26 AM, Brice Goglin wrote:
>> Hello Gus,
>> I'll need the tarball generated by gather-topology on node14 to debug
>> this. node15 doesn't have any issue.
>> We've seen issues on AMD machines because of buggy BIOSes reporting
>> incompatible Socket and NUMA info. If node14 doesn't have the same BIOS
>> version as the other nodes, that could explain things.
>> Brice
>>
>> On 28/02/2014 01:39, Gus Correa wrote:
>>> Thank you, Ralph!
>>>
>>> I did a bit more homework and found out that all jobs that had
>>> the hwloc error involved one specific node (node14).
>>>
>>> The "report bindings" output in those jobs' stderr shows
>>> that node14 systematically failed to bind the processes to the cores,
>>> while other nodes in the same jobs didn't fail.
>>> Interestingly, the jobs continued to run, although they
>>> eventually failed, but much later.
>>> So, the hwloc error doesn't seem to stop the job in its tracks.
>>> As a matter of policy, should it perhaps shut down the job instead?
>>>
>>> In addition, when I try the hwloc-gather-topology diagnostic on node14
>>> I get the same error, a bit more verbose (see below).
>>> So, now my guess is that this may be a hardware problem on that node.
>>>
>>> I replaced two nodes' motherboards last week, including node14's,
>>> and something may have gone wrong on that one.
>>> The other node that had its motherboard replaced
>>> doesn't show the hwloc-gather-topology error, though.
>>>
>>> Does the error message below (Socket P#0 ...)
>>> suggest anything that I should be looking for on the hardware side?
>>> (Thermal compound on the heatsink, memory modules, etc.)
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> [root@node14 ~]# /usr/bin/hwloc-gather-topology /tmp/$(uname -n)
>>> Hierarchy gathered in /tmp/node14.tar.bz2 and kept in
>>> /tmp/tmp.D46Sdhcnru/node14/
>>> ****************************************************************************
>>> * Hwloc has encountered what looks like an error from the operating
>>> * system.
>>> *
>>> * object (Socket P#0 cpuset 0x0000ffff) intersection without inclusion!
>>> * Error occurred in topology.c line 718
>>> *
>>> * Please report this error message to the hwloc user's mailing list,
>>> * along with the output from the hwloc-gather-topology.sh script.
>>> ****************************************************************************
>>> Expected topology output stored in /tmp/node14.output
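Since Brice mentioned buggy BIOSes reporting inconsistent Socket/NUMA info, one cross-check worth trying (a sketch, assuming numactl and the CentOS hwloc package are installed; "node13" below just stands in for any known-good node) is to compare what the kernel and hwloc see on node14 against a healthy node:

    # NUMA layout as the kernel reports it: node count and which CPUs
    # belong to each node.
    numactl --hardware

    # Export the hwloc topology to XML (lstopo picks the format from the
    # .xml extension) and diff it against a good node's export.
    lstopo /tmp/$(uname -n).xml
    diff /tmp/node14.xml /tmp/node13.xml

If node14's socket/NUMA mapping differs from its twins while the hardware is identical, that points at firmware rather than at the OS install.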
>>> On 02/27/2014 06:39 PM, Ralph Castain wrote:
>>>> The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having
>>>> trouble with those data/instruction cache breakdowns.
>>>> I don't know why it wouldn't have shown up before,
>>>> however, as this looks to be happening when we first try to
>>>> assemble the topology. To check that, what happens if you just run
>>>> "mpiexec hostname" on the local node?
>>>>
>>>> On Feb 27, 2014, at 3:04 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>>
>>>>> Dear OMPI pros
>>>>>
>>>>> This seems to be a question in the no-man's-land between OMPI and hwloc.
>>>>> However, it appeared as an OMPI error, hence it may be OK to ask the
>>>>> question on this list.
>>>>>
>>>>> ***
>>>>>
>>>>> A user here got this error (or warning?) message today:
>>>>>
>>>>> + mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>>>> ****************************************************************************
>>>>> * Hwloc has encountered what looks like an error from the operating
>>>>> * system.
>>>>> *
>>>>> * object intersection without inclusion!
>>>>> * Error occurred in topology.c line 594
>>>>> *
>>>>> * Please report this error message to the hwloc user's mailing list,
>>>>> * along with the output from the hwloc-gather-topology.sh script.
>>>>> ****************************************************************************
>>>>>
>>>>> Additional info:
>>>>>
>>>>> 1) We have OMPI 1.6.5. This user is using the build made
>>>>> with Intel compilers 2011.13.367.
>>>>>
>>>>> 2) I set these MCA parameters in $OMPI/etc/openmpi-mca-params.conf
>>>>> (including binding to core):
>>>>>
>>>>> btl = ^tcp
>>>>> orte_tag_output = 1
>>>>> rmaps_base_schedule_policy = core
>>>>> orte_process_binding = core
>>>>> orte_report_bindings = 1
>>>>> opal_paffinity_alone = 1
>>>>>
>>>>> 3) The machines have dual-socket 16-core AMD Opteron 6376 (Abu Dhabi)
>>>>> processors, which have one FPU for each pair of cores, a hierarchy
>>>>> of caches serving sub-groups of cores, etc.
>>>>> The OS is Linux CentOS 6.4 with the stock CentOS OFED.
>>>>> The interconnect is Infiniband QDR (Mellanox HW).
>>>>>
>>>>> 4) We have Torque 4.2.5, built with cpuset support.
>>>>> OMPI is built with Torque (tm) support.
>>>>>
>>>>> 5) In case it helps, I attach the output of
>>>>> hwloc-gather-topology, which I ran on the node that threw the error,
>>>>> although not immediately after the job failure.
>>>>> I used the hwloc-gather-topology script that comes with
>>>>> the hwloc (version 1.5) provided by CentOS.
>>>>> As far as I can tell, the hwloc nuts and bolts built into OMPI
>>>>> do not include the hwloc-gather-topology script (although it may be
>>>>> a newer hwloc version, 1.8 perhaps?).
>>>>> Hopefully the mail servers won't chop off the attachments.
>>>>>
>>>>> 6) I am a bit surprised by this error message, because I haven't
>>>>> seen it before, although we have used OMPI 1.6.5 on
>>>>> this machine with several other programs without problems.
>>>>> Alas, it happened now.
>>>>>
>>>>> **
>>>>>
>>>>> - Is this a known hwloc problem on this processor architecture?
>>>>>
>>>>> - Is this a known issue with this combination of HW and SW?
>>>>>
>>>>> - Would not binding the MPI processes (to core or socket) perhaps
>>>>> help?
>>>>>
>>>>> - Any workarounds or suggestions?
>>>>>
>>>>> **
>>>>>
>>>>> Thank you,
>>>>> Gus Correa
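Following up on Ralph's suggestion, the smoke test is trivial to run (assuming the OMPI 1.6.5 mpiexec is first in PATH on node14; with orte_report_bindings already set in your openmpi-mca-params.conf, the bindings will be printed too):

    # Launch a non-MPI program locally on node14; if hwloc cannot assemble
    # the topology, the error box should appear before hostname produces
    # any output.
    mpiexec -np 4 hostname

If that alone reproduces the "intersection without inclusion" message, the problem is in topology discovery at startup, not in the application.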
>>>>> <node15.output><node15.tar.bz2>
>
> <node14.output><node14.tar.bz2>

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users