You might also want to check the BIOS rev level on node14, Gus - as Brice 
suggested, it could be that the board came with the wrong firmware.
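If dmidecode is installed on those nodes, something along these lines should show it (run as root; I'm going from memory, so adjust as needed):

dmidecode -s bios-version
dmidecode -s bios-release-date

Running the same on a known-good node would tell you quickly whether node14's firmware differs.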

On Feb 28, 2014, at 11:55 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Brice and Ralph
> 
> Many thanks for helping out with this!
> 
> Yes, you are right about node15 being OK.
> Node15 was a red herring: it was part of the same failed job as node14.
> However, after a closer look, I noticed that the failure reported
> by hwloc was indeed on node14.
> 
> I attach both diagnostic files generated by hwloc-gather-topology on
> node14.
> 
> I will open the node and see if there is anything unusual with the
> hardware, and perhaps reinstall the OS, as Ralph suggested.
> It is puzzling that the other node that had its motherboard replaced
> passes the hwloc-gather-topology test.
> After the motherboard replacement I reinstalled the OS on both,
> but it doesn't hurt to do it again.
> 
> Gus Correa
> 
> 
> 
> 
> On 02/28/2014 03:26 AM, Brice Goglin wrote:
>> Hello Gus,
>> I'll need the tarball generated by gather-topology on node14 to debug
>> this. node15 doesn't have any issue.
>> We've seen issues on AMD machines because of buggy BIOS reporting
>> incompatible Socket and NUMA info. If node14 doesn't have the same BIOS
>> version as other nodes, that could explain things.
>> Brice
>> 
>> 
>> 
>> 
>>> On 02/28/2014 01:39 AM, Gus Correa wrote:
>>> Thank you, Ralph!
>>> 
>>> I did a bit more homework and found out that all jobs that had
>>> the hwloc error involved one specific node (node14).
>>> 
>>> The "report bindings" output in those jobs' stderr show
>>> that node14 systematically failed to bind the processes to the cores,
>>> while other nodes on the same jobs didn't fail.
>>> Interestingly, the jobs continued to run, although they
>>> eventually failed, but much later.
>>> So, the hwloc error doesn't seem to stop the job on its tracks.
>>> As a matter of policy, should it perhaps shutdown the job instead?
>>> 
>>> In addition, when I try the hwloc-gather-topology diagnostic on node14
>>> I get the same error, a bit more verbose (see below).
>>> So, now my guess is that this may be a hardware problem on that node.
>>> 
>>> I replaced two nodes' motherboards last week, including node14's,
>>> and something may have gone wrong on that one.
>>> The other node that had the motherboard replaced
>>> doesn't show the hwloc-gather-topology error, though.
>>> 
>>> Does the error message below (Socket P#0 ...)
>>> suggest anything that I should be looking for on the hardware side?
>>> (Thermal compound on the heatsink, memory modules, etc)
>>> 
>>> Thank you,
>>> Gus Correa
>>> 
>>> 
>>> 
>>> [root@node14 ~]# /usr/bin/hwloc-gather-topology /tmp/$(uname -n)
>>> Hierarchy gathered in /tmp/node14.tar.bz2 and kept in
>>> /tmp/tmp.D46Sdhcnru/node14/
>>> ****************************************************************************
>>> 
>>> * Hwloc has encountered what looks like an error from the operating
>>> system.
>>> *
>>> * object (Socket P#0 cpuset 0x0000ffff) intersection without inclusion!
>>> * Error occurred in topology.c line 718
>>> *
>>> * Please report this error message to the hwloc user's mailing list,
>>> * along with the output from the hwloc-gather-topology.sh script.
>>> ****************************************************************************
>>> 
>>> Expected topology output stored in /tmp/node14.output
>>> 
>>> 
>>> On 02/27/2014 06:39 PM, Ralph Castain wrote:
>>>> The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having
>>>> trouble with those data/instruction cache breakdowns.
>>>> I don't know why it wouldn't have shown up before,
>>>> however, as this looks to be happening when we first try to
>>>> assemble the topology. To check that, what happens if you just run
>>>> "mpiexec hostname" on the local node?
>>>> 
>>>> 
>>>> On Feb 27, 2014, at 3:04 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>> 
>>>>> Dear OMPI pros
>>>>> 
>>>>> This seems to be a question in the no-man's land between OMPI and hwloc.
>>>>> However, it appeared as an OMPI error, hence it may be OK to ask the
>>>>> question on this list.
>>>>> 
>>>>> ***
>>>>> 
>>>>> A user here got this error (or warning?) message today:
>>>>> 
>>>>> + mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>>>> ****************************************************************************
>>>>> 
>>>>> * Hwloc has encountered what looks like an error from the operating
>>>>> system.
>>>>> *
>>>>> * object intersection without inclusion!
>>>>> * Error occurred in topology.c line 594
>>>>> *
>>>>> * Please report this error message to the hwloc user's mailing list,
>>>>> * along with the output from the hwloc-gather-topology.sh script.
>>>>> ****************************************************************************
>>>>> 
>>>>> 
>>>>> Additional info:
>>>>> 
>>>>> 1) We have OMPI 1.6.5. This user is using the one built
>>>>> with Intel compilers 2011.13.367.
>>>>> 
>>>>> 2) I set these MCA parameters in $OMPI/etc/openmpi-mca-params.conf
>>>>> (includes binding to core):
>>>>> 
>>>>> btl = ^tcp
>>>>> orte_tag_output = 1
>>>>> rmaps_base_schedule_policy = core
>>>>> orte_process_binding = core
>>>>> orte_report_bindings = 1
>>>>> opal_paffinity_alone = 1
>>>>> 
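>>>>> If I read the 1.6 docs right, those settings should be roughly
>>>>> equivalent to passing the following on the mpiexec command line
>>>>> (listed here only for reference, not verified):
>>>>> 
>>>>> mpiexec --mca btl ^tcp --tag-output --bycore --bind-to-core \
>>>>>   --report-bindings --mca opal_paffinity_alone 1 -np 64 <executable>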
>>>>> 
>>>>> 3) The machines are dual-socket, with 16-core AMD Opteron 6376
>>>>> (Abu Dhabi) processors, which have one FPU for each pair of cores,
>>>>> a hierarchy of caches serving sub-groups of cores, etc.
>>>>> The OS is Linux CentOS 6.4 with stock CentOS OFED.
>>>>> Interconnect is InfiniBand QDR (Mellanox HW).
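>>>>> For what it's worth, the plain text lister from the CentOS hwloc
>>>>> package, i.e.
>>>>> 
>>>>> hwloc-ls
>>>>> 
>>>>> prints the socket / NUMA node / cache hierarchy as a tree, which should
>>>>> make it easy to eyeball whether node14 reports the same layout as its
>>>>> healthy neighbors.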
>>>>> 
>>>>> 4) We have Torque 4.2.5, built with cpuset support.
>>>>> OMPI is built with Torque (tm) support.
>>>>> 
>>>>> 5) In case it helps, I attach the output of
>>>>> hwloc-gather-topology, which I ran on the node that threw the error,
>>>>> although not immediately after the job failure.
>>>>> I used the hwloc-gather-topology script that comes with
>>>>> the hwloc (version 1.5) provided by CentOS.
>>>>> As far as I can tell, the hwloc bits built into OMPI
>>>>> do not include the hwloc-gather-topology script (although OMPI may
>>>>> embed a newer hwloc version, 1.8 perhaps?).
>>>>> Hopefully the mail servers won't chop off the attachments.
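>>>>> If it matters which hwloc OMPI itself embeds, I believe something like
>>>>> 
>>>>> ompi_info | grep -i hwloc
>>>>> 
>>>>> should list the hwloc component built into this 1.6.5 install, though
>>>>> I haven't double-checked that on these nodes.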
>>>>> 
>>>>> 6) I am a bit surprised by this error message, because I haven't
>>>>> seen it before, although we have used OMPI 1.6.5 on
>>>>> this machine with several other programs without problems.
>>>>> Alas, it happened now.
>>>>> 
>>>>> **
>>>>> 
>>>>> - Is this a known hwloc problem in this processor architecture?
>>>>> 
>>>>> - Is this a known issue in this combination of HW and SW?
>>>>> 
>>>>> - Would not binding the MPI processes (to core or socket) perhaps
>>>>> help? (A quick way to test this is sketched after these questions.)
>>>>> 
>>>>> - Any workarounds or suggestions?
>>>>> 
>>>>> **
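>>>>> Regarding the binding question above: if I understand the option
>>>>> precedence correctly, a quick test that doesn't touch the conf file
>>>>> would be to override the binding on the command line, e.g.
>>>>> 
>>>>> mpiexec -np 64 --bind-to-none $HOME/echam-aiv_ldeo_6.1.00p1/bin/echam6
>>>>> 
>>>>> (untested; I would only use it to see whether the hwloc warning goes away).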
>>>>> 
>>>>> Thank you,
>>>>> Gus Correa
>>>>> <node15.output><node15.tar.bz2>
>>>> 
>>> 
>> 
> 
> <node14.output><node14.tar.bz2>
