Chris Worley wrote:
On Wed, Jul 8, 2009 at 5:02 PM, Erwin Tsaur<[email protected]> wrote:
I didn't realize that the Root Port was seeing the same thing.. :(

Add the same line to pcie_pci.conf

I did, and rebooted:

# tail /kernel/drv/pcie_pci.conf
# Force load driver to support hotplug activity
#
ddi-forceattach=1;

#
# force interrupt priorities to be one
# otherwise this driver binds as bridge device with priority 12
#
interrupt-priorities=1;
pcie_ce_mask=0x1040;

... still errors being logged (see attached).
geesh.. yet another type of CE. Change the mask from 0x1040 to 0x1041. If you get tired of this, change the mask to -1. :)

With this you shouldn't see any more ereports from these devices, unless there was a UE from that link.
Good news is that the leaf device is no longer spamming with CEs.
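For reference, the 0x1040/0x1041 values line up with the PCIe AER Correctable Error Status register layout (bit 6 = Bad TLP, bit 12 = Replay Timer Timeout, bit 0 = Receiver Error), which matches the btlp/rto ereports discussed in this thread. A small sketch decoding a mask under that reading — the mapping of pcie_ce_mask onto the AER bits is an inference, not something documented here:

```python
# Decode a pcie_ce_mask value against the PCIe AER Correctable Error
# Status bit layout. Bit positions are from the PCIe spec; treating
# pcie_ce_mask as this register's layout is an assumption based on the
# 0x1040/0x1041 values used in this thread.
CE_BITS = {
    0:  "Receiver Error",
    6:  "Bad TLP",
    7:  "Bad DLLP",
    8:  "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode_ce_mask(mask):
    """Return the CE names whose bits are set in mask, low bit first."""
    return [name for bit, name in sorted(CE_BITS.items()) if mask >> bit & 1]

print(decode_ce_mask(0x1040))  # the btlp + rto pair
print(decode_ce_mask(0x1041))  # adds the receiver (physical layer) errors
```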

Yes, I've made a lot of progress so far on this, thanks!

You can also limit which RPs' CEs get turned off; see the "driver.conf" man
page.  This will mask 0x1040 on all the RPs.
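A per-root-port entry would follow the same name/parent/unit-address convention shown in the commented-out igb.conf example elsewhere in this thread. Everything in this sketch (device name, parent path, unit-address) is a hypothetical placeholder, not a value from this system:

```
# hypothetical per-device entry in /kernel/drv/pcie_pci.conf;
# name, parent, and unit-address below are placeholders only
name = "pciex8086,3408" parent = "/pci@0,0" unit-address = "1"
pcie_ce_mask=0x1041;
```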

I have to warn again: though they are technically CEs and no damage was
done, there are probably performance impacts.  Masking the CEs won't
correct the problem, but it will save your hard drive and also significantly
improve performance, since the OS won't be interrupted hundreds of times a
second.  Unfortunately I don't know of any fix, unless there is a vendor-specific
method to fix the underlying HW issue.

Intel is usually pretty good about fixing issues like this (unless
it's caused by Supermicro's layout).
Intel is pretty good about this. These are low level physical layer errors, so it could very well be a layout issue.
I'm not too worried about the NIC's performance, as long as it works...
I do need to measure system performance in other respects, so
decreasing the shower of interrupts (CPU overhead) is important.

Chris
Chris Worley wrote:
On Wed, Jul 8, 2009 at 4:29 PM, Erwin Tsaur<[email protected]> wrote:

Chris Worley wrote:

On Wed, Jul 8, 2009 at 3:42 PM, Erwin Tsaur<[email protected]> wrote:


Ok, I severely underestimated how much 1000 lines are, but I got the
info I needed, mostly.  Now I'm wondering if there is an errata in the
Root Port causing this.

/p...@0,0/pci8086,3...@1/pci15d9,10c9, I know it's a NIC device.  Both
nodes are complaining of CE errors.

The best way to disable reporting those 2 CEs is to add the following
line in the driver's .conf file.


This is the igb driver (SUNWigb package).  It doesn't have a conf file.

It's for the Intel 82576 Dual-Port GigE on-board NIC.  This driver
didn't work prior to the update.

So, I made a .conf file thusly and rebooted:

/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;

... no difference.  Still lots of errors reported (last 1000 lines
attached).


It's not picking up the conf property...
According to the pkgdef, I think the correct place is /kernel/drv/igb.conf

It should be in the same place as the igb driver.

Okay, it was there, and I changed it and rebooted:

r...@opensolaris:~# tail /kernel/drv/igb.conf
# For example, if you see,
#       "/p...@0,0/pci10de,5...@d/pci8086,0...@0" 0 "igb"
#       "/p...@0,0/pci10de,5...@d/pci8086,0...@0,1" 1 "igb"
#
# name = "pciex8086,10a7" parent = "/p...@0,0/pci10de,5...@d" unit-address = "0"
# flow_control = 1;
# name = "pciex8086,10a7" parent = "/p...@0,0/pci10de,5...@d" unit-address = "0,1"
# flow_control = 3;
pcie_ce_mask=0x1040;

Still, no joy... the last 1K lines attached.

Thanks,

Chris

Chris


pcie_ce_mask=0x1040;

It requires reboot.

If you need to do it on a live system let me know; the instructions are
a bit more complicated.

Chris Worley wrote:


On Wed, Jul 8, 2009 at 2:57 PM, Erwin Tsaur<[email protected]>
wrote:

Chris Worley wrote:

On Wed, Jul 8, 2009 at 2:42 PM, Erwin Tsaur<[email protected]>
wrote:

Chris Worley wrote:

On Wed, Jul 8, 2009 at 1:10 PM, Erwin Tsaur<[email protected]>
wrote:

Chris Worley wrote:

(Sorry for the misleading "Subject" in the initial post.  Would like
to know a more appropriate place to post, since fm is just the
messenger here.)

More to add: fmadm faulty may be saying something about a bad PCIe
slot or device (is there an "lspci" in OpenSolaris?):

# fmadm faulty
--------------- ------------------------------------  --------------  ---------
TIME            EVENT-ID                              MSG-ID          SEVERITY
--------------- ------------------------------------  --------------  ---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c  PCIEX-8000-KP   Major

Fault class : fault.io.pciex.device-interr-corr max 29%
              fault.io.pciex.bus-linkerr-corr max 14%
Affects     : dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0
              dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0,1
              dev:////p...@0,0/pci8086,3...@1
                  faulted but still in service
FRU         : "MB" (hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
                  faulty

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

How bad is this error?  I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detecting a
HW issue?).

The OS does handle these issues, and unfortunately it is a HW issue.  This
is likely to eventually cause your system to panic or fill up your hard
drive, assuming you are seeing a lot of btlp and rto errors.  If anything,
these errors are a performance killer: not only is the RTO/BTLP error
telling you that many packets require retransmit, the OS also has to
constantly go out and scan and clean up the fabric.

This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.

The errors in OpenSolaris occur if no cards are installed in the bus.

The other OSes don't report any errors w/ or w/o cards in the bus.

This doesn't happen when there are no cards installed, since the error
is literally complaining about packets received between 2 devices.  Are
you sure you are correctly identifying the right slot?

I believe only OpenSolaris even detects these errors, which is why the
other OSes don't report any errors.  It doesn't mean that errors aren't
occurring, though.

It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot.  Is this possible?

No throttling is possible, but you could turn it off, though that's
highly discouraged; it's better to fix the issue.  It really could just
be a badly seated card.

How do I disable the errors?

We need to figure out exactly what your error is first; please provide
the "fmdump -eV" log.  If it is huge, just the last 500-1000 lines
should be enough.
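If the log is too big to eyeball, a quick tally of ereport classes shows which error dominates. This is only a sketch: it assumes the `class = ereport....` member lines that `fmdump -eV` prints, and the sample text below is fabricated for illustration.

```python
# Tally ereport classes in a saved "fmdump -eV" log to see which CE is
# spamming. The "class = ereport..." member format is assumed from
# typical fmdump -eV output; the sample text is illustrative only.
import re
from collections import Counter

def tally_classes(log_text):
    """Return a Counter mapping each ereport class name to its count."""
    return Counter(re.findall(r'class\s*=\s*(ereport\.[\w.]+)', log_text))

sample = """
Jul 08 2009 17:01:02 ereport.io.pci.fabric
        class = ereport.io.pci.fabric
Jul 08 2009 17:01:02 ereport.io.pci.fabric
        class = ereport.io.pci.fabric
"""
print(tally_classes(sample))
```

Run against the real log with `tally_classes(open("/path/to/fmdump.out").read())`.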

Would that produce the same as the incantation shown earlier?

I think all the CE's would produce the same message.  I also need to
know the exact device.

Last 1000 lines (of ~20 million) attached.

There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded.  Everything
else is built into the motherboard.

Thanks,

Chris

<snip>

_______________________________________________
fm-discuss mailing list
[email protected]
