Re: [fm-discuss] errlog growing out of control; PCIe errors on NHM/IOH mobo

Erwin Tsaur Wed, 08 Jul 2009 15:28:28 -0700

Chris Worley wrote:

On Wed, Jul 8, 2009 at 3:42 PM, Erwin Tsaur wrote:

OK, I severely underestimated how much 1000 lines is, but I got the info I
needed, mostly.  Now I'm wondering if there is an erratum in the Root Port
causing this.


/p...@0,0/pci8086,3...@1/pci15d9,10c9, I know it's a NIC device.  Both nodes
are complaining of CE errors.

The best way to disable reporting of those 2 CEs is to add the following
line to the driver's .conf file.


This is the igb driver (SUNWigb package).  It doesn't have a conf file.

It's for the Intel 82576 Dual-Port GigE on-board NIC.  This driver
didn't work prior to the update.

So, I made a .conf file thusly and rebooted:

/usr/kernel/drv# cat >igb.conf
pcie_ce_mask=0x1040;

... no difference.  Still lots of errors reported (attached last 1000 lines).

It's not picking up the conf property...
According to the pkgdef, I think the correct place is /kernel/drv/igb.conf.

It should be in the same directory as the igb driver.
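
A minimal sketch of placing the property next to the driver, assuming the igb driver lives in /kernel/drv as suggested above. install_ce_mask is a hypothetical helper written for illustration, not part of any package. It appends the line only if it is not already present, and a reboot is still needed before the property takes effect.

```shell
# Hypothetical helper: add the CE mask property to igb.conf in the
# given driver directory (default /kernel/drv is an assumption taken
# from the suggestion above).
install_ce_mask() {
    drv_dir=${1:-/kernel/drv}
    conf="$drv_dir/igb.conf"
    line='pcie_ce_mask=0x1040;'

    # Do nothing if the exact line is already in the file.
    if [ -f "$conf" ] && grep -qxF "$line" "$conf"; then
        return 0
    fi

    # Append rather than overwrite, in case the file holds other properties.
    printf '%s\n' "$line" >> "$conf"
}

# Usage (as root, followed by a reboot):
#   install_ce_mask /kernel/drv
```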

Chris

pcie_ce_mask=0x1040;

It requires a reboot.

If you need to do it on a live system let me know, the instructions are a
bit more complicated.
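
For reference, the value 0x1040 can be read against the Correctable Error Mask register layout of the PCIe Advanced Error Reporting capability: 0x1000 is bit 12 (Replay Timer Timeout) and 0x0040 is bit 6 (Bad TLP), which lines up with the RTO/BTLP errors mentioned in this thread. The sketch below assumes pcie_ce_mask uses that same bit layout. decode_ce_mask is a hypothetical helper written for illustration, not an OpenSolaris tool.

```shell
# Hypothetical helper: print the name of each correctable-error bit
# set in a mask. Bit positions follow the PCIe AER Correctable Error
# Status/Mask register; that pcie_ce_mask shares this layout is an
# assumption.
decode_ce_mask() {
    mask=$(( $1 ))
    for entry in \
        "0x0001:receiver_error" \
        "0x0040:bad_tlp" \
        "0x0080:bad_dllp" \
        "0x0100:replay_num_rollover" \
        "0x1000:replay_timer_timeout" \
        "0x2000:advisory_non_fatal"
    do
        bit=${entry%%:*}    # text before the first colon
        name=${entry#*:}    # text after the first colon
        if [ $(( mask & $bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

decode_ce_mask 0x1040
# prints:
#   bad_tlp
#   replay_timer_timeout
```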

Chris Worley wrote:

On Wed, Jul 8, 2009 at 2:57 PM, Erwin Tsaur wrote:

Chris Worley wrote:

On Wed, Jul 8, 2009 at 2:42 PM, Erwin Tsaur wrote:

Chris Worley wrote:

On Wed, Jul 8, 2009 at 1:10 PM, Erwin Tsaur wrote:

Chris Worley wrote:

(Sorry for the misleading "Subject" in the initial post.  I would like
to know a more appropriate place to post, since fm is just the
messenger here.)

More to add: fmadm faulty may be saying something about a bad PCIe
slot or device (is there an "lspci" in OpenSolaris?):

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c  PCIEX-8000-KP  Major

Fault class : fault.io.pciex.device-interr-corr max 29%
              fault.io.pciex.bus-linkerr-corr max 14%
Affects     : dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0
              dev:////p...@0,0/pci8086,3...@1/pci15d9,1...@0,1
              dev:////p...@0,0/pci8086,3...@1
                  faulted but still in service
FRU         : "MB"
              (hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
                  faulty

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

How bad is this error?  I need to put some adapters in, but it sounds
like the OS doesn't handle the NHM's IOH (or is it really detecting a
HW issue?).

The OS does handle these issues, and unfortunately it is a HW issue.  This
is likely to eventually cause your system to panic or fill up your hard
drive.  That assumes you are seeing a lot of btlp and rto errors.  If
anything, these errors are a performance killer.  Not only is the RTO/BTLP
error telling you that many packets require retransmission, the OS also has
to constantly go out and scan and clean up the fabric.

This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.

The errors in OpenSolaris occur if no cards are installed in the bus.

The other OSes don't report any errors w/ or w/o cards in the bus.

This doesn't happen when there are no cards installed, since the error is
literally complaining about packets received between 2 devices.  Are you
sure you are correctly identifying the right slot?

I believe only OpenSolaris even detects these errors, which is why the
other OSes don't report any errors.  It doesn't mean that errors aren't
occurring, though.

It would also be nice to throttle the errlog so it doesn't fill the
disk an hour after boot.  Is this possible?

No throttling is possible, but you could turn it off.  That is highly not
recommended, though; it's better to fix the issue.  It really could just be
a badly seated card.

How do I disable the errors?

We need to figure out exactly what your error is first; please provide the
"fmdump -eV" log.  If it is huge, just the last 500-1000 lines should be
enough.

Would that produce the same as the incantation shown earlier?:

I think all the CEs would produce the same message.  I also need to know
the exact device.

Last 1000 lines (of ~20 million) attached.

There are some boards in the bus at this time, but the same error
occurs w/o them, and their drivers are not yet loaded.  Everything
else is built-in to the motherboard.

Thanks,

Chris

<snip>