On 2010/4/8 0:07, Ragnar Sundblad wrote:
On 6 apr 2010, at 18.51, Ragnar Sundblad wrote:

On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:

On 5 apr 2010, at 06.41, pavan chandrashekar wrote:

Ragnar Sundblad wrote:
Hello,
I wonder if anyone could help me with a pci-e problem.
I have a X4150 running snv_134. It was shipped with a "STK RAID INT"
adaptec/intel/storagetek/sun SAS HBA. The machine also has an
LSI SAS card in another slot, though I don't know if that is
significant in any way.
It might help troubleshooting.

You can try putting the disks behind the LSI SAS HBA and see if you still get 
errors. That will at least tell you if the two errors are manifestations of
the same problem, or separate issues.

You might still have issues with the fabric. You can then remove the HBA that
is throwing errors (STK RAID), put the LSI SAS HBA in the slot where the
STK RAID sat earlier, and check the behaviour.
Maybe this will point at the culprit. If the fabric errors continue with
whatever card is in the currently suspect slot (if it is the slot at all), it is
more likely that the issue is with the fabric.
Thanks! The only problem right now, and for the last few days, is that the
machine is at my workplace, some 10 kilometers away, and we have the
Easter holiday right now. I was hoping to use those days off to have
it run tests all by itself, but I have instead been chasing hidden
Easter eggs inside an Intel design.

I have now discovered that the ereport.io.pci.fabric reports started when
I upgraded from snv_128 to 134; I totally missed that relation before.
There have been some changes in the PCI code around that time that may
or may not be related, for example:
<http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
Whether that means this is a driver glitch or a hardware problem
that has now become visible, and whether it can be ignored or not,
is still far beyond my knowledge.

But I will follow your advice and move the cards around and see what
happens!
I have now swapped the cards. The problem seems to remain almost identical
to before, but if I understand this correctly it is now on another PCI bridge
(I suppose that is what pci8086,2...@2 indicates; maybe I should check out the
chipset documentation).

Can someone please tell me how I can decode the ereport information so
that I can understand what the PCI bridge complains about?
I have now also tried with another SUN_STK_INT controller (with
older firmware, as shipped from Sun), including the riser board from another
X4150, and it gets the same ereports.

I have tried removing the LSI board, and it still behaves the same.

Is there anyone else out there with a Sun X4xxx running snv_134 with
a SUN_STK_INT RAID controller who sees, or doesn't see, this?

For the record, the ereport.io.pci.fabric events appear every
4 minutes and 4 seconds, give or take half a second or so.
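
(A quick way to double-check that interval, assuming the one-event-per-line
"Mon DD HH:MM:SS.ffff class" format that "fmdump -e" prints, as in the listing
further down in this thread, would be something like this rough sketch; the
script name and the assumed year are just placeholders.)

# usage (hypothetical): fmdump -e | python3 fabric_interval.py
import sys
from datetime import datetime

times = []
for line in sys.stdin:
    parts = line.split()
    if len(parts) >= 4 and parts[3] == "ereport.io.pci.fabric":
        # fmdump -e does not print the year; assume one just to build a timestamp
        times.append(datetime.strptime("2010 " + " ".join(parts[:3]),
                                       "%Y %b %d %H:%M:%S.%f"))

for earlier, later in zip(times, times[1:]):
    print(later - earlier)   # should come out at roughly 0:04:04 each time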

Thanks!

/ragge

Hi Ragnar,

The FMA message about "sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0" appears because the aac driver does not support the MODE SENSE command with the Caching mode page. Some userland program wanted to know a disk's write-cache status via the sd driver, so sd requested the Caching mode page from aac. When that failed, sd reported it via FMA, and that was logged. Please file an aac driver bug and I'll fix it.
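
For reference, the cdb field in the uderr ereport quoted further down
(cdb = 0x1a 0x0 0x8 0x0 0x18 0x0) is exactly that request. A minimal sketch of
decoding it against the standard MODE SENSE(6) CDB layout from the SCSI SPC
spec (the helper name is mine, just for illustration):

# Decode a SCSI MODE SENSE(6) CDB (opcode 0x1a); field layout per SPC.
def decode_mode_sense6(cdb):
    assert cdb[0] == 0x1A, "not MODE SENSE(6)"
    return {
        "opcode": hex(cdb[0]),               # 0x1a = MODE SENSE(6)
        "dbd": bool(cdb[1] & 0x08),          # disable block descriptors
        "page_control": (cdb[2] >> 6) & 0x3, # 0 = current values
        "page_code": cdb[2] & 0x3F,          # 0x08 = Caching mode page
        "subpage_code": cdb[3],
        "allocation_length": cdb[4],         # buffer offered (0x18 = 24 bytes)
    }

# The CDB from the ereport: sd asking aac for the Caching mode page (0x08).
# Per the explanation above, aac does not return that page, so sd logs the
# "Mode Sense caching page code mismatch".
print(decode_mode_sense6([0x1A, 0x00, 0x08, 0x00, 0x18, 0x00]))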

Thanks,
Peng

Thanks!

/ragge

Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0x28d9c49528201801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,2...@2
        (end detector)

        bdf = 0x10
        device_id = 0x25e2
        vendor_id = 0x8086
        rev_id = 0xb1
        dev_type = 0x40
        pcie_off = 0x6c
        pcix_off = 0x0
        aer_off = 0x100
        ecc_ver = 0x0
        pci_status = 0x10
        pci_command = 0x147
        pci_bdg_sec_status = 0x0
        pci_bdg_ctrl = 0x3
        pcie_status = 0x0
        pcie_command = 0x2027
        pcie_dev_cap = 0xfc1
        pcie_adv_ctl = 0x0
        pcie_ue_status = 0x0
        pcie_ue_mask = 0x100000
        pcie_ue_sev = 0x62031
        pcie_ue_hdr0 = 0x0
        pcie_ue_hdr1 = 0x0
        pcie_ue_hdr2 = 0x0
        pcie_ue_hdr3 = 0x0
        pcie_ce_status = 0x0
        pcie_ce_mask = 0x0
        pcie_rp_status = 0x0
        pcie_rp_control = 0x7
        pcie_adv_rp_status = 0x0
        pcie_adv_rp_command = 0x7
        pcie_adv_rp_ce_src_id = 0x0
        pcie_adv_rp_ue_src_id = 0x0
        remainder = 0x0
        severity = 0x1
        __ttl = 0x1
        __tod = 0x4bbb6402 0x398f373c
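
(In case it helps anyone else staring at these dumps: the pci_*/pcie_* members
look like raw config-space registers of the detecting bridge, so they can be
decoded against the ordinary PCI/PCIe register layouts. The sketch below is my
own guesswork, not something fmdump provides; bit names follow the PCIe AER
definitions and the table only covers the common bits.)

# Decode a few raw fields from an ereport.io.pci.fabric event.
AER_UE_BITS = {        # AER Uncorrectable Error Status/Mask/Severity bit names
    4:  "Data Link Protocol Error",
    5:  "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

def named_bits(value, table):
    return [name for bit, name in table.items() if value & (1 << bit)] or ["(none)"]

# Values copied from the ereport above.
bdf            = 0x10
pcie_ue_status = 0x0        # uncorrectable errors actually latched by the bridge
pcie_ue_mask   = 0x100000   # uncorrectable errors the bridge is told to ignore
pcie_ue_sev    = 0x62031    # which uncorrectable errors would be treated as fatal

print("bus/dev/fn  = %d/%d/%d" % ((bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7))
print("ue_status   =", ", ".join(named_bits(pcie_ue_status, AER_UE_BITS)))
print("ue_mask     =", ", ".join(named_bits(pcie_ue_mask, AER_UE_BITS)))
print("ue_severity =", ", ".join(named_bits(pcie_ue_sev, AER_UE_BITS)))

For the values above this gives bus 0, device 2, function 0, which matches the
@2 in the detector's device-path, and an empty ue_status, i.e. no AER error
bits latched in the bridge at the time of the report.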



/ragge

Pavan

It logs some errors, as shown with "fmdump -e(V)".
It is most often a PCI bridge error (I think), about five to ten
times an hour, and occasionally a problem with accessing a
mode page on the disks behind the STK RAID controller for
enabling/disabling the disks' write caches, one error for each disk,
about every three hours. I don't believe the two have to be related.
I am especially interested in understanding the ereport.io.pci.fabric
report.
I haven't seen this problem on other more or less identical
machines running sol10.
Is this a known software problem, or do I have faulty hardware?
Thanks!
/ragge
--------------
% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric
Apr 04 01:30:00.6999 ereport.io.pci.fabric
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...
% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
     class = ereport.io.pci.fabric
     ena = 0xd6a00a43be800c01
     detector = (embedded nvlist)
     nvlist version: 0
             version = 0x0
             scheme = dev
             device-path = /p...@0,0/pci8086,2...@4
     (end detector)
     bdf = 0x20
     device_id = 0x25f8
     vendor_id = 0x8086
     rev_id = 0xb1
     dev_type = 0x40
     pcie_off = 0x6c
     pcix_off = 0x0
     aer_off = 0x100
     ecc_ver = 0x0
     pci_status = 0x10
     pci_command = 0x147
     pci_bdg_sec_status = 0x0
     pci_bdg_ctrl = 0x3
     pcie_status = 0x0
     pcie_command = 0x2027
     pcie_dev_cap = 0xfc1
     pcie_adv_ctl = 0x0
     pcie_ue_status = 0x0
     pcie_ue_mask = 0x100000
     pcie_ue_sev = 0x62031
     pcie_ue_hdr0 = 0x0
     pcie_ue_hdr1 = 0x0
     pcie_ue_hdr2 = 0x0
     pcie_ue_hdr3 = 0x0
     pcie_ce_status = 0x0
     pcie_ce_mask = 0x0
     pcie_rp_status = 0x0
     pcie_rp_control = 0x7
     pcie_adv_rp_status = 0x0
     pcie_adv_rp_command = 0x7
     pcie_adv_rp_ce_src_id = 0x0
     pcie_adv_rp_ue_src_id = 0x0
     remainder = 0x0
     severity = 0x1
     __ttl = 0x1
     __tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
     class = ereport.io.scsi.cmd.disk.dev.uderr
     ena = 0xde0cd54f84201c01
     detector = (embedded nvlist)
     nvlist version: 0
             version = 0x0
             scheme = dev
             device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
             devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
     (end detector)
     driver-assessment = fail
     op-code = 0x1a
     cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
     pkt-reason = 0x0
     pkt-state = 0x1f
     pkt-stats = 0x0
     stat-code = 0x0
     un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
     un-decode-value =
     __ttl = 0x1
     __tod = 0x4bb7cf8f 0x1bb3cd13
...

_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
