On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:
>
> On 5 apr 2010, at 06.41, pavan chandrashekar wrote:
>
>> Ragnar Sundblad wrote:
>>> Hello,
>>> I wonder if anyone could help me with a pci-e problem.
>>> I have a X4150 running snv_134. It was shipped with a "STK RAID INT"
>>> adaptec/intel/storagetek/sun SAS HBA. The machine also has a
>>> LSI SAS card in another slot, though I don't know if that is
>>> significant in any way.
>>
>> It might help troubleshooting.
>>
>> You can try putting the disks behind the LSI SAS HBA and see if you still
>> get errors. That will at least tell you whether the two errors are
>> manifestations of the same problem or separate issues.
>>
>> You might still have issues with the fabric. You can then remove the HBA
>> that is throwing errors (STK RAID), put the LSI SAS HBA in the slot the
>> STK RAID occupied earlier, and check the behaviour.
>> Maybe this will point at the culprit. If the fabric errors continue with
>> whatever card is in the currently suspect slot, it is more likely
>> that the issue is with the fabric itself.
>
> Thanks! The only problem right now, and for the last few days, is that the
> machine is at my workplace, some 10 kilometers away, and it is the
> Easter holiday right now. I was hoping to use the days off to let it
> run tests all by itself, but have instead been chasing hidden
> Easter eggs inside an Intel design.
>
> I have now discovered that the ereport.io.pci.fabric errors started when
> I upgraded from snv_128 to 134; I totally missed that connection before.
> There have been some changes in the PCI code around that time that may
> or may not be related, for example:
> <http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
> Whether that means this is a driver glitch, or a hardware problem
> that has now become visible, and whether it can be ignored or not,
> is still far beyond my knowledge.
>
> But I will follow your advice and move the cards around and see what
> happens!
I have now swapped the cards. The problem remains almost identical to
before, but if I understand this correctly it is now reported by another
PCI bridge (I suppose, judging by pci8086,2...@2; maybe I should check
the chipset documentation).
Can someone please tell me how I can decode the ereport information, so
that I can understand what the PCI bridge is complaining about?
Thanks!
/ragge
Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x28d9c49528201801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@2
(end detector)
bdf = 0x10
device_id = 0x25e2
vendor_id = 0x8086
rev_id = 0xb1
dev_type = 0x40
pcie_off = 0x6c
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pci_bdg_sec_status = 0x0
pci_bdg_ctrl = 0x3
pcie_status = 0x0
pcie_command = 0x2027
pcie_dev_cap = 0xfc1
pcie_adv_ctl = 0x0
pcie_ue_status = 0x0
pcie_ue_mask = 0x100000
pcie_ue_sev = 0x62031
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x0
pcie_ce_mask = 0x0
pcie_rp_status = 0x0
pcie_rp_control = 0x7
pcie_adv_rp_status = 0x0
pcie_adv_rp_command = 0x7
pcie_adv_rp_ce_src_id = 0x0
pcie_adv_rp_ue_src_id = 0x0
remainder = 0x0
severity = 0x1
__ttl = 0x1
__tod = 0x4bbb6402 0x398f373c
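Not an authoritative decode, but the register fields in the ereport follow the
standard layouts (PCI 3.0 Status/Command, PCIe Advanced Error Reporting), so
they can be picked apart by hand. A small Python sketch, with bit names taken
from the published specs (the mapping of ereport field names to registers is
my assumption; treat this as a starting point, not gospel):

```python
# Bit positions per the PCI 3.0 Status register and the PCIe AER
# Uncorrectable Error registers. Only the commonly interesting bits
# are listed here.
PCI_STATUS_BITS = {
    4:  "Capabilities List",
    8:  "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

AER_UE_BITS = {
    0:  "Undefined (legacy Training Error)",
    4:  "Data Link Protocol Error",
    5:  "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

def decode(value, table):
    """Return the names of the bits set in value, lowest bit first."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]

# Values copied from the ereport above:
print(decode(0x10, PCI_STATUS_BITS))   # pci_status: only "Capabilities List"
print(decode(0x0, AER_UE_BITS))        # pcie_ue_status: nothing latched
print(decode(0x62031, AER_UE_BITS))    # pcie_ue_sev: which UEs count as fatal
```

If I read that right, pcie_ue_status and pcie_ce_status are both 0 here, so no
AER error was actually latched on this bridge; pcie_ue_sev = 0x62031 only says
which uncorrectable errors would be treated as fatal if they occurred.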
>
> /ragge
>
>>
>> Pavan
>>
>>> It logs some errors, as shown with "fmdump -e" and "fmdump -eV".
>>> It is most often a pci bridge error (I think), about five to ten
>>> times an hour, and occasionally a problem with accessing a
>>> mode page on the disks behind the STK raid controller for
>>> enabling/disabling the disks' write caches, one error for each disk,
>>> about every three hours. I don't believe the two have to be related.
>>> I am especially interested in understanding the ereport.io.pci.fabric
>>> report.
>>> I haven't seen this problem on other more or less identical
>>> machines running sol10.
>>> Is this a known software problem, or do I have faulty hardware?
>>> Thanks!
>>> /ragge
>>> --------------
>>> % fmdump -e
>>> ...
>>> Apr 04 01:21:53.2244 ereport.io.pci.fabric
>>> Apr 04 01:30:00.6999 ereport.io.pci.fabric
>>> Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
>>> Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
>>> ...
>>> % fmdump -eV
>>> Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
>>> nvlist version: 0
>>> class = ereport.io.pci.fabric
>>> ena = 0xd6a00a43be800c01
>>> detector = (embedded nvlist)
>>> nvlist version: 0
>>> version = 0x0
>>> scheme = dev
>>> device-path = /p...@0,0/pci8086,2...@4
>>> (end detector)
>>> bdf = 0x20
>>> device_id = 0x25f8
>>> vendor_id = 0x8086
>>> rev_id = 0xb1
>>> dev_type = 0x40
>>> pcie_off = 0x6c
>>> pcix_off = 0x0
>>> aer_off = 0x100
>>> ecc_ver = 0x0
>>> pci_status = 0x10
>>> pci_command = 0x147
>>> pci_bdg_sec_status = 0x0
>>> pci_bdg_ctrl = 0x3
>>> pcie_status = 0x0
>>> pcie_command = 0x2027
>>> pcie_dev_cap = 0xfc1
>>> pcie_adv_ctl = 0x0
>>> pcie_ue_status = 0x0
>>> pcie_ue_mask = 0x100000
>>> pcie_ue_sev = 0x62031
>>> pcie_ue_hdr0 = 0x0
>>> pcie_ue_hdr1 = 0x0
>>> pcie_ue_hdr2 = 0x0
>>> pcie_ue_hdr3 = 0x0
>>> pcie_ce_status = 0x0
>>> pcie_ce_mask = 0x0
>>> pcie_rp_status = 0x0
>>> pcie_rp_control = 0x7
>>> pcie_adv_rp_status = 0x0
>>> pcie_adv_rp_command = 0x7
>>> pcie_adv_rp_ce_src_id = 0x0
>>> pcie_adv_rp_ue_src_id = 0x0
>>> remainder = 0x0
>>> severity = 0x1
>>> __ttl = 0x1
>>> __tod = 0x4bb7cd91 0xd617cdd
>>> ...
>>> Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
>>> nvlist version: 0
>>> class = ereport.io.scsi.cmd.disk.dev.uderr
>>> ena = 0xde0cd54f84201c01
>>> detector = (embedded nvlist)
>>> nvlist version: 0
>>> version = 0x0
>>> scheme = dev
>>> device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
>>> devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
>>> (end detector)
>>> driver-assessment = fail
>>> op-code = 0x1a
>>> cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>>> pkt-reason = 0x0
>>> pkt-state = 0x1f
>>> pkt-stats = 0x0
>>> stat-code = 0x0
>>> un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
>>> un-decode-value =
>>> __ttl = 0x1
>>> __tod = 0x4bb7cf8f 0x1bb3cd13
>>> ...
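On the uderr side: op-code 0x1a is a 6-byte MODE SENSE per the SCSI primary
commands spec, so the CDB bytes above can be decoded directly. A hedged
sketch (the field breakdown follows the spec; the interpretation of the sd
message is my own reading):

```python
# CDB bytes copied from the uderr ereport above.
cdb = [0x1a, 0x00, 0x08, 0x00, 0x18, 0x00]

opcode    = cdb[0]               # 0x1a = MODE SENSE(6)
dbd       = bool(cdb[1] & 0x08)  # Disable Block Descriptors flag (clear here)
pc        = cdb[2] >> 6          # page control: 0 = current values
page_code = cdb[2] & 0x3f        # 0x08 = Caching mode page
alloc_len = cdb[4]               # 0x18 = 24 bytes of response expected

print(hex(opcode), hex(page_code), alloc_len)
```

The "caching page code mismatch 0" message suggests the STK RAID firmware
answered the MODE SENSE but returned a page whose page-code byte was 0 instead
of 0x08, so sd cannot tell whether the disks' write caches are enabled. Some
RAID HBAs do not implement the caching mode page for their logical volumes, so
this part may well be cosmetic, though that is only a guess.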
>>> _______________________________________________
>>> driver-discuss mailing list
>>> [email protected]
>>> http://mail.opensolaris.org/mailman/listinfo/driver-discuss
>>
>