On 6 apr 2010, at 18.51, Ragnar Sundblad wrote:

>
> On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:
>
>>
>> On 5 apr 2010, at 06.41, pavan chandrashekar wrote:
>>
>>> Ragnar Sundblad wrote:
>>>> Hello,
>>>>
>>>> I wonder if anyone could help me with a pci-e problem.
>>>>
>>>> I have an X4150 running snv_134. It was shipped with a "STK RAID INT"
>>>> adaptec/intel/storagetek/sun SAS HBA. The machine also has an
>>>> LSI SAS card in another slot, though I don't know if that is
>>>> significant in any way.
>>>
>>> It might help troubleshooting.
>>>
>>> You can try putting the disks behind the LSI SAS HBA and see if you
>>> still get errors. That will at least tell you whether the two errors
>>> are manifestations of the same problem, or separate issues.
>>>
>>> You might still have issues with the fabric. You can then take out the
>>> HBA that is throwing errors (STK RAID), put the LSI SAS HBA in the
>>> slot the STK RAID sat in earlier, and check the behaviour.
>>> Maybe this will point at the culprit. If the fabric errors continue
>>> with whatever card is in the currently suspect slot, it is more
>>> probable that the issue is with the fabric.
>>
>> Thanks! The only problem right now and for the last few days is that
>> the machine is at my workplace, some 10 kilometers away, and we have
>> the Easter holiday right now. I was hoping to use those days off
>> having it run tests all by itself, but have instead been chasing
>> hidden Easter eggs inside an Intel design.
>>
>> I have now discovered that the ereport.io.pci.fabric errors started
>> when I upgraded from snv_128 to snv_134; I totally missed that
>> relation before. There have been some changes in the PCI code around
>> that time that may or may not be related, for example:
>> <http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
>> Whether that means this is a driver glitch or a hardware problem that
>> has now become visible, and whether it can be ignored or not, is
>> still far beyond my knowledge.
>>
>> But I will follow your advice and move the cards around and see what
>> happens!
>
> I have now swapped the cards. The problem seems to remain almost
> identical to before, but if I understand this correctly it is now on
> another PCI bridge (I suppose, judging by this: pci8086,2...@2; maybe
> I should check out the chipset documentation).
>
> Can someone please tell me how I can decode the ereport information so
> that I can understand what the PCI bridge is complaining about?
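To partly answer my own question before getting to the new test results:
as far as I can tell, the pcie_* members in the ereport are raw copies of
the device's PCI/PCIe config space registers (pcie_off and aer_off look
like the offsets of the PCIe and AER capabilities), and bdf should be the
usual bus/device/function encoding. Here is a minimal decoding sketch for
the bdf value and the AER uncorrectable-error bitmasks, assuming exactly
that and the standard AER bit layout from the PCIe spec - take it as my
reading, not as anything authoritative:

    /* bdf_aer_decode.c - toy decoder for the bdf and pcie_ue_* fields */
    #include <stdio.h>
    #include <stdlib.h>

    /* AER uncorrectable error bit names, per the PCIe spec (my reading) */
    static const char *ue_bits[32] = {
            [0]  = "Undefined/Link Training Error",
            [4]  = "Data Link Protocol Error",
            [5]  = "Surprise Down Error",
            [12] = "Poisoned TLP",
            [13] = "Flow Control Protocol Error",
            [14] = "Completion Timeout",
            [15] = "Completer Abort",
            [16] = "Unexpected Completion",
            [17] = "Receiver Overflow",
            [18] = "Malformed TLP",
            [19] = "ECRC Error",
            [20] = "Unsupported Request",
    };

    int
    main(int argc, char **argv)
    {
            unsigned long bdf, ue;
            int i;

            if (argc < 3) {
                    fprintf(stderr, "usage: %s <bdf> <ue_reg>\n", argv[0]);
                    return (1);
            }
            bdf = strtoul(argv[1], NULL, 16);
            ue = strtoul(argv[2], NULL, 16);

            /* bdf: bus in bits 15:8, device in 7:3, function in 2:0 */
            printf("bdf 0x%lx = bus %lu, device %lu, function %lu\n",
                bdf, (bdf >> 8) & 0xff, (bdf >> 3) & 0x1f, bdf & 0x7);
            for (i = 0; i < 32; i++)
                    if ((ue & (1UL << i)) && ue_bits[i] != NULL)
                            printf("  bit %2d: %s\n", i, ue_bits[i]);
            return (0);
    }

Fed with the values from the report quoted below, bdf = 0x10 comes out as
bus 0, device 2, function 0, which matches the pci8086,2...@2 device-path,
and pcie_ue_mask = 0x100000 is just bit 20 (Unsupported Request) masked.
pcie_ue_status and pcie_ce_status are both zero, so the AER registers
themselves don't seem to have any error bits latched at the time the
ereport was generated - again, assuming I read the spec right.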
I have now also tried with another SUN_STK_INT controller (with older
firmware, as shipped from Sun), including the riser board, from another
X4150, and it gets the same ereports.

I have tried removing the LSI board, and it still behaves the same.

Is there anyone else out there with a Sun X4xxx running snv_134 with a
SUN_STK_INT raid controller who sees (or doesn't see) this?

For the record, the ereport.io.pci.fabric events appear every 4 minutes
4 seconds, give or take half a second or so.

Thanks!

/ragge

>
> Thanks!
>
> /ragge
>
> Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0x28d9c49528201801
>         detector = (embedded nvlist)
>                 nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /p...@0,0/pci8086,2...@2
>         (end detector)
>
>         bdf = 0x10
>         device_id = 0x25e2
>         vendor_id = 0x8086
>         rev_id = 0xb1
>         dev_type = 0x40
>         pcie_off = 0x6c
>         pcix_off = 0x0
>         aer_off = 0x100
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x147
>         pci_bdg_sec_status = 0x0
>         pci_bdg_ctrl = 0x3
>         pcie_status = 0x0
>         pcie_command = 0x2027
>         pcie_dev_cap = 0xfc1
>         pcie_adv_ctl = 0x0
>         pcie_ue_status = 0x0
>         pcie_ue_mask = 0x100000
>         pcie_ue_sev = 0x62031
>         pcie_ue_hdr0 = 0x0
>         pcie_ue_hdr1 = 0x0
>         pcie_ue_hdr2 = 0x0
>         pcie_ue_hdr3 = 0x0
>         pcie_ce_status = 0x0
>         pcie_ce_mask = 0x0
>         pcie_rp_status = 0x0
>         pcie_rp_control = 0x7
>         pcie_adv_rp_status = 0x0
>         pcie_adv_rp_command = 0x7
>         pcie_adv_rp_ce_src_id = 0x0
>         pcie_adv_rp_ue_src_id = 0x0
>         remainder = 0x0
>         severity = 0x1
>         __ttl = 0x1
>         __tod = 0x4bbb6402 0x398f373c
>
>
>>
>> /ragge
>>
>>>
>>> Pavan
>>>
>>>> It logs some errors, as shown with "fmdump -e(V)".
>>>>
>>>> It is most often a pci bridge error (I think), about five to ten
>>>> times an hour, and occasionally a problem with accessing a mode
>>>> page on the disks behind the STK raid controller for
>>>> enabling/disabling the disks' write caches, one error for each
>>>> disk, about every three hours. I don't believe the two have to be
>>>> related.
>>>>
>>>> I am especially interested in understanding the
>>>> ereport.io.pci.fabric report.
>>>>
>>>> I haven't seen this problem on other more or less identical
>>>> machines running sol10.
>>>>
>>>> Is this a known software problem, or do I have faulty hardware?
>>>>
>>>> Thanks!
>>>>
>>>> /ragge
>>>>
>>>> --------------
>>>>
>>>> % fmdump -e
>>>> ...
>>>> Apr 04 01:21:53.2244 ereport.io.pci.fabric
>>>> Apr 04 01:30:00.6999 ereport.io.pci.fabric
>>>> Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
>>>> Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
>>>> ...
>>>>
>>>> % fmdump -eV
>>>> Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
>>>> nvlist version: 0
>>>>         class = ereport.io.pci.fabric
>>>>         ena = 0xd6a00a43be800c01
>>>>         detector = (embedded nvlist)
>>>>                 nvlist version: 0
>>>>                 version = 0x0
>>>>                 scheme = dev
>>>>                 device-path = /p...@0,0/pci8086,2...@4
>>>>         (end detector)
>>>>         bdf = 0x20
>>>>         device_id = 0x25f8
>>>>         vendor_id = 0x8086
>>>>         rev_id = 0xb1
>>>>         dev_type = 0x40
>>>>         pcie_off = 0x6c
>>>>         pcix_off = 0x0
>>>>         aer_off = 0x100
>>>>         ecc_ver = 0x0
>>>>         pci_status = 0x10
>>>>         pci_command = 0x147
>>>>         pci_bdg_sec_status = 0x0
>>>>         pci_bdg_ctrl = 0x3
>>>>         pcie_status = 0x0
>>>>         pcie_command = 0x2027
>>>>         pcie_dev_cap = 0xfc1
>>>>         pcie_adv_ctl = 0x0
>>>>         pcie_ue_status = 0x0
>>>>         pcie_ue_mask = 0x100000
>>>>         pcie_ue_sev = 0x62031
>>>>         pcie_ue_hdr0 = 0x0
>>>>         pcie_ue_hdr1 = 0x0
>>>>         pcie_ue_hdr2 = 0x0
>>>>         pcie_ue_hdr3 = 0x0
>>>>         pcie_ce_status = 0x0
>>>>         pcie_ce_mask = 0x0
>>>>         pcie_rp_status = 0x0
>>>>         pcie_rp_control = 0x7
>>>>         pcie_adv_rp_status = 0x0
>>>>         pcie_adv_rp_command = 0x7
>>>>         pcie_adv_rp_ce_src_id = 0x0
>>>>         pcie_adv_rp_ue_src_id = 0x0
>>>>         remainder = 0x0
>>>>         severity = 0x1
>>>>         __ttl = 0x1
>>>>         __tod = 0x4bb7cd91 0xd617cdd
>>>> ...
>>>> Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
>>>> nvlist version: 0
>>>>         class = ereport.io.scsi.cmd.disk.dev.uderr
>>>>         ena = 0xde0cd54f84201c01
>>>>         detector = (embedded nvlist)
>>>>                 nvlist version: 0
>>>>                 version = 0x0
>>>>                 scheme = dev
>>>>                 device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
>>>>                 devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
>>>>         (end detector)
>>>>         driver-assessment = fail
>>>>         op-code = 0x1a
>>>>         cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>>>>         pkt-reason = 0x0
>>>>         pkt-state = 0x1f
>>>>         pkt-stats = 0x0
>>>>         stat-code = 0x0
>>>>         un-decode-info = sd_get_write_cache_enabled: Mode Sense
>>>>             caching page code mismatch 0
>>>>         un-decode-value =
>>>>         __ttl = 0x1
>>>>         __tod = 0x4bb7cf8f 0x1bb3cd13
>>>> ...
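(One more note, mostly for my own understanding: if I read the CDB in the
uderr above correctly, it is a plain MODE SENSE(6):

    0x1a  MODE SENSE(6) opcode
    0x00  DBD = 0
    0x08  PC = 0 (current values), page code 0x08 (Caching mode page)
    0x00  subpage code 0
    0x18  allocation length, 24 bytes
    0x00  control

so sd is asking the STK RAID's virtual disk for the caching mode page to
find out whether the write cache is enabled, and the "caching page code
mismatch 0" message seems to say that the data that came back carried
page code 0 instead of the requested 8. That is just my reading of the
SCSI spec and the sd message, though, so corrections welcome.)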
_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
