On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:
>
> On 5 apr 2010, at 06.41, pavan chandrashekar wrote:
>
>> Ragnar Sundblad wrote:
>>> Hello,
>>> I wonder if anyone could help me with a pci-e problem.
>>> I have a X4150 running snv_134. It was shipped with a "STK RAID INT"
>>> adaptec/intel/storagetek/sun SAS HBA. The machine also has a
>>> LSI SAS card in another slot, though I don't know if that is
>>> significant in any way.
>>
>> It might help troubleshooting.
>>
>> You can try putting the disks behind the LSI SAS HBA and see if you still
>> get errors. That will at least tell you whether the two errors are
>> manifestations of the same problem or separate issues.
>>
>> You might still have issues with the fabric. You can then remove the HBA
>> that is throwing errors (STK RAID), put the LSI SAS HBA in the slot the
>> STK RAID occupied earlier, and check the behaviour.
>> Maybe this will point at the culprit. If the fabric errors continue with
>> whatever card is in the currently suspect slot, it is more likely
>> that the issue is with the fabric itself.
>
> Thanks! The only problem right now, and for the last few days, is that the
> machine is at my workplace, some 10 kilometers away, and it is the
> Easter holiday right now. I was hoping to use the days off to let it
> run tests all by itself, but have instead been chasing hidden
> Easter eggs inside an Intel design.
>
> I have now discovered that the ereport.io.pci.fabric errors started when
> I upgraded from snv_128 to 134; I totally missed that connection before.
> There have been some changes in the PCI code around that time that may
> or may not be related, for example:
> <http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
> Whether that means this is a driver glitch, or a hardware problem
> that has now become visible, and whether it can be ignored or not,
> is still far beyond my knowledge.
>
> But I will follow your advice and move the cards around and see what
> happens!
I have now swapped the cards. The problem remains almost identical to
before, but if I understand this correctly it is now reported by another
PCI bridge (I suppose, judging by pci8086,2...@2; maybe I should check
the chipset documentation).
Can someone please tell me how I can decode the ereport information, so
that I can understand what the PCI bridge is complaining about?
Thanks!
/ragge
Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x28d9c49528201801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@2
(end detector)
bdf = 0x10
device_id = 0x25e2
vendor_id = 0x8086
rev_id = 0xb1
dev_type = 0x40
pcie_off = 0x6c
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pci_bdg_sec_status = 0x0
pci_bdg_ctrl = 0x3
pcie_status = 0x0
pcie_command = 0x2027
pcie_dev_cap = 0xfc1
pcie_adv_ctl = 0x0
pcie_ue_status = 0x0
pcie_ue_mask = 0x100000
pcie_ue_sev = 0x62031
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x0
pcie_ce_mask = 0x0
pcie_rp_status = 0x0
pcie_rp_control = 0x7
pcie_adv_rp_status = 0x0
pcie_adv_rp_command = 0x7
pcie_adv_rp_ce_src_id = 0x0
pcie_adv_rp_ue_src_id = 0x0
remainder = 0x0
severity = 0x1
__ttl = 0x1
__tod = 0x4bbb6402 0x398f373c
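Not an authoritative decode, but the register fields in the ereport follow the
standard layouts (PCI 3.0 Status/Command, PCIe Advanced Error Reporting), so
they can be picked apart by hand. A small Python sketch, with bit names taken
from the published specs (the mapping of ereport field names to registers is
my assumption; treat this as a starting point, not gospel):

```python
# Bit positions per the PCI 3.0 Status register and the PCIe AER
# Uncorrectable Error registers. Only the commonly interesting bits
# are listed here.
PCI_STATUS_BITS = {
    4:  "Capabilities List",
    8:  "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

AER_UE_BITS = {
    0:  "Undefined (legacy Training Error)",
    4:  "Data Link Protocol Error",
    5:  "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

def decode(value, table):
    """Return the names of the bits set in value, lowest bit first."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]

# Values copied from the ereport above:
print(decode(0x10, PCI_STATUS_BITS))   # pci_status: only "Capabilities List"
print(decode(0x0, AER_UE_BITS))        # pcie_ue_status: nothing latched
print(decode(0x62031, AER_UE_BITS))    # pcie_ue_sev: which UEs count as fatal
```

If I read that right, pcie_ue_status and pcie_ce_status are both 0 here, so no
AER error was actually latched on this bridge; pcie_ue_sev = 0x62031 only says
which uncorrectable errors would be treated as fatal if they occurred.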
>
> /ragge
>
>>
>> Pavan
>>
>>> It logs some errors, as shown with "fmdump -e" and "fmdump -eV".
>>> It is most often a pci bridge error (I think), about five to ten
>>> times an hour, and occasionally a problem with accessing a
>>> mode page on the disks behind the STK raid controller for
>>> enabling/disabling the disks' write caches, one error for each disk,
>>> about every three hours. I don't believe the two have to be related.
>>> I am especially interested in understanding the ereport.io.pci.fabric
>>> report.
>>> I haven't seen this problem on other more or less identical
>>> machines running sol10.
>>> Is this a known software problem, or do I have faulty hardware?
>>> Thanks!
>>> /ragge
>>> --------------
>>> % fmdump -e
>>> ...
>>> Apr 04 01:21:53.2244 ereport.io.pci.fabric
>>> Apr 04 01:30:00.6999 ereport.io.pci.fabric
>>> Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
>>> Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
>>> ...
>>> % fmdump -eV
>>> Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
>>> nvlist version: 0
>>> class = ereport.io.pci.fabric
>>> ena = 0xd6a00a43be800c01
>>> detector = (embedded nvlist)
>>> nvlist version: 0
>>> version = 0x0
>>> scheme = dev
>>> device-path = /p...@0,0/pci8086,2...@4
>>> (end detector)
>>> bdf = 0x20
>>> device_id = 0x25f8
>>> vendor_id = 0x8086
>>> rev_id = 0xb1
>>> dev_type = 0x40
>>> pcie_off = 0x6c
>>> pcix_off = 0x0
>>> aer_off = 0x100
>>> ecc_ver = 0x0
>>> pci_status = 0x10
>>> pci_command = 0x147
>>> pci_bdg_sec_status = 0x0
>>> pci_bdg_ctrl = 0x3
>>> pcie_status = 0x0
>>> pcie_command = 0x2027
>>> pcie_dev_cap = 0xfc1
>>> pcie_adv_ctl = 0x0
>>> pcie_ue_status = 0x0
>>> pcie_ue_mask = 0x100000
>>> pcie_ue_sev = 0x62031
>>> pcie_ue_hdr0 = 0x0
>>> pcie_ue_hdr1 = 0x0
>>> pcie_ue_hdr2 = 0x0
>>> pcie_ue_hdr3 = 0x0
>>> pcie_ce_status = 0x0
>>> pcie_ce_mask = 0x0
>>> pcie_rp_status = 0x0
>>> pcie_rp_control = 0x7
>>> pcie_adv_rp_status = 0x0
>>> pcie_adv_rp_command = 0x7
>>> pcie_adv_rp_ce_src_id = 0x0
>>> pcie_adv_rp_ue_src_id = 0x0
>>> remainder = 0x0
>>> severity = 0x1
>>> __ttl = 0x1
>>> __tod = 0x4bb7cd91 0xd617cdd
>>> ...
>>> Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
>>> nvlist version: 0
>>> class = ereport.io.scsi.cmd.disk.dev.uderr
>>> ena = 0xde0cd54f84201c01
>>> detector = (embedded nvlist)
>>> nvlist version: 0
>>> version = 0x0
>>> scheme = dev
>>> device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
>>> devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
>>> (end detector)
>>> driver-assessment = fail
>>> op-code = 0x1a
>>> cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>>> pkt-reason = 0x0
>>> pkt-state = 0x1f
>>> pkt-stats = 0x0
>>> stat-code = 0x0
>>> un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
>>> un-decode-value =
>>> __ttl = 0x1
>>> __tod = 0x4bb7cf8f 0x1bb3cd13
>>> ...
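On the uderr side: op-code 0x1a is a 6-byte MODE SENSE per the SCSI primary
commands spec, so the CDB bytes above can be decoded directly. A hedged
sketch (the field breakdown follows the spec; the interpretation of the sd
message is my own reading):

```python
# CDB bytes copied from the uderr ereport above.
cdb = [0x1a, 0x00, 0x08, 0x00, 0x18, 0x00]

opcode    = cdb[0]               # 0x1a = MODE SENSE(6)
dbd       = bool(cdb[1] & 0x08)  # Disable Block Descriptors flag (clear here)
pc        = cdb[2] >> 6          # page control: 0 = current values
page_code = cdb[2] & 0x3f        # 0x08 = Caching mode page
alloc_len = cdb[4]               # 0x18 = 24 bytes of response expected

print(hex(opcode), hex(page_code), alloc_len)
```

The "caching page code mismatch 0" message suggests the STK RAID firmware
answered the MODE SENSE but returned a page whose page-code byte was 0 instead
of 0x08, so sd cannot tell whether the disks' write caches are enabled. Some
RAID HBAs do not implement the caching mode page for their logical volumes, so
this part may well be cosmetic, though that is only a guess.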
>>> _______________________________________________
>>> driver-discuss mailing list
>>> [email protected]
>>> http://mail.opensolaris.org/mailman/listinfo/driver-discuss
>>
>