On 6 apr 2010, at 18.51, Ragnar Sundblad wrote:

>
> On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:
>
>>
>> On 5 apr 2010, at 06.41, pavan chandrashekar wrote:
>>
>>> Ragnar Sundblad wrote:
>>>> Hello,
>>>>
>>>> I wonder if anyone could help me with a pci-e problem.
>>>>
>>>> I have an X4150 running snv_134. It was shipped with a "STK RAID INT"
>>>> adaptec/intel/storagetek/sun SAS HBA. The machine also has an
>>>> LSI SAS card in another slot, though I don't know if that is
>>>> significant in any way.
>>>
>>> It might help troubleshooting.
>>>
>>> You can try putting the disks behind the LSI SAS HBA and see if you
>>> still get errors. That will at least tell you whether the two errors
>>> are manifestations of the same problem, or separate issues.
>>>
>>> You might still have issues with the fabric. You can then take out the
>>> HBA that is throwing errors (STK RAID), put the LSI SAS HBA in the
>>> slot the STK RAID sat in earlier, and check the behaviour.
>>> Maybe this will point at the culprit. If the fabric errors continue
>>> with whatever card is in the currently suspect slot, it is more
>>> probable that the issue is with the fabric.
>>
>> Thanks! The only problem right now and for the last few days is that
>> the machine is at my workplace, some 10 kilometers away, and we have
>> the Easter holiday right now. I was hoping to use those days off
>> having it run tests all by itself, but have instead been chasing
>> hidden Easter eggs inside an Intel design.
>>
>> I have now discovered that the ereport.io.pci.fabric errors started
>> when I upgraded from snv_128 to snv_134; I totally missed that
>> relation before. There have been some changes in the PCI code around
>> that time that may or may not be related, for example:
>> <http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
>> Whether that means this is a driver glitch or a hardware problem that
>> has now become visible, and whether it can be ignored or not, is
>> still far beyond my knowledge.
>>
>> But I will follow your advice and move the cards around and see what
>> happens!
>
> I have now swapped the cards. The problem seems to remain almost
> identical to before, but if I understand this correctly it is now on
> another PCI bridge (I suppose, judging by this: pci8086,2...@2; maybe
> I should check out the chipset documentation).
>
> Can someone please tell me how I can decode the ereport information so
> that I can understand what the PCI bridge is complaining about?
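To partly answer my own question before getting to the new test results:
as far as I can tell, the pcie_* members in the ereport are raw copies of
the device's PCI/PCIe config space registers (pcie_off and aer_off look
like the offsets of the PCIe and AER capabilities), and bdf should be the
usual bus/device/function encoding. Here is a minimal decoding sketch for
the bdf value and the AER uncorrectable-error bitmasks, assuming exactly
that and the standard AER bit layout from the PCIe spec - take it as my
reading, not as anything authoritative:

    /* bdf_aer_decode.c - toy decoder for the bdf and pcie_ue_* fields */
    #include <stdio.h>
    #include <stdlib.h>

    /* AER uncorrectable error bit names, per the PCIe spec (my reading) */
    static const char *ue_bits[32] = {
            [0]  = "Undefined/Link Training Error",
            [4]  = "Data Link Protocol Error",
            [5]  = "Surprise Down Error",
            [12] = "Poisoned TLP",
            [13] = "Flow Control Protocol Error",
            [14] = "Completion Timeout",
            [15] = "Completer Abort",
            [16] = "Unexpected Completion",
            [17] = "Receiver Overflow",
            [18] = "Malformed TLP",
            [19] = "ECRC Error",
            [20] = "Unsupported Request",
    };

    int
    main(int argc, char **argv)
    {
            unsigned long bdf, ue;
            int i;

            if (argc < 3) {
                    fprintf(stderr, "usage: %s <bdf> <ue_reg>\n", argv[0]);
                    return (1);
            }
            bdf = strtoul(argv[1], NULL, 16);
            ue = strtoul(argv[2], NULL, 16);

            /* bdf: bus in bits 15:8, device in 7:3, function in 2:0 */
            printf("bdf 0x%lx = bus %lu, device %lu, function %lu\n",
                bdf, (bdf >> 8) & 0xff, (bdf >> 3) & 0x1f, bdf & 0x7);
            for (i = 0; i < 32; i++)
                    if ((ue & (1UL << i)) && ue_bits[i] != NULL)
                            printf("  bit %2d: %s\n", i, ue_bits[i]);
            return (0);
    }

Fed with the values from the report quoted below, bdf = 0x10 comes out as
bus 0, device 2, function 0, which matches the pci8086,2...@2 device-path,
and pcie_ue_mask = 0x100000 is just bit 20 (Unsupported Request) masked.
pcie_ue_status and pcie_ce_status are both zero, so the AER registers
themselves don't seem to have any error bits latched at the time the
ereport was generated - again, assuming I read the spec right.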
I have now also tried with another SUN_STK_INT controller (with older
firmware, as shipped from Sun), including the riser board, from another
X4150, and it gets the same ereports.

I have tried removing the LSI board, and it still behaves the same.

Is there anyone else out there with a Sun X4xxx running snv_134 with a
SUN_STK_INT raid controller who sees (or doesn't see) this?

For the record, the ereport.io.pci.fabric events appear every 4 minutes
4 seconds, give or take half a second or so.

Thanks!

/ragge

>
> Thanks!
>
> /ragge
>
> Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0x28d9c49528201801
>         detector = (embedded nvlist)
>                 nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /p...@0,0/pci8086,2...@2
>         (end detector)
>
>         bdf = 0x10
>         device_id = 0x25e2
>         vendor_id = 0x8086
>         rev_id = 0xb1
>         dev_type = 0x40
>         pcie_off = 0x6c
>         pcix_off = 0x0
>         aer_off = 0x100
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x147
>         pci_bdg_sec_status = 0x0
>         pci_bdg_ctrl = 0x3
>         pcie_status = 0x0
>         pcie_command = 0x2027
>         pcie_dev_cap = 0xfc1
>         pcie_adv_ctl = 0x0
>         pcie_ue_status = 0x0
>         pcie_ue_mask = 0x100000
>         pcie_ue_sev = 0x62031
>         pcie_ue_hdr0 = 0x0
>         pcie_ue_hdr1 = 0x0
>         pcie_ue_hdr2 = 0x0
>         pcie_ue_hdr3 = 0x0
>         pcie_ce_status = 0x0
>         pcie_ce_mask = 0x0
>         pcie_rp_status = 0x0
>         pcie_rp_control = 0x7
>         pcie_adv_rp_status = 0x0
>         pcie_adv_rp_command = 0x7
>         pcie_adv_rp_ce_src_id = 0x0
>         pcie_adv_rp_ue_src_id = 0x0
>         remainder = 0x0
>         severity = 0x1
>         __ttl = 0x1
>         __tod = 0x4bbb6402 0x398f373c
>
>
>>
>> /ragge
>>
>>>
>>> Pavan
>>>
>>>> It logs some errors, as shown with "fmdump -e(V)".
>>>>
>>>> It is most often a pci bridge error (I think), about five to ten
>>>> times an hour, and occasionally a problem with accessing a mode
>>>> page on the disks behind the STK raid controller for
>>>> enabling/disabling the disks' write caches, one error for each
>>>> disk, about every three hours. I don't believe the two have to be
>>>> related.
>>>>
>>>> I am especially interested in understanding the
>>>> ereport.io.pci.fabric report.
>>>>
>>>> I haven't seen this problem on other more or less identical
>>>> machines running sol10.
>>>>
>>>> Is this a known software problem, or do I have faulty hardware?
>>>>
>>>> Thanks!
>>>>
>>>> /ragge
>>>>
>>>> --------------
>>>>
>>>> % fmdump -e
>>>> ...
>>>> Apr 04 01:21:53.2244 ereport.io.pci.fabric
>>>> Apr 04 01:30:00.6999 ereport.io.pci.fabric
>>>> Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
>>>> Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
>>>> ...
>>>>
>>>> % fmdump -eV
>>>> Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
>>>> nvlist version: 0
>>>>         class = ereport.io.pci.fabric
>>>>         ena = 0xd6a00a43be800c01
>>>>         detector = (embedded nvlist)
>>>>                 nvlist version: 0
>>>>                 version = 0x0
>>>>                 scheme = dev
>>>>                 device-path = /p...@0,0/pci8086,2...@4
>>>>         (end detector)
>>>>         bdf = 0x20
>>>>         device_id = 0x25f8
>>>>         vendor_id = 0x8086
>>>>         rev_id = 0xb1
>>>>         dev_type = 0x40
>>>>         pcie_off = 0x6c
>>>>         pcix_off = 0x0
>>>>         aer_off = 0x100
>>>>         ecc_ver = 0x0
>>>>         pci_status = 0x10
>>>>         pci_command = 0x147
>>>>         pci_bdg_sec_status = 0x0
>>>>         pci_bdg_ctrl = 0x3
>>>>         pcie_status = 0x0
>>>>         pcie_command = 0x2027
>>>>         pcie_dev_cap = 0xfc1
>>>>         pcie_adv_ctl = 0x0
>>>>         pcie_ue_status = 0x0
>>>>         pcie_ue_mask = 0x100000
>>>>         pcie_ue_sev = 0x62031
>>>>         pcie_ue_hdr0 = 0x0
>>>>         pcie_ue_hdr1 = 0x0
>>>>         pcie_ue_hdr2 = 0x0
>>>>         pcie_ue_hdr3 = 0x0
>>>>         pcie_ce_status = 0x0
>>>>         pcie_ce_mask = 0x0
>>>>         pcie_rp_status = 0x0
>>>>         pcie_rp_control = 0x7
>>>>         pcie_adv_rp_status = 0x0
>>>>         pcie_adv_rp_command = 0x7
>>>>         pcie_adv_rp_ce_src_id = 0x0
>>>>         pcie_adv_rp_ue_src_id = 0x0
>>>>         remainder = 0x0
>>>>         severity = 0x1
>>>>         __ttl = 0x1
>>>>         __tod = 0x4bb7cd91 0xd617cdd
>>>> ...
>>>> Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
>>>> nvlist version: 0
>>>>         class = ereport.io.scsi.cmd.disk.dev.uderr
>>>>         ena = 0xde0cd54f84201c01
>>>>         detector = (embedded nvlist)
>>>>                 nvlist version: 0
>>>>                 version = 0x0
>>>>                 scheme = dev
>>>>                 device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
>>>>                 devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
>>>>         (end detector)
>>>>         driver-assessment = fail
>>>>         op-code = 0x1a
>>>>         cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>>>>         pkt-reason = 0x0
>>>>         pkt-state = 0x1f
>>>>         pkt-stats = 0x0
>>>>         stat-code = 0x0
>>>>         un-decode-info = sd_get_write_cache_enabled: Mode Sense
>>>>             caching page code mismatch 0
>>>>         un-decode-value =
>>>>         __ttl = 0x1
>>>>         __tod = 0x4bb7cf8f 0x1bb3cd13
>>>> ...
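(One more note, mostly for my own understanding: if I read the CDB in the
uderr above correctly, it is a plain MODE SENSE(6):

    0x1a  MODE SENSE(6) opcode
    0x00  DBD = 0
    0x08  PC = 0 (current values), page code 0x08 (Caching mode page)
    0x00  subpage code 0
    0x18  allocation length, 24 bytes
    0x00  control

so sd is asking the STK RAID's virtual disk for the caching mode page to
find out whether the write cache is enabled, and the "caching page code
mismatch 0" message seems to say that the data that came back carried
page code 0 instead of the requested 8. That is just my reading of the
SCSI spec and the sd message, though, so corrections welcome.)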
_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
