Thanks Peng!

I believe it is zfs that tries to get/set the cache status in
this case.

I have filed the bugs: CR 6941996 (and also CR 6942004).

You don't happen to have any more information on the PCI bridge
error (ereport.io.pci.fabric)? After my tests, with two different
SUN-STK-INT cards in two different slots, I believe it is actually
related to the SUN-STK-INT card.

/ragge

On 8 apr 2010, at 08.44, Peng Liu wrote:

> On 2010/4/8 0:07, Ragnar Sundblad wrote:
>> On 6 apr 2010, at 18.51, Ragnar Sundblad wrote:
>> 
>>   
>>> On 5 apr 2010, at 11.55, Ragnar Sundblad wrote:
>>> 
>>>     
>>>> On 5 apr 2010, at 06.41, pavan chandrashekar wrote:
>>>> 
>>>>       
>>>>> Ragnar Sundblad wrote:
>>>>>         
>>>>>> Hello,
>>>>>> I wonder if anyone could help me with a pci-e problem.
>>>>>> I have a X4150 running snv_134. It was shipped with a "STK RAID INT"
>>>>>> adaptec/intel/storagetek/sun SAS HBA. The machine also has a
>>>>>> LSI SAS card in another slot, though I don't know if that is
>>>>>> significant in any way.
>>>>>>           
>>>>> It might help troubleshooting.
>>>>> 
>>>>> You can try putting the disks behind the LSI SAS HBA and see if you still 
>>>>> get errors. That will at the least tell you if the two errors are 
>>>>> manifestations of the same problem, or separate issues.
>>>>> 
>>>>> You might still have issues with the fabric. You can then take off the 
>>>>> HBA that is throwing errors (STK RAID) and put the LSI SAS HBA on the 
>>>>> slot on which the STK RAID rested earlier and check the behaviour.
>>>>> Maybe, this will point at the culprit. If the fabric errors continue with 
>>>>> what ever card on the currently faulty slot (if at all it is), it is more 
>>>>> probable that the issue is with the fabric.
>>>>>         
>>>> Thanks! The only problem right now and for the last few days is that the
>>>> machine is at my workplace, some 10 kilometers away, and we have the
>>>> Easter holiday right now. I was hoping to use those days off to have
>>>> it running tests all by itself, but have instead been chasing hidden
>>>> Easter eggs inside an Intel design.
>>>> 
>>>> I have now discovered that the ereport.io.pci.fabric started when
>>>> I upgraded from snv_128 to 134; I totally missed that relation before.
>>>> There have been some changes in the PCI code around that time that may
>>>> or may not be related, for example:
>>>> <http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/cmd/fm/modules/common/fabric-xlate/fabric-xlate.c>
>>>> If that means that this is a driver glitch or a hardware problem
>>>> that now became visible, and whether it can be ignored or not,
>>>> is still far beyond my knowledge.
>>>> 
>>>> But I will follow your advice and move the cards around and see what
>>>> happens!
>>>>       
>>> I have now swapped the cards. The problem seems to remain almost identical
>>> to before, but if I understand this correctly it is now on another PCI bridge
>>> (I suppose by this: pci8086,2...@2; maybe I should check out the chipset
>>> documentation).
>>> 
>>> Can someone please tell me how I can decode the ereport information so
>>> that I can understand what the PCI bridge complains about?
>>>     
>> I have now also tried with another SUN_STK_INT controller (with
>> older firmware, as shipped from Sun), including the riser board from
>> another X4150, and it gets the same ereports.
>> 
>> I have tried removing the LSI board, and it still behaves the same.
>> 
>> Is there anyone else out there with a Sun X4xxx running snv_134 with
>> a SUN_STK_INT raid controller who sees, or doesn't see, this?
>> 
>> For the record, the ereport.io.pci.fabric events appear every
>> 4 minutes and 4 seconds, give or take half a second or so.
>> 
>> Thanks!
>> 
>> /ragge
>> 
>>   
> Hi Ragnar,
> 
> The FMA message "sd_get_write_cache_enabled: Mode Sense caching page 
> code mismatch 0" appears because the aac driver does not support the 
> MODE SENSE command with the Caching mode page. Some userland program 
> wanted to know a disk's write-cache status via the sd driver, so sd 
> requested the Caching mode page from aac. When that failed, sd reported 
> it via FMA, and that was logged. Please file an aac driver bug and I'll 
> fix it.
> 
> Thanks,
> Peng
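[Editorial aside: the uderr ereport below records the failing command as cdb = 0x1a 0x0 0x8 0x0 0x18 0x0, which can be decoded by hand. A minimal sketch, assuming the standard SCSI MODE SENSE(6) CDB layout; this is consistent with Peng's explanation, since sd is asking for page 0x08 (Caching) and the "page code mismatch 0" text suggests aac's response carried page code 0 instead:]

```python
# Decode the 6-byte CDB logged in the uderr ereport.
# MODE SENSE(6) layout: byte 0 = opcode, byte 2 = PC (bits 7-6) plus
# page code (bits 5-0), byte 4 = allocation length.
cdb = [0x1A, 0x00, 0x08, 0x00, 0x18, 0x00]

opcode = cdb[0]            # 0x1A = MODE SENSE(6)
page_code = cdb[2] & 0x3F  # 0x08 = Caching mode page
alloc_len = cdb[4]         # 0x18 = 24 bytes requested

print(hex(opcode), hex(page_code), alloc_len)  # 0x1a 0x8 24
```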
> 
>>   
>>> Thanks!
>>> 
>>> /ragge
>>> 
>>> Apr 06 2010 18:40:34.965687100 ereport.io.pci.fabric
>>> nvlist version: 0
>>>        class = ereport.io.pci.fabric
>>>        ena = 0x28d9c49528201801
>>>        detector = (embedded nvlist)
>>>        nvlist version: 0
>>>                version = 0x0
>>>                scheme = dev
>>>                device-path = /p...@0,0/pci8086,2...@2
>>>        (end detector)
>>> 
>>>        bdf = 0x10
>>>        device_id = 0x25e2
>>>        vendor_id = 0x8086
>>>        rev_id = 0xb1
>>>        dev_type = 0x40
>>>        pcie_off = 0x6c
>>>        pcix_off = 0x0
>>>        aer_off = 0x100
>>>        ecc_ver = 0x0
>>>        pci_status = 0x10
>>>        pci_command = 0x147
>>>        pci_bdg_sec_status = 0x0
>>>        pci_bdg_ctrl = 0x3
>>>        pcie_status = 0x0
>>>        pcie_command = 0x2027
>>>        pcie_dev_cap = 0xfc1
>>>        pcie_adv_ctl = 0x0
>>>        pcie_ue_status = 0x0
>>>        pcie_ue_mask = 0x100000
>>>        pcie_ue_sev = 0x62031
>>>        pcie_ue_hdr0 = 0x0
>>>        pcie_ue_hdr1 = 0x0
>>>        pcie_ue_hdr2 = 0x0
>>>        pcie_ue_hdr3 = 0x0
>>>        pcie_ce_status = 0x0
>>>        pcie_ce_mask = 0x0
>>>        pcie_rp_status = 0x0
>>>        pcie_rp_control = 0x7
>>>        pcie_adv_rp_status = 0x0
>>>        pcie_adv_rp_command = 0x7
>>>        pcie_adv_rp_ce_src_id = 0x0
>>>        pcie_adv_rp_ue_src_id = 0x0
>>>        remainder = 0x0
>>>        severity = 0x1
>>>        __ttl = 0x1
>>>        __tod = 0x4bbb6402 0x398f373c
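[Editorial aside: the register values in a fabric ereport can be decoded bit by bit against the PCI/PCIe register layouts. A sketch for pcie_ue_sev = 0x62031, with bit names taken from the PCIe AER Uncorrectable Error Severity register; note that in the dump above pcie_ue_status and pcie_ce_status are both 0, i.e. no uncorrectable or correctable error is actually latched, and the severity value is just the configured fatal/non-fatal split:]

```python
# AER Uncorrectable Error Severity register bit names (PCIe spec);
# a set bit means that error class is treated as fatal.
UE_BITS = {
    0: "Undefined/Link Training Error",
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

def decode(value, table):
    """Return the names of all set bits the table knows about."""
    return [name for bit, name in sorted(table.items()) if value >> bit & 1]

print(decode(0x62031, UE_BITS))
```

The same pattern applies to pci_status, pci_bdg_sec_status, and friends, with the bit tables swapped for the matching register layout.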
>>> 
>>> 
>>> 
>>>     
>>>> /ragge
>>>> 
>>>>       
>>>>> Pavan
>>>>> 
>>>>>         
>>>>>> It logs some errors, as shown with "fmdump -e(V)".
>>>>>> It is most often a pci bridge error (I think), about five to ten
>>>>>> times an hour, and occasionally a problem with accessing a
>>>>>> mode page on the disks behind the STK raid controller for
>>>>>> enabling/disabling the disks' write caches, one error for each disk,
>>>>>> about every three hours. I don't believe the two have to be related.
>>>>>> I am especially interested in understanding the ereport.io.pci.fabric
>>>>>> report.
>>>>>> I haven't seen this problem on other more or less identical
>>>>>> machines running sol10.
>>>>>> Is this a known software problem, or do I have faulty hardware?
>>>>>> Thanks!
>>>>>> /ragge
>>>>>> --------------
>>>>>> % fmdump -e
>>>>>> ...
>>>>>> Apr 04 01:21:53.2244 ereport.io.pci.fabric
>>>>>> Apr 04 01:30:00.6999 ereport.io.pci.fabric
>>>>>> Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
>>>>>> Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
>>>>>> ...
>>>>>> % fmdump -eV
>>>>>> Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
>>>>>> nvlist version: 0
>>>>>>     class = ereport.io.pci.fabric
>>>>>>     ena = 0xd6a00a43be800c01
>>>>>>     detector = (embedded nvlist)
>>>>>>     nvlist version: 0
>>>>>>             version = 0x0
>>>>>>             scheme = dev
>>>>>>             device-path = /p...@0,0/pci8086,2...@4
>>>>>>     (end detector)
>>>>>>     bdf = 0x20
>>>>>>     device_id = 0x25f8
>>>>>>     vendor_id = 0x8086
>>>>>>     rev_id = 0xb1
>>>>>>     dev_type = 0x40
>>>>>>     pcie_off = 0x6c
>>>>>>     pcix_off = 0x0
>>>>>>     aer_off = 0x100
>>>>>>     ecc_ver = 0x0
>>>>>>     pci_status = 0x10
>>>>>>     pci_command = 0x147
>>>>>>     pci_bdg_sec_status = 0x0
>>>>>>     pci_bdg_ctrl = 0x3
>>>>>>     pcie_status = 0x0
>>>>>>     pcie_command = 0x2027
>>>>>>     pcie_dev_cap = 0xfc1
>>>>>>     pcie_adv_ctl = 0x0
>>>>>>     pcie_ue_status = 0x0
>>>>>>     pcie_ue_mask = 0x100000
>>>>>>     pcie_ue_sev = 0x62031
>>>>>>     pcie_ue_hdr0 = 0x0
>>>>>>     pcie_ue_hdr1 = 0x0
>>>>>>     pcie_ue_hdr2 = 0x0
>>>>>>     pcie_ue_hdr3 = 0x0
>>>>>>     pcie_ce_status = 0x0
>>>>>>     pcie_ce_mask = 0x0
>>>>>>     pcie_rp_status = 0x0
>>>>>>     pcie_rp_control = 0x7
>>>>>>     pcie_adv_rp_status = 0x0
>>>>>>     pcie_adv_rp_command = 0x7
>>>>>>     pcie_adv_rp_ce_src_id = 0x0
>>>>>>     pcie_adv_rp_ue_src_id = 0x0
>>>>>>     remainder = 0x0
>>>>>>     severity = 0x1
>>>>>>     __ttl = 0x1
>>>>>>     __tod = 0x4bb7cd91 0xd617cdd
>>>>>> ...
>>>>>> Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
>>>>>> nvlist version: 0
>>>>>>     class = ereport.io.scsi.cmd.disk.dev.uderr
>>>>>>     ena = 0xde0cd54f84201c01
>>>>>>     detector = (embedded nvlist)
>>>>>>     nvlist version: 0
>>>>>>             version = 0x0
>>>>>>             scheme = dev
>>>>>>             device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
>>>>>>             devid = id1,s...@tsun_____stk_raid_int____ea4b6f24
>>>>>>     (end detector)
>>>>>>     driver-assessment = fail
>>>>>>     op-code = 0x1a
>>>>>>     cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
>>>>>>     pkt-reason = 0x0
>>>>>>     pkt-state = 0x1f
>>>>>>     pkt-stats = 0x0
>>>>>>     stat-code = 0x0
>>>>>>     un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
>>>>>>     un-decode-value =
>>>>>>     __ttl = 0x1
>>>>>>     __tod = 0x4bb7cf8f 0x1bb3cd13
>>>>>> ...
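[Editorial aside: the __tod pair in each ereport appears to be seconds and nanoseconds since the Unix epoch, which is what fmdump renders as the printed timestamp. A quick sketch using the values from the first fabric ereport above; 0x4bb7cd91 is 2010-04-03 23:21:53 UTC, i.e. Apr 04 01:21:53 CEST as printed:]

```python
import datetime

# __tod = 0x4bb7cd91 0xd617cdd from the fabric ereport above:
# first word = seconds since the Unix epoch, second word = nanoseconds.
seconds, nanoseconds = 0x4BB7CD91, 0x0D617CDD

stamp = datetime.datetime.fromtimestamp(seconds, datetime.timezone.utc)
print(stamp, nanoseconds)  # 2010-04-03 23:21:53+00:00 224492765
```

The nanosecond word, 224492765, matches the fractional part of the printed "Apr 04 2010 01:21:53.224492765" exactly.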
>>>>>> _______________________________________________
>>>>>> driver-discuss mailing list
>>>>>> [email protected]
>>>>>> http://mail.opensolaris.org/mailman/listinfo/driver-discuss
>>>>>>           
>>>>>         
>>>>       
>>>     
>>   
> 
