That's making sense now.

One server had an LSI HBA die right after we got it.  That corresponds to
the time of its errors.

On the other server, I've been seeing errors reported on the SAS bus,
but the cause could very well be the HBA.

There are three LSI 9206-16e HBAs in each host.  Possibly one is flaking
out or simply needs a reset.
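
To see which card an ereport points at, the bdf and source-id values in the
reports quoted below can be split into bus, device and function.  A minimal
sketch, assuming the standard 16-bit PCI routing ID layout (8 bits bus,
5 bits device, 3 bits function); split_bdf is a hypothetical helper name,
not something from the FMA tools:

```python
def split_bdf(value):
    """Split a 16-bit PCI routing ID into (bus, device, function)."""
    bus = (value >> 8) & 0xFF      # bits 15:8
    device = (value >> 3) & 0x1F   # bits 7:3
    function = value & 0x07        # bits 2:0
    return bus, device, function


# bdf / source-id values copied from the ereports quoted below
for raw in (0x12, 0x400, 0x508, 0x600):
    bus, dev, fn = split_bdf(raw)
    print(f"{raw:#06x} -> bus {bus}, device {dev}, function {fn}")
```

Under that assumption, source-id 0x600 is bus 6, device 0, function 0, which
is the bdf reported for the pci1000,3070@0 device at the far end of the PLX
switch.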

I'll work on diagnosing them.
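
As a first pass, the ce/ue status values in the ereports can be decoded by
hand.  A rough sketch, assuming the bit positions of the AER Correctable and
Uncorrectable Error Status registers and the Device Status register as laid
out in the PCI Express Base Specification (worth checking against the spec
itself); the tables are partial and decode() is a hypothetical helper:

```python
# Partial bit tables; positions are assumed from the PCIe Base Specification.
CE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}
UE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}
DEV_STATUS_BITS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}


def decode(value, names):
    """Return the names of the set bits, flagging bits missing from the table."""
    return [names.get(bit, f"unknown bit {bit}")
            for bit in range(32) if value & (1 << bit)]


# Values copied from the ereports quoted below
print("ce-status 0x2000   :", decode(0x2000, CE_BITS))
print("ce-status 0x1      :", decode(0x1, CE_BITS))
print("ue_status 0x100000 :", decode(0x100000, UE_BITS))
print("dev-status 0x9     :", decode(0x9, DEV_STATUS_BITS))
```

By this reading, ce-status 0x2000 is an Advisory Non-Fatal Error and 0x1 is a
Receiver Error, which lines up with the a-nonfatal and pl.re ereport classes,
and pcie_ue_status 0x100000 on the PLX switch is an Unsupported Request.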

Thanks!
-Chip

On Fri, Mar 6, 2015 at 2:02 PM, Garrett D'Amore <[email protected]> wrote:

> Looks like PCIe fabric errors to me.  I don’t have a PCIe manual in front
> of me to decode the actual errors, but my suspicion is that you have a
> flaky bus — I’d power everything off, check and reseat all the PCIe cards,
> although it looks like this is probably implicating a PLX PCIe switch, with
> an LSI SAS controller on the far end.  (Is it an add-in card?  If so,
> reseat it.)  If the switch itself is connected via cabling or mezzanine
> card or some such, check those too, and reseat.
>
> Failing that, I’d contact the system vendor.
>
> If someone here has access to the PCIe specifications, they can try to
> decode the various ue and ce error register values for you.
>
> - Garrett
>
>
> On Mar 6, 2015, at 11:45 AM, Schweiss, Chip via illumos-discuss <
> [email protected]> wrote:
>
>
>
> On Fri, Mar 6, 2015 at 10:48 AM, Robert Mustacchi <[email protected]> wrote:
>
>> On 3/6/15 8:43 , Schweiss, Chip via illumos-discuss wrote:
>> > I have two fairly new Haswell-based servers running OmniOS.  I have
>> > several faults from both systems that I don't know what they are or
>> > what to do about them.
>> >
>> > I am not seeing any issues related to these faults.
>> >
>> > Can anyone clarify what they are and what to do about them?
>>
>> We've received error reports that the system doesn't understand how to
>> diagnose. Here, getting the actual ereports that were generated on the
>> system and looking at them will shed more light on the problem and will
>> allow us to better understand what's happening on the systems.
>>
>>
> I'm not familiar with ereports.  After some googling, I'm assuming you
> mean the output from 'fmdump -eV'.
>
> Here are the reports that correspond to the first event.  If this is what
> you were asking for, I'll dig out the rest of them.
>
> Feb 27 2015 18:11:17.068478684 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0xe97c1b9f5a501401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2
>         (end detector)
>
>         bdf = 0x12
>         device_id = 0x2f06
>         vendor_id = 0x8086
>         rev_id = 0x2
>         dev_type = 0x40
>         pcie_off = 0x90
>         pcix_off = 0x0
>         aer_off = 0x148
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x47
>         pci_bdg_sec_status = 0x2000
>         pci_bdg_ctrl = 0x3
>         pcie_status = 0x0
>         pcie_command = 0x27
>         pcie_dev_cap = 0x8001
>         pcie_adv_ctl = 0x0
>         pcie_ue_status = 0x0
>         pcie_ue_mask = 0x100000
>         pcie_ue_sev = 0x62030
>         pcie_ue_hdr0 = 0x0
>         pcie_ue_hdr1 = 0x0
>         pcie_ue_hdr2 = 0x0
>         pcie_ue_hdr3 = 0x0
>         pcie_ce_status = 0x0
>         pcie_ce_mask = 0x0
>         pcie_rp_status = 0x0
>         pcie_rp_control = 0x0
>         pcie_adv_rp_status = 0x1
>         pcie_adv_rp_command = 0x7
>         pcie_adv_rp_ce_src_id = 0x600
>         pcie_adv_rp_ue_src_id = 0x0
>         remainder = 0x3
>         severity = 0x1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x414e6dc
>
> Feb 27 2015 18:11:17.068509897 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0xe97c1ba6ebb01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2/pci10b5,8724@0
>         (end detector)
>
>         bdf = 0x400
>         device_id = 0x8724
>         vendor_id = 0x10b5
>         rev_id = 0xca
>         dev_type = 0x50
>         pcie_off = 0x68
>         pcix_off = 0x0
>         aer_off = 0xfb4
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x147
>         pci_bdg_sec_status = 0x0
>         pci_bdg_ctrl = 0x3
>         pcie_status = 0x9
>         pcie_command = 0x37
>         pcie_dev_cap = 0x8004
>         pcie_adv_ctl = 0xbf
>         pcie_ue_status = 0x100000
>         pcie_ue_mask = 0x180000
>         pcie_ue_sev = 0x62030
>         pcie_ue_hdr0 = 0x0
>         pcie_ue_hdr1 = 0x0
>         pcie_ue_hdr2 = 0x0
>         pcie_ue_hdr3 = 0x0
>         pcie_ce_status = 0x2000
>         pcie_ce_mask = 0x0
>         remainder = 0x2
>         severity = 0x3
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x41560c9
>
> Feb 27 2015 18:11:17.068526093 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0xe97c1baaee901401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2/pci10b5,8724@0/pci10b5,8724@1
>         (end detector)
>
>         bdf = 0x508
>         device_id = 0x8724
>         vendor_id = 0x10b5
>         rev_id = 0xca
>         dev_type = 0x60
>         pcie_off = 0x68
>         pcix_off = 0x0
>         aer_off = 0xfb4
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x147
>         pci_bdg_sec_status = 0x0
>         pci_bdg_ctrl = 0x3
>         pcie_status = 0x0
>         pcie_command = 0x37
>         pcie_dev_cap = 0x8004
>         pcie_adv_ctl = 0xbf
>         pcie_ue_status = 0x0
>         pcie_ue_mask = 0x180000
>         pcie_ue_sev = 0x462030
>         pcie_ue_hdr0 = 0x0
>         pcie_ue_hdr1 = 0x0
>         pcie_ue_hdr2 = 0x0
>         pcie_ue_hdr3 = 0x0
>         pcie_ce_status = 0x0
>         pcie_ce_mask = 0x0
>         remainder = 0x1
>         severity = 0x1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x415a00d
>
> Feb 27 2015 18:11:17.068541905 ereport.io.pci.fabric
> nvlist version: 0
>         class = ereport.io.pci.fabric
>         ena = 0xe97c1baedbc01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2/pci10b5,8724@0/pci10b5,8724@1/pci1000,3070@0
>         (end detector)
>
>         bdf = 0x600
>         device_id = 0x87
>         vendor_id = 0x1000
>         rev_id = 0x5
>         dev_type = 0x0
>         pcie_off = 0x68
>         pcix_off = 0x0
>         aer_off = 0x100
>         ecc_ver = 0x0
>         pci_status = 0x10
>         pci_command = 0x146
>         pcie_status = 0x1
>         pcie_command = 0x2037
>         pcie_dev_cap = 0x10008025
>         pcie_adv_ctl = 0x0
>         pcie_ue_status = 0x0
>         pcie_ue_mask = 0x180000
>         pcie_ue_sev = 0x462031
>         pcie_ue_hdr0 = 0x4000001
>         pcie_ue_hdr1 = 0x122003
>         pcie_ue_hdr2 = 0x6010000
>         pcie_ue_hdr3 = 0xb70d8120
>         pcie_ce_status = 0x1
>         pcie_ce_mask = 0x0
>         remainder = 0x0
>         severity = 0x3
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x415ddd1
>
> Feb 27 2015 18:11:17.068478684 ereport.io.pciex.rc.ce-msg
> nvlist version: 0
>         ena = 0xe97c1b9f5a501401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2
>         (end detector)
>
>         class = ereport.io.pciex.rc.ce-msg
>         rc-status = 0x1
>         source-id = 0x600
>         source-valid = 1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x414e6dc
>
> Feb 27 2015 18:11:17.068509897 ereport.io.pciex.a-nonfatal
> nvlist version: 0
>         ena = 0xe97c1ba6ebb01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2/pci10b5,8724@0
>         (end detector)
>
>         class = ereport.io.pciex.a-nonfatal
>         dev-status = 0x9
>         ce-status = 0x2000
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x41560c9
>
> Feb 27 2015 18:11:17.068509897 ereport.io.pciex.rc.ce-msg
> nvlist version: 0
>         ena = 0xe97c1ba6ebb01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0
>         (end detector)
>
>         class = ereport.io.pciex.rc.ce-msg
>         rc-status = 0x1
>         source-id = 0x400
>         source-valid = 1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x41560c9
>
> Feb 27 2015 18:11:17.068541905 ereport.io.pciex.pl.re
> nvlist version: 0
>         ena = 0xe97c1baedbc01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci8086,2f06@2,2/pci10b5,8724@0/pci10b5,8724@1/pci1000,3070@0
>         (end detector)
>
>         class = ereport.io.pciex.pl.re
>         dev-status = 0x1
>         ce-status = 0x1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x415ddd1
>
> Feb 27 2015 18:11:17.068541905 ereport.io.pciex.rc.ce-msg
> nvlist version: 0
>         ena = 0xe97c1baedbc01401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0
>         (end detector)
>
>         class = ereport.io.pciex.rc.ce-msg
>         rc-status = 0x1
>         source-id = 0x600
>         source-valid = 1
>         __ttl = 0x1
>         __tod = 0x54f107a5 0x415ddd1
>
>
>
>
>
>
>> Robert
>>
>> > From host #1:
>> >
>> > --------------- ------------------------------------  -------------- ---------
>> > TIME            EVENT-ID                              MSG-ID         SEVERITY
>> > --------------- ------------------------------------  -------------- ---------
>> > Feb 27 18:11:19 3951b062-71f1-cccc-9fea-bbdc354f2603  SUNOS-8000-J0  Major
>> >
>> > Host        : mir-zfs01
>> > Platform    : SYS-6028U-TR4+    Chassis_id  : S16512424A07095
>> > Product_sn  :
>> >
>> > Fault class : defect.sunos.eft.unexpected_telemetry 50%
>> >               fault.sunos.eft.unexpected_telemetry 50%
>> > Problem in  : dev:////pci@0,0
>> >                   faulted and taken out of service
>> >
>> > Description : The diagnosis engine encountered telemetry from the listed
>> >               devices for which it was unable to perform a diagnosis -
>> >               Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
>> >               information.  Refer to http://illumos.org/msg/SUNOS-8000-J0
>> >               for more information.
>> >
>> > Response    : Error reports have been logged for examination by Sun.
>> >
>> > Impact      : Automated diagnosis and response for these events will not
>> >               occur.
>> >
>> > Action      : Ensure that the latest Solaris Kernel and Predictive
>> >               Self-Healing (PSH) patches are installed.
>> >
>> > --------------- ------------------------------------  -------------- ---------
>> > TIME            EVENT-ID                              MSG-ID         SEVERITY
>> > --------------- ------------------------------------  -------------- ---------
>> > Jan 15 21:53:07 2cb9f0e0-dd7f-c912-dd22-bbaa7a4ebf6c  SUNOS-8000-J0  Major
>> >
>> > Host        : mir-zfs01
>> > Platform    : SYS-6028U-TR4+    Chassis_id  : S16512424A07095
>> > Product_sn  :
>> >
>> > Fault class : defect.sunos.eft.unexpected_telemetry max 25%
>> >               fault.sunos.eft.unexpected_telemetry max 25%
>> > Affects     : cpu:///cpuid=6
>> >               cpu:///cpuid=16
>> >                   faulted but still in service
>> > FRU         : hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs01:chassis-id=S16512424A07095/motherboard=0/chip=0 25%
>> >               hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs01:chassis-id=S16512424A07095/motherboard=0/chip=1 25%
>> >                   faulty
>> >
>> > Description : The diagnosis engine encountered telemetry from the listed
>> >               devices for which it was unable to perform a diagnosis -
>> >               Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
>> >               information.  Refer to http://illumos.org/msg/SUNOS-8000-J0
>> >               for more information.
>> >
>> > Response    : Error reports have been logged for examination by Sun.
>> >
>> > Impact      : Automated diagnosis and response for these events will not
>> >               occur.
>> >
>> > Action      : Ensure that the latest Solaris Kernel and Predictive
>> >               Self-Healing (PSH) patches are installed.
>> >
>> >
>> > From host #2:
>> >
>> > --------------- ------------------------------------  -------------- ---------
>> > TIME            EVENT-ID                              MSG-ID         SEVERITY
>> > --------------- ------------------------------------  -------------- ---------
>> > Jan 31 12:45:54 0efc914b-7cc5-c4df-fd11-9be172d4931a  SUNOS-8000-J0  Major
>> >
>> > Host        : mir-zfs02
>> > Platform    : SYS-6028U-TR4+    Chassis_id  : S16512424A07109
>> > Product_sn  :
>> >
>> > Fault class : defect.sunos.eft.unexpected_telemetry 50%
>> >               fault.sunos.eft.unexpected_telemetry 50%
>> > Problem in  : dev:////pci@74,0
>> >                   faulted and taken out of service
>> >
>> > Description : The diagnosis engine encountered telemetry from the listed
>> >               devices for which it was unable to perform a diagnosis -
>> >               Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
>> >               information.  Refer to http://illumos.org/msg/SUNOS-8000-J0
>> >               for more information.
>> >
>> > Response    : Error reports have been logged for examination by Sun.
>> >
>> > Impact      : Automated diagnosis and response for these events will not
>> >               occur.
>> >
>> > Action      : Ensure that the latest Solaris Kernel and Predictive
>> >               Self-Healing (PSH) patches are installed.
>> > --------------- ------------------------------------  -------------- ---------
>> > TIME            EVENT-ID                              MSG-ID         SEVERITY
>> > --------------- ------------------------------------  -------------- ---------
>> > Dec 04 15:22:09 6020baed-5ab6-cdb0-95c0-ed3f9fde1172  SUNOS-8000-J0  Major
>> >
>> > Host        : mir-zfs02
>> > Platform    : SYS-6028U-TR4+    Chassis_id  : S16512424A07109
>> > Product_sn  :
>> >
>> > Fault class : fault.sunos.eft.unexpected_telemetry max 25%
>> >               defect.sunos.eft.unexpected_telemetry max 25%
>> > Affects     : cpu:///cpuid=41
>> >                   ok and in service
>> >               cpu:///cpuid=26
>> >                   faulted but still in service
>> > FRU         : hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs02:chassis-id=S16512424A07109/motherboard=0/chip=1 25%
>> >                   acquitted
>> >               hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs02:chassis-id=S16512424A07109/motherboard=0/chip=0 25%
>> >                   faulty
>> >
>> > Description : The diagnosis engine encountered telemetry from the listed
>> >               devices for which it was unable to perform a diagnosis -
>> >               Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
>> >               information.  Refer to http://illumos.org/msg/SUNOS-8000-J0
>> >               for more information.
>> >
>> > Response    : Error reports have been logged for examination by Sun.
>> >
>> > Impact      : Automated diagnosis and response for these events will not
>> >               occur.
>> >
>> > Action      : Ensure that the latest Solaris Kernel and Predictive
>> >               Self-Healing (PSH) patches are installed.
>> >
>> > --------------- ------------------------------------  -------------- ---------
>> > TIME            EVENT-ID                              MSG-ID         SEVERITY
>> > --------------- ------------------------------------  -------------- ---------
>> > Dec 04 18:55:38 eadd4984-7c7a-490b-f6e1-b0f936b09ab7  SUNOS-8000-J0  Major
>> >
>> > Host        : mir-zfs02
>> > Platform    : SYS-6028U-TR4+    Chassis_id  : S16512424A07109
>> > Product_sn  :
>> >
>> > Fault class : fault.sunos.eft.unexpected_telemetry max 25%
>> >               defect.sunos.eft.unexpected_telemetry max 25%
>> > Affects     : cpu:///cpuid=6
>> >               cpu:///cpuid=18
>> >                   faulted but still in service
>> > FRU         : hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs02:chassis-id=S16512424A07109/motherboard=0/chip=0 25%
>> >               hc://:product-id=SYS-6028U-TR4+:server-id=mir-zfs02:chassis-id=S16512424A07109/motherboard=0/chip=1 25%
>> >                   faulty
>> >
>> > Description : The diagnosis engine encountered telemetry from the listed
>> >               devices for which it was unable to perform a diagnosis -
>> >               Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
>> >               information.  Refer to http://illumos.org/msg/SUNOS-8000-J0
>> >               for more information.
>> >
>> > Response    : Error reports have been logged for examination by Sun.
>> >
>> > Impact      : Automated diagnosis and response for these events will not
>> >               occur.
>> >
>> > Action      : Ensure that the latest Solaris Kernel and Predictive
>> >               Self-Healing (PSH) patches are installed.
>> >
>> >
>> >
>> > -------------------------------------------
>> > illumos-discuss
>> > Archives: https://www.listbox.com/member/archive/182180/=now
>> > RSS Feed:
>> https://www.listbox.com/member/archive/rss/182180/21175748-6cf9d6b5
>> > Modify Your Subscription: https://www.listbox.com/member/?&;
>> > Powered by Listbox: http://www.listbox.com
>> >
>>
>>
>
>
>



