I suspect memory errors on my Sol 10 u8 system, but "fmdump -eV" reports no 
memory errors.  All the errors and events it shows are ZFS related.

The initial symptom is that a scrub started on a freshly booted system 
completes properly, but the same operation after the system has been up for a 
few days causes a kernel panic.  Immediately after a reboot, a scrub again 
completes normally.  This behavior suggests bit fade to me.

This has been very consistent for the last few months.  The system is an HP 
Z400 which is 10 years old and has generally run 24x7.  It was certified by Sun 
for Solaris 10, which is why I bought it, and it uses unbuffered, unregistered 
ECC DDR3 DIMMs.  Since my initial purchase I have bought three more Z400s.

Recently the system became unstable to the point that I have not been able to 
complete a "zfs send -R" to a 12 TB WD USB drive.  My last attempt, using a 
Hipster LiveImage, died after ~25 hours.

My Hipster 2017.10 system shows some events which appear to be ECC related, but 
I'm not able to interpret them.  I've attached a file with the last such event. 
I'm not sure the attachment will work, but it is worth trying.  This is from my 
regular internet access host, so it is up 24x7 with few exceptions.

Except for the CPU and memory, the machines are almost identical.  The Hipster 
machine is an older 4 DIMM slot machine with the same 3 way mirror on s0 and 3 
disk RAIDZ1 on s1.  The Sol 10 system is a 6 DIMM slot model and has a 3 TB 
mirrored scratch pool in addition to the s0 & s1 root and export pools.

It seems unlikely that I could simply swap the disks between the two, but I can 
install Hipster on a single drive for rpool, attempt to copy the scratch pool 
(spool) with that, and simply run it for a while as a test.

I've read everything I can find about the Fault Manager, but it has produced 
more questions than answers.

This is for Hipster 2017.10:

sun_x86%rhb {82} fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-lights              1.0     active  Disk Lights Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.2     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
ses-log-transport        1.0     active  SES Log Transport Agent
software-diagnosis       0.1     active  Software Diagnosis engine
software-response        0.1     active  Software Response Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.1     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent


The module list is a little longer than on Sol 10 u8, but cpumem-retire version 
1.1 appears on both.

Suggestions?

Thanks,
Reg

Feb 26 2018 07:45:04.212790281 ereport.cpu.intel.quickpath.mem_ce
nvlist version: 0
        class = ereport.cpu.intel.quickpath.mem_ce
        ena = 0x98a2158a97802001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        compound_errorname = MC_CH_RD_ERR
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0x8
        bank_msr_offset = 0x420
        IA32_MCi_STATUS = 0x8c0000400001009f
        overflow = 0
        error_uncorrected = 0
        error_enabled = 0
        processor_context_corrupt = 0
        error_code = 0x9f
        model_specific_error_code = 0x1
        threshold_based_error_status = No tracking
        IA32_MCi_ADDR = 0xc28d2b40
        IA32_MCi_MISC = 0xe6323d8000010885
        ECC-syndrome = 0xe6323d80
        physaddr = 0xc28d2b40
        resource = (array of embedded nvlists)
        (start resource[0])
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])
                (start hc-list[3])
                nvlist version: 0
                        hc-name = dram-channel
                        hc-id = 0
                (end hc-list[3])
                (start hc-list[4])
                nvlist version: 0
                        hc-name = dimm
                        hc-id = 1
                (end hc-list[4])

                hc-specific = (embedded nvlist)
                nvlist version: 0
                        offset = 0xffffffffffffffff
                (end hc-specific)

        (end resource[0])

        mem_cor_ecc_counter = 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
        mem_cor_ecc_counter_last = 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
        __ttl = 0x1
        __tod = 0x5a940f60 0xcaeec09
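
As a cross-check on the decoded fields in the ereport above, the raw 
IA32_MCi_STATUS value can be split by hand.  The following is only a minimal 
sketch: the bit positions and the "0000 0000 1MMM CCCC" memory controller code 
format are assumed from the architectural machine-check layout in the Intel 
SDM, the corrected error count in bits 52:38 is only meaningful on CPUs that 
implement it, and the function names are hypothetical, not part of fmd.

```python
# Sketch: decode the IA32_MCi_STATUS value reported in the ereport.
# Bit positions are an assumption taken from the Intel SDM architectural
# machine-check layout; they are not read from fmd or the kernel.

STATUS = 0x8c0000400001009f  # IA32_MCi_STATUS from the ereport

# MMM field of a memory controller error code (assumed SDM encoding)
MEM_REQUEST_TYPES = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    def bit(n):
        return (status >> n) & 1

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "error_uncorrected": bit(61),
        "error_enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "processor_context_corrupt": bit(57),
        # Bits 52:38: corrected error count, where the CPU supports it
        "corrected_error_count": (status >> 38) & 0x7FFF,
        "model_specific_error_code": (status >> 16) & 0xFFFF,
        "error_code": status & 0xFFFF,
    }


def decode_memory_controller_code(code):
    """Interpret an MCA error code of the form 0000 0000 1MMM CCCC.

    Returns (request type, channel), with channel None when the code says
    "channel not specified", or None if the code is not of this form.
    Bit 12 (the filtering bit) is ignored.
    """
    if (code & 0xEF80) != 0x0080:
        return None
    request = MEM_REQUEST_TYPES.get((code >> 4) & 0x7, "reserved")
    channel = code & 0xF
    return request, (None if channel == 0xF else channel)


if __name__ == "__main__":
    fields = decode_mci_status(STATUS)
    for name, value in fields.items():
        print(f"{name:28s} {value:#x}")
    print(decode_memory_controller_code(fields["error_code"]))
```

Under those assumptions the result agrees with what fmdump printed: error_code 
0x9f, model_specific_error_code 0x1, and the overflow, error_uncorrected, 
error_enabled and processor_context_corrupt bits all clear.  Code 0x9f decodes 
as a memory read error with the channel not specified, which is consistent 
with compound_errorname = MC_CH_RD_ERR on a corrected (mem_ce) event.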

_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
https://openindiana.org/mailman/listinfo/openindiana-discuss
