On 2021-03-16 12:15, Judah Richardson wrote:
AFAIK a scrub or ECC error shouldn't crash the kernel. Also, if the crash
occurs at the moment of the error, the error might not get logged. To me it
sounds like you might have a system board issue.
I felt compelled to chime in on his last post: it looked suspiciously
like it might be the PSU. Over time the diodes in the rectifier bridge(s)
start leaking AC, which can cause somewhat erratic behavior in the OS and
applications. Resistance also increases in the PSU's circuits, resulting
in lower-than-required voltage.
If he is able to, simply swapping the PSU with one from one of his other
"working" Z400s would be an easy way to confirm.
Just thought I'd mention it. :-)
--Chris
Also FWIW you shouldn't have to scrub otherwise healthy pools more than
once per month.
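If it helps, a monthly scrub is easy to automate from root's crontab. A sketch, assuming a pool named "tank" (the pool name is hypothetical; adjust to taste):

```shell
# Scrub the pool "tank" (hypothetical name) at 02:00 on the 1st of each month.
# Install with: crontab -e  (as root)
0 2 1 * * /usr/sbin/zpool scrub tank
```

`zpool status tank` afterwards shows when the last scrub ran and whether it found anything.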
On Tue, Mar 16, 2021, 14:08 Reginald Beardsley via openindiana-discuss <
openindiana-discuss@openindiana.org> wrote:
I suspect memory errors on my Sol 10 u8 system, but there are no memory
errors reported by "fmdump -eV". All the errors and events are zfs
related.
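One way to double-check that no memory ereports are hiding among the zfs ones is to survey the error log by class instead of reading the full -eV output. A sketch (fmdump's -e and -t options are standard; `ereport.cpu` is the usual prefix for CPU/memory ereports on x86, though the exact class names vary by platform):

```shell
# Quick survey: count error-log events by class.
# "fmdump -e" prints TIME and CLASS; the class is the last field.
fmdump -e | awk '{print $NF}' | sort | uniq -c | sort -rn

# Look specifically for CPU/memory ereports since a given date.
fmdump -e -t 01Mar21 | grep -i 'ereport.cpu'
```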
The initial symptom is that a scrub started on a freshly booted system
completes properly, but the same operation after the system has been up for
a few days causes a kernel panic. Immediately after a reboot a scrub will
again complete normally. This behavior suggests bit fade to me.
This has been very consistent for the last few months. The system is an
HP Z400, which is 10 years old and has generally run 24x7. It was certified
by Sun for Solaris 10, which is why I bought it, and it uses unbuffered
(unregistered) ECC DDR3 DIMMs. Since my initial purchase I have bought three
more Z400s.
Recently the system became unstable to the point I have not been able to
complete a "zfs send -R" to a 12 TB WD USB drive. My last attempt using a
Hipster LiveImage died after ~25 hours.
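When a long recursive send keeps dying partway through, one way to limit the damage is to send each dataset's snapshot individually, so a mid-stream failure only loses one dataset's worth of work rather than ~25 hours. A sketch, assuming a source pool "tank", a recursive snapshot "@backup", and a receiving pool "usbpool" on the USB drive (all names hypothetical):

```shell
# Snapshot everything once, then send each filesystem separately.
zfs snapshot -r tank@backup
for fs in $(zfs list -H -o name -r tank); do
    # -u: don't mount the received filesystems on the backup pool.
    zfs send "${fs}@backup" | zfs receive -u "usbpool/${fs}" \
        || echo "send failed for ${fs}" >&2
done
```

Because `zfs list -r` emits parents before children, each receive target's parent dataset already exists by the time it is needed. A failure message then points at the specific dataset that trips the panic.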
My Hipster 2017.10 system shows some events which appear to be ECC
related, but I'm not able to interpret them. I've attached a file with the
last such event; not sure the attachment will make it through, but it's
worth trying. This is my regular internet access host, so it is up 24x7
with few exceptions.
Except for the CPU and memory, the machines are almost identical. The
Hipster machine is an older 4 DIMM slot machine with the same 3 way mirror
on s0 and 3 disk RAIDZ1 on s1. The Sol 10 system is a 6 DIMM slot model
and has a 3 TB mirrored scratch pool in addition to the s0 & s1 root and
export pools.
It seems unlikely that I could simply swap the disks between the two, but
I can install Hipster on a single drive for rpool, attempt to copy the
scratch pool (spool) with that, and simply run it for a while as a test.
I've read everything I can find about the Fault Manager, but it has
produced more questions than answers.
This is for Hipster 2017.10:
sun_x86%rhb {82} fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-lights              1.0     active  Disk Lights Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.2     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
ses-log-transport        1.0     active  SES Log Transport Agent
software-diagnosis       0.1     active  Software Diagnosis engine
software-response        0.1     active  Software Response Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.1     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent
The list is a little longer than on Sol 10 u8, but cpumem-retire v1.1
appears on both.
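Whether cpumem-retire has actually done anything can be checked with fmstat, which shows per-module event counts, and fmadm faulty, which lists the faults the Fault Manager currently considers present:

```shell
# Per-module statistics; a nonzero event count for cpumem-retire
# means it has actually processed memory-fault events.
fmstat

# List all faults, including repaired ones.
fmadm faulty -a
```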
Suggestions?
Thanks,
Reg
--
~10yrs a FreeBSD maintainer of ~160 ports
~40yrs of UNIX
_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
https://openindiana.org/mailman/listinfo/openindiana-discuss