On Jan 23, 2010, at 5:06 AM, Simon Breden wrote:

> Thanks a lot.
> 
> I'd looked at SO many different RAID boxes and never had a good feeling about 
> them from the standpoint of data safety, so when I read the 'A Conversation with 
> Jeff Bonwick and Bill Moore – The future of file systems' article 
> (http://queue.acm.org/detail.cfm?id=1317400), I was convinced that ZFS 
> sounded like what I needed, and thought I'd try to help others see how good 
> ZFS was and how to make their own home systems that work. Publishing the 
> notes as articles had the side-benefit of allowing me to refer back to them 
> when I was reinstalling a new SXCE build etc afresh... :)
> 
> It's good to see that you've been able to set the error reporting time using 
> HDAT2 for your Samsung HD154UI drives, but it is a pity that the change does 
> not persist through cold starts.
> 
> From a brief look, it looks like the utility runs under DOS, so I wonder 
> if it would be possible to convert the code into C and run it immediately 
> after OpenSolaris has booted? That would seem a reasonable automated 
> workaround. I might take a little look at the code.
> 
> However, the big questions still remain:
> 1. Does ZFS benefit from shorter error reporting times?

In general, any system that detects and acts upon faults would like
to detect faults sooner rather than later.

> 2. Does having shorter error reporting times provide any significant data 
> safety benefit by, for example, preventing ZFS from kicking a drive out of a vdev?

On Solaris, ZFS doesn't kick out drives; FMA does.

You can see the currently loaded diagnosis engines using "pfexec fmadm config":
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.1     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent
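
To see what, if anything, has actually been diagnosed as faulty (this is
where a retired disk would show up), use:
        pfexec fmadm faulty
Empty output means FMA currently has no active fault diagnoses.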

Diagnosis engines relevant to ZFS include:
        disk-transport: diagnoses SMART reports
        fabric-xlate: translates PCI, PCI-X, PCI-E, and bridge reports
        zfs-diagnosis: notifies FMA when checksum, I/O, and probe failure
                errors are found by ZFS activity; it also properly handles
                errors resulting from device removal
        zfs-retire: manages hot spares for ZFS pools
        io-retire: retires a device which was diagnosed as faulty (NB: this
                may happen at the next reboot)
        snmp-trapgen: you do configure SNMP traps, right? :-)
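
If you are curious about the raw telemetry these engines consume, you can
dump the logs (the exact ereport classes you see will depend on your
hardware and drivers):
        pfexec fmdump -eV    # the error report (ereport) log, verbose
        pfexec fmdump -v     # the fault log, i.e. the diagnoses produced
ZFS-related events show up under the ereport.fs.zfs.* classes.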

Drivers, such as sd/ssd, can send FMA telemetry which will feed the diagnosis
engines.
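
A quick way to check that telemetry is flowing, and which modules are
handling it, is "fmstat", which prints per-module event counters; a non-zero
ev_recv next to zfs-diagnosis, for example, means ZFS ereports have been
received:
        pfexec fmstat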

> Those are the questions I would like to hear somebody give an authoritative 
> answer to.

This topic is broader than ZFS. For example, a disk which has both a UFS and
a ZFS file system could be diagnosed by UFS activity and retired, which would
also affect the ZFS pool that uses the disk. Similarly, the disk-transport
agent can detect overtemp errors, for which retirement is a corrective
action. For more info, visit the FMA community:
http://hub.opensolaris.org/bin/view/Community+Group+fm/

As for an "authoritative answer," UTSL.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common
 -- richard
