On Jan 23, 2010, at 5:06 AM, Simon Breden wrote:

> Thanks a lot.
>
> I'd looked at SO many different RAID boxes and never had a good feeling about
> them from the point of data safety, that when I read the 'A Conversation with
> Jeff Bonwick and Bill Moore – The future of file systems' article
> (http://queue.acm.org/detail.cfm?id=1317400), I was convinced that ZFS
> sounded like what I needed, and thought I'd try to help others see how good
> ZFS was and how to make their own home systems that work. Publishing the
> notes as articles had the side-benefit of allowing me to refer back to them
> when I was reinstalling a new SXCE build etc afresh... :)
>
> It's good to see that you've been able to set the error reporting time using
> HDAT2 for your Samsung HD154UI drives, but it is a pity that the change does
> not persist through cold starts.
>
> From a brief look, it looks like the utility runs under DOS, so I wonder
> if it would be possible to convert the code into C and run it immediately
> after OpenSolaris has booted? That would seem a reasonable automated
> workaround. I might take a little look at the code.
>
> However, the big questions still remain:
> 1. Does ZFS benefit from shorter error reporting times?

In general, any system which detects and acts upon faults would like to detect
faults sooner rather than later.

> 2. Does having shorter error reporting times provide any significant data
> safety through, for example, preventing ZFS from kicking a drive from a vdev?

On Solaris, ZFS doesn't kick out drives, FMA does.

You can see the currently loaded diagnosis engines using "pfexec fmadm config"

MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.1     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent

Diagnosis engines relevant to ZFS include:
    disk-transport: diagnoses SMART reports
    fabric-xlate: translates PCI, PCI-X, PCI-E, and bridge reports
    zfs-diagnosis: notifies FMA when checksum, IO, and probe failure errors
        are found by ZFS activity. It also properly handles errors as a
        result of device removal.
    zfs-retire: manages hot spares for ZFS pools
    io-retire: retires a device which was diagnosed as faulty (NB may happen
        at next reboot)
    snmp-trapgen: you do configure SNMP traps, right? :-)

Drivers, such as sd/ssd, can send FMA telemetry which will feed the diagnosis
engines.

> Those are the questions I would like to hear somebody give an authoritative
> answer to.

This topic is broader than ZFS. For example, a disk which has both a UFS and
ZFS file system could be diagnosed by UFS activity and retired, which would
also affect the ZFS pool that uses the disk.
Similarly, the disk-transport agent can detect overtemp errors, for which a
retirement is a corrective action.

For more info, visit the FMA community:
http://hub.opensolaris.org/bin/view/Community+Group+fm/

As for an "authoritative answer," UTSL.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common

 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
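
For anyone who wants to check which of the ZFS-relevant engines named in this message are loaded, here is a minimal sketch in Python. It parses the "fmadm config" listing quoted above, which is embedded as sample text so the script runs anywhere. The function names and the ZFS_RELEVANT set are illustrative and not part of FMA, and the parser assumes whitespace-separated columns with the description last, which may not hold for every release.

```python
# Sample text copied from the "fmadm config" listing quoted in this message.
SAMPLE = """\
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.1     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent
"""

# Modules this message lists as relevant to ZFS (illustrative constant).
ZFS_RELEVANT = {
    "disk-transport",
    "fabric-xlate",
    "zfs-diagnosis",
    "zfs-retire",
    "io-retire",
    "snmp-trapgen",
}


def parse_fmadm_config(text):
    """Return (module, version, status, description) tuples, skipping the header.

    Assumption: columns are whitespace-separated and the description is
    everything after the third column.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] == "MODULE":
            continue
        rows.append(tuple(fields))
    return rows


def zfs_relevant_active(text):
    """Return names of ZFS-relevant modules whose status is 'active'."""
    return [
        module
        for module, _version, status, _description in parse_fmadm_config(text)
        if module in ZFS_RELEVANT and status == "active"
    ]


if __name__ == "__main__":
    for name in zfs_relevant_active(SAMPLE):
        print(name)
```

On a live system the same functions could presumably be fed the captured output of "pfexec fmadm config" instead of SAMPLE; that has not been tested here.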