On Fri, Sep 12, 2008 at 03:34:30PM +0100, Karl Pielorz wrote:
> --On 12 September 2008 06:21 -0700 Jeremy Chadwick <[EMAIL PROTECTED]>
> wrote:
>
>> As far as I know, there is no such "standard" mechanism in FreeBSD. If
>> the drive falls off the bus entirely (e.g. detached), I would hope ZFS
>> would notice that. I can imagine it (might) also depend on if the disk
>> subsystem you're using is utilising CAM or not (e.g. disks should be daX
>> not adX); Scott Long might know if something like this is implemented in
>> CAM. I'm fairly certain nothing like this is implemented in ata(4).
>
> For ATA, at the moment - I don't think it'll notice even if a drive
> detaches. I think like my system the other day, it'll just keep issuing
> I/O commands to the drive, even if it's disappeared (it might get much
> 'quicker failures' if the device has 'gone' to the point of FreeBSD just
> quickly returning 'fail' for every request).

I know ATA will notice a detached channel, because I myself have done
it: administratively, that is -- atacontrol detach ataX. But the only
time that can happen "automatically" is if the actual controller does
so itself, or if FreeBSD is told to do it administratively.

What this does to other parts of the kernel and userland applications is
something I haven't tested. I *can* tell you that there are major, major
problems with detach/reattach/reinit on ata(4) causing kernel panics and
other such things. I've documented this quite thoroughly in my "Common
FreeBSD issues" wiki:

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

I am also very curious to know the exact brand/model of 8-port SATA
controller from Supermicro you are using, *especially* if it uses ata(4)
rather than CAM and da(4). Such Supermicro controllers were recently
discussed on freebsd-stable (or was it -hardware?), and no one was able
to reach a firm conclusion as to whether they were decent or even
remotely trustworthy. Supermicro provides a few different SATA HBAs.

>> Ideally, it would be the job of the controller and controller driver to
>> announce to underlying I/O operations fail/success. Do you agree?
>>
>> I hope this "FMA Engine" on Solaris only *tells* underlying pieces of
>> I/O errors, rather than acting on them (e.g. automatically yanking the
>> disk off the bus for you). I'm in no way shunning Solaris, I'm simply
>> saying such a mechanism could be as risky/deadly as it could be useful.
>
> Yeah, I guess so - I think the way it's meant to happen (and this is only
> AFAIK) is that FMA 'detects' a failing drive by applying some
> configurable policy to it. That policy would also include notifying ZFS,
> so that ZFS could then decide to stop issuing I/O commands to that
> device.

It sounds like that is done very differently than on FreeBSD.
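
To make the "inform, don't act" idea concrete, here is a minimal sketch
in Python of the kind of configurable policy being described. It is
purely illustrative: the class name, the threshold, the time window, and
the callback are all hypothetical, and this is not how FMA, ZFS, or
anything in FreeBSD is actually implemented. The point is only that the
policy flags a device and tells consumers about it; it never detaches
anything from the bus.

```python
from collections import defaultdict, deque


class ErrorPolicy:
    """Hypothetical fault policy: flag a device after too many errors in a window."""

    def __init__(self, max_errors, window_seconds, notify):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.notify = notify              # advisory callback; never detaches anything
        self.errors = defaultdict(deque)  # device name -> recent error timestamps
        self.faulted = set()

    def record_error(self, device, now):
        """Record one I/O error at time `now` (seconds); return True if faulted."""
        recent = self.errors[device]
        recent.append(now)
        # Drop errors that have aged out of the sliding window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_errors and device not in self.faulted:
            self.faulted.add(device)
            self.notify(device, len(recent))
        return device in self.faulted

    def should_issue_io(self, device):
        """Consumers (a filesystem, say) ask this before sending more I/O."""
        return device not in self.faulted


notifications = []
policy = ErrorPolicy(max_errors=3, window_seconds=60,
                     notify=lambda dev, count: notifications.append((dev, count)))

# Two errors far apart do not trip the policy...
policy.record_error("ad4", now=0)
policy.record_error("ad4", now=100)
# ...but three errors within 60 seconds do.
policy.record_error("ad4", now=110)
policy.record_error("ad4", now=120)
```

In this sketch the consumer decides what to do once should_issue_io()
returns False; the disk and channel are left attached.
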
If such a condition happens on FreeBSD (disk errors scrolling by, etc.),
the only way I know of to get FreeBSD to stop sending commands through
the ATA subsystem is to detach the channel (atacontrol detach ataX).

> None of this seems to be in place, at least for ATA under FreeBSD - when
> a drive goes bad, you can just end up with 'hours' worth of I/O timeouts,
> until someone intervenes.

I can see the usefulness in Solaris's FMA thing. My big concern is
whether or not FMA actually pulls the disk off the channel, or if it
just leaves the disk/channel connected and simply informs kernel pieces
not to use it. If it pulls the disk off the channel, I have serious
qualms with it.

There are also chips on SATA and SCSI controllers which can cause chaos
as well -- specifically, SES/SES2 chips (I'm looking at you, QLogic).
These are supposed to be "smart chips" that detect when there are a
large number of transport or hardware errors (implying cabling issues,
etc.) and *automatically* yank the disk off the bus. Sounds great on
paper, but in the field, I see these chips pulling disks off the bus,
changing SCSI IDs on devices, or inducing what appear to be full SCSI
subsystem timeouts (e.g. the SES/SES2 chip has locked up/crashed in some
way, and now your entire bus is dead in the water).

I have seen all of the above bugs with onboard Adaptec 320 controllers,
on systems running Solaris 8, 9, and OpenSolaris. Most times it turns
out to be the SES/SES2 chip getting in the way.

> I did enquire on the Open Solaris list about setting limits for 'errors'
> in ZFS, which netted me a reply that it's FMA (at least in Solaris)
> that's responsible for this - it just then informs ZFS of the condition.
> We don't appear (again at least for ATA) to have anything similar for
> FreeBSD yet :(

My recommendation to people these days is to avoid ata(4) on FreeBSD at
all costs if they expect to encounter disk or hardware failures.
The ata(4) layer is in no way, shape, or form reliable in the case of
transport or disk failures, and even sometimes in the case of
hot-swapping. Try your hardest to find a physical controller that
supports SATA disks and uses CAM/da(4), which WILL provide that
reliability. I know Areca controllers do this, and Areca is very
FreeBSD-friendly.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |