Scara Maccai wrote:
>> Oh, and regarding the original post -- as several readers correctly
>> surmised, we weren't faking anything, we just didn't want to wait
>> for all the device timeouts. Because the disks were on USB, which
>> is a hotplug-capable bus, unplugging the dead disk generated an
>> interrupt that bypassed the timeout. We could have waited it out,
>> but 60 seconds is an eternity on stage.
>>
>
> I'm sorry, I didn't mean to sound offensive. Anyway, I think people
> should know that their drives can stall the system for minutes,
> "despite" ZFS. I mean: there is a lot written about how great ZFS is
> at recovering when a drive fails, but nothing about this problem. I
> know now it's not ZFS's fault, but I wonder how many people set up
> their drives with ZFS assuming that "as soon as something goes bad,
> ZFS will fix it".
> Is there any way to test these cases other than smashing the drive
> with a hammer? Having a failover policy where the failover can't be
> tested sounds scary...
>
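One hammer-free option is to exercise the ZFS error paths with file-backed vdevs. A minimal sketch, assuming a Solaris-style host with ZFS and root access; the pool name "testpool" and the file paths are arbitrary choices for this example:

```shell
# Build a small mirror out of file-backed vdevs (no real disks at risk).
mkfile 128m /var/tmp/vdev0 /var/tmp/vdev1
zpool create testpool mirror /var/tmp/vdev0 /var/tmp/vdev1

# Simulate silent corruption on one side of the mirror,
# seeking past the front label so the pool stays importable.
dd if=/dev/urandom of=/var/tmp/vdev0 bs=1024k seek=16 count=64 conv=notrunc

# A scrub should detect the damage and repair it from the healthy side.
zpool scrub testpool
zpool status testpool   # checksum errors should be charged to vdev0

# Exercise the device-loss and reattach paths as well.
zpool offline testpool /var/tmp/vdev1
zpool online testpool /var/tmp/vdev1

# Clean up.
zpool destroy testpool
rm /var/tmp/vdev0 /var/tmp/vdev1
```

Note the limitation: this exercises ZFS's detection and self-healing, but it cannot reproduce the driver-timeout stalls being discussed here, because a file vdev never hangs the way failing real hardware can.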
It is with this idea in mind that I wrote part of Chapter 1 of the book
Designing Enterprise Solutions with Sun Cluster 3.0. For convenience, I
also published Chapter 1 as a Sun BluePrints Online article:

http://www.sun.com/blueprints/1101/clstrcomplex.pdf

False positives are very expensive in highly available systems, so we
really do want to avoid them. One thing we can do, and I've already
(again [1]) started down the path to document, is to show where the
various (common) timeouts in the system are and how they interact. Once
you know how sd, cmdk, dbus, and friends work, you can make better
decisions about where to look when the behaviour is not what you expect.
But this is a very tedious path, because there are many different
failure modes, and real-world devices can react ambiguously when they
fail.

[1] We developed a method to benchmark cluster dependability. The
description of the benchmark was published in several papers and is now
available in the new IEEE book on Dependability Benchmarking. This is
really the first book of its kind, and a first step toward making
dependability benchmarks more mainstream. Anyway, the work done for
that effort included methods to improve failure detection and handling,
so we have a detailed understanding of those things for SPARC, in lab
form. Expanding that work to cover the random-device-bought-at-Frys
will be a substantial undertaking. Co-conspirators welcome.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss