On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> I'm getting a bit tired of people designing for fast resilvering.
It is a design consideration regardless, though your point is valid that it shouldn't be the overriding consideration.

To the original question and poster: this often arises out of another kind of consideration, that of the size of a "failure unit". When plugging together systems at almost any scale beyond a handful of disks, there are many kinds of groupings of disks whereby the whole group may disappear if a certain component fails: controllers, power supplies, backplanes, cables, network/fabric switches, etc. The probabilities of these failures vary, often greatly, but they can shape and constrain a design.

I'm going to choose a deliberately exaggerated example, to illustrate the discussion and recommendations in the thread, using the OP's numbers. Let's say I have 20 little 5-disk NAS boxes, each with its own single power supply and NIC. Each is an iSCSI target, and can serve up either 5 bare-disk LUNs, or a single LUN for the whole box, backed by internal RAID. Internal RAID can be 0 or 5. Clearly, a box of 5 disks is an independent failure unit, with non-trivial probability via a variety of possible causes; I'd better plan my pool accordingly.

The first option is to "simplify" the configuration, representing the obvious failure unit as a single LUN - just a big disk. There is merit in simplicity, especially for the humans involved if they're not sophisticated and experienced ZFS users (or else why would they be asking these questions?). This may prevent confusion and possible mistakes (at 3am under pressure, even experienced admins make those).

This gives us 20 "disks" to make a pool, of whatever layout suits our performance and resiliency needs. Regardless of what disks are used, a 20-way raidz is unlikely to be a good answer; 2x 10-way raidz2, 4x 5-way raidz1, or 2-way and 3-way mirrors might all be useful depending on circumstances. (As an aside, mirrors might be the layout of choice if switch failures are also to be taken into consideration, for practical network topologies.)

The second option is to give ZFS all the disks individually. We embed our knowledge of the failure domains into the pool structure, choosing which disks go in which vdev accordingly. The simplest expression of this is to take the same layout we chose above for 20 big disks and build it 5 times over, once for each of the 5 disk slots, with each instance contributing top-level vdevs in the same pattern. Think of making 5 separate pools with the same layout as the previous case, and stacking them together into one. (As another aside, in previous discussions I've also recommended considering multiple pools vs. multiple vdevs; that still applies, but I won't reiterate it here.)

If our pool had enough redundancy for our needs before, we now have 5 times as many top-level vdevs, which will experience tolerable failures in groups of 5 if a disk box dies, for the same overall result.

ZFS generally does better this way. We get more direct concurrency, because ZFS's device tree maps to spindles rather than to a more complex interaction of underlying components. Physical disk failures can now be seen by ZFS as such, and don't get amplified into whole-LUN failures (RAID0) or performance degradations during internal reconstruction (RAID5). ZFS will prefer not to allocate new data on a degraded vdev until it is repaired, but it needs to know about the degradation in the first place.
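To make the comparison concrete, here's roughly what the two pool creations could look like. This is only a sketch: the b01..b20 and bNNsM device aliases are invented for illustration, standing in for whatever device paths the iSCSI LUNs actually get on your initiator.

    # Option 1: one internal-RAID LUN per box, 2x 10-way raidz2.
    zpool create tank \
        raidz2 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 \
        raidz2 b11 b12 b13 b14 b15 b16 b17 b18 b19 b20

    # Option 2: 5 bare-disk LUNs per box; the same pattern repeated
    # once per disk slot, giving 10x 10-way raidz2.  bNNsM = box NN,
    # slot M.  Each vdev takes one disk from each of ten boxes.
    vdevs=""
    for s in 1 2 3 4 5; do
        for grp in "01 02 03 04 05 06 07 08 09 10" \
                   "11 12 13 14 15 16 17 18 19 20"; do
            vdevs="$vdevs raidz2"
            for b in $grp; do vdevs="$vdevs b${b}s${s}"; done
        done
    done
    zpool create tank $vdevs

Note how, in the second form, a dead box touches 5 vdevs but costs each of them only one disk, which raidz2 tolerates.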
Even before we talk about recovery, ZFS can likely report errors better than the internal RAID, which may just hide an issue long enough for it to become a real problem during a later event.

If we can (e.g.) assign the WWNs of the exported LUNs according to a scheme that makes disk location obvious, we're less likely to get confused by all the extra disks; the structure is still apparent.

(There are more layouts we could now create using the extra disks, but we'd lose the simplicity, and they don't really enhance this example for the general case. Very careful analysis would be required, and errors under pressure might result in a situation where the system works but later resiliency is compromised. This is especially true if hot spares are involved.)

So, the ZFS preference is definitely for individual disks. What might override this preference, and cause us to use LUNs over the internal RAID, other than the perception of simplicity due to inexperience? Some possibilities are below.

Because local reconstructions within a box may be much faster than over the network. Remember, though, that we trust ZFS more than RAID5 (even before any specific implementation has a chance to add its own bugs and wrinkles). So, effectively, after such a local RAID5 reconstruction we'd want to run a scrub anyway - at which point we might as well have just let ZFS resilver. If we have more than one top-level vdev, which we certainly will in this example, a scrub is a lot more work (and network traffic) than letting the one vdev resilver.

Because our iSCSI boxes don't support hot-swap, or the dumb firmware hangs all LUNs when one drive is timing out on errors, so a drive replacement winds up amplifying into a 5-way failure regardless. Or we want to use hot spares, or some other operational consideration basically says we don't ever want to deal with hardware failures below this level of granularity. Remember, though, that recovery from a 5-way offline to fix a single failure is still a lot faster, and thus safer, because each disk only needs to be resilvered for what it missed while offline (see the sketch at the end of this mail).

Because the internal RAID in the NAS boxes is really good and fast (a fancy controller with battery-backed NVRAM we've already paid for), such that presenting them as a single fast unit performs much better. Maybe the "fancy controller" is actually ZFS and the NVRAM is an SSD, or maybe it's something else we trust pretty well. In that case, clustering the boxes into common logical storage with cross-device redundancy, using a method other than ZFS, may be more appropriate.

Our example is contrived, but these considerations apply to other connectivity types, and especially where geographical separation is involved.
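For that 5-way offline case, a sketch of what servicing a whole box could look like, again with the made-up bNNsM aliases (zpool offline/online are the standard subcommands; the rest is assumed from the example):

    # Take box 03 out of service: offline its five disks, one per vdev.
    for s in 1 2 3 4 5; do zpool offline tank b03s${s}; done

    # ...swap the power supply / box / whatever, reattach the LUNs...

    # Bring the disks back.  ZFS resilvers only the transactions each
    # disk missed while offline, not the whole disk.
    for s in 1 2 3 4 5; do zpool online tank b03s${s}; done
    zpool status tank      # the resilvers should be short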
--
Dan.