On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> I'm getting a bit tired of people designing for fast resilvering. 

It is a design consideration, regardless, though your point is valid
that it shouldn't be the overriding consideration. 

To the original question and poster: 

This often arises out of another type of consideration, that of the
size of a "failure unit".  When plugging together systems at almost
any scale beyond a handful of disks, there are many kinds of groupings
of disks whereby the whole group may disappear if a certain component
fails: controllers, power supplies, backplanes, cables, network/fabric
switches, etc.  The probability of each of these failures varies,
often greatly, but they can shape and constrain a design.

I'm going to choose a deliberately exaggerated example, to illustrate
the discussion and recommendations in the thread, using the OP's
numbers.

Let's say that I have 20 little 5-disk NAS boxes, each with its own
single power supply and NIC.  Each is an iSCSI target, and can serve
up either 5 bare-disk LUNs or a single LUN for the whole box, backed
by internal RAID.  The internal RAID can be 0 or 5. 

Clearly, a box of 5 disks is an independent failure unit, with a
non-trivial probability of failing via a variety of possible causes.
I had better plan my pool accordingly. 

The first option is to "simplify" the configuration, representing
the obvious failure unit as a single LUN, just a big disk.  There is
merit in simplicity, especially for the humans involved if they're not
sophisticated and experienced ZFS users (or else why would they be
asking these questions?). This may prevent confusion and possible
mistakes (at 3am under pressure, even experienced admins make those). 

This gives us 20 "disks" to make a pool, of whatever layout suits our
performance and resiliency needs.  Regardless of what disks are used,
a 20-way RAIDZ is unlikely to be a good answer.  2x 10-way raidz2, 4x
5-way raidz1, or 2-way and 3-way mirrors might all be useful depending
on circumstances. (As an aside, mirrors might be the layout of choice
if switch failures also need to be taken into consideration, for
practical network topologies.)
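
For concreteness, here's a rough sketch of what a couple of those
layouts could look like at the command line.  The device names
(box01..box20, one per box-LUN) are placeholders for whatever your
iSCSI initiator actually presents; treat this as an illustration, not
a recipe.

    # 2x 10-way raidz2, one box-LUN per "disk"
    zpool create tank \
        raidz2 box01 box02 box03 box04 box05 \
               box06 box07 box08 box09 box10 \
        raidz2 box11 box12 box13 box14 box15 \
               box16 box17 box18 box19 box20

    # or 10x 2-way mirrors, e.g. pairing boxes hung off different switches
    zpool create tank \
        mirror box01 box11  mirror box02 box12  mirror box03 box13 \
        mirror box04 box14  mirror box05 box15  mirror box06 box16 \
        mirror box07 box17  mirror box08 box18  mirror box09 box19 \
        mirror box10 box20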

The second option is to give ZFS all the disks individually. We will
embed our knowledge of the failure domains into the pool structure,
choosing which disks go in which vdev accordingly. 

The simplest expression of this is to take the same layout we chose
above for 20 big disks, and repeat it 5 times, once for each of the 5
disk slots in a box, so that every top-level vdev follows the same
pattern but is built from individual disks drawn from different
boxes.  Think of making 5 separate pools with the same layout as the
previous case, and stacking them together into one.  (As another
aside, in previous discussions I've also recommended considering
multiple pools vs multiple vdevs; that still applies, but I won't
reiterate it here.)
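
A minimal sketch of the same idea with individual disks, sticking with
the 2x 10-way raidz2 layout from above.  Device names like box01-d1
(box 1, disk slot 1) are again placeholders, for illustration only.

    # first two of the ten vdevs: disk slot 1 from each box
    zpool create tank \
        raidz2 box01-d1 box02-d1 box03-d1 box04-d1 box05-d1 \
               box06-d1 box07-d1 box08-d1 box09-d1 box10-d1 \
        raidz2 box11-d1 box12-d1 box13-d1 box14-d1 box15-d1 \
               box16-d1 box17-d1 box18-d1 box19-d1 box20-d1
    # then the same pair of vdevs again for slots d2 through d5, e.g.
    #   zpool add tank raidz2 box01-d2 ... box10-d2
    # giving ten top-level vdevs in total, none of which contains two
    # disks from the same box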

If our pool had enough redundancy for our needs before, we will now
have 5 times as many top-level vdevs.  If a disk box dies, failures
arrive in groups of 5, but spread one per affected vdev, so the
overall result is the same tolerable degradation as before.  

ZFS generally does better this way.  We will have more direct
concurrency, because ZFS's device tree maps to spindles rather than
to a more complex interaction of underlying components.  Physical disk
failures can now be seen by ZFS as such, and don't get amplified into
whole-LUN failures (RAID0) or performance degradation during internal
reconstruction (RAID5).  ZFS will prefer not to allocate new data on a
degraded vdev until it is repaired, but it needs to know about the
degradation in the first place.  Even before we talk about recovery,
ZFS can likely report errors better than the internal RAID, which may
just hide an issue long enough for it to become a real problem during
a later event.
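
As a small illustration of that visibility (pool name is hypothetical):
with individual disks, the normal health checks point straight at the
offending spindle, and a scrub will read-verify everything rather than
relying on the box's firmware to notice.

    # report only pools that are not healthy
    zpool status -x

    # per-device state and error counters for our pool
    zpool status -v tank

    # read and checksum every block, surfacing latent errors now rather
    # than during some later failure
    zpool scrub tank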

If we can (e.g.) assign the WWNs of the exported LUNs according to a
scheme that makes disk location obvious, we're less likely to get
confused by all the extra disks.  The structure is still apparent.  

(There are more layouts we can now create using the extra disks, but
we lose the simplicity, and they don't really enhance this example for
the general case.  Very careful analysis will be required, and
errors under pressure might result in a situation where the system
works, but later resiliency is compromised.  This is especially true
if hot-spares are involved.) 

So, the ZFS preference is definitely for individual disks.  What might
override this preference, and cause us to use LUNs backed by the
internal RAID, other than the perception of simplicity due to
inexperience?  Some possibilities are below.

Because local reconstruction within a box may be much faster than
over the network.  Remember, though, that we trust ZFS more than
RAID5 (even before any specific implementation has a chance to add its
own bugs and wrinkles).  So, effectively, after such a local RAID5
reconstruction we'd want to run a scrub anyway - at which point we
might as well just have let ZFS resilver.  If we have more than one
top-level vdev, which we certainly will in this example, a scrub is a
lot more work (and network traffic) than letting the one vdev
resilver. 
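
To put that in command terms (names hypothetical): trusting an
internal RAID5 rebuild means following it with a whole-pool scrub,
while a ZFS-level replacement resilvers only the one affected vdev.

    # after an internal RAID5 rebuild: verify everything, across every
    # top-level vdev, pulling all of it over the network
    zpool scrub tank

    # ZFS-level repair instead: the resilver touches only the vdev
    # that contained the replaced disk
    zpool replace tank box03-d1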

Because our iSCSI boxes don't support hot-swap, or the dumb firmware
hangs all LUNs when one drive is timing out on errors, so a drive
replacement winds up amplifying to a 5-way failure regardless.  Or we
want to use hot-spares, or have some other operational reason for
never wanting to deal with hardware failures below this level of
granularity.  Remember, though, that recovery from a 5-way offline to
fix a single failure is still a lot faster, and thus safer, because
each disk only needs to be resilvered for what it missed while
offline. 
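
If the whole box really must go away to swap one drive, a sketch of
how that might look from the ZFS side (device names hypothetical):

    # take box 3's five disks offline before powering the box down
    for d in box03-d1 box03-d2 box03-d3 box03-d4 box03-d5; do
        zpool offline tank $d
    done

    # ...swap the failed drive, power the box back up...

    # the four survivors resilver only what they missed while offline
    for d in box03-d2 box03-d3 box03-d4 box03-d5; do
        zpool online tank $d
    done

    # the replaced drive itself gets a normal single-vdev resilver
    zpool replace tank box03-d1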

Because the internal RAID in the NAS boxes is really good and fast (a
fancy controller with battery-backed NVRAM we've already paid for),
such that presenting each box as a single fast unit performs much
better.  Maybe the "fancy controller" is actually ZFS and the NVRAM is
an SSD, or maybe it's something else that we trust pretty well.  In
that case, clustering the boxes into common logical storage with
cross-device redundancy, using a method other than ZFS, may be more
appropriate. 

Our example is contrived, but these considerations apply to other
connectivity types as well, especially where geographical separation
is involved. 

--
Dan.
