I have a couple of questions and concerns about using ZFS in an environment
where the underlying LUNs are replicated at a block level using products like
HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted
the explanation to be clear.
(I do realise that there are other possibilities such as zfs send/recv and
there are technical and business pros and cons for the various options. I don't
want to start a 'which is best' argument :) )
The CoW design of ZFS means that it goes to great lengths to always maintain
on-disk self-consistency, and ZFS can make certain assumptions about state
(e.g. not needing fsck) based on that. This is the basis of my questions.
1) The first issue relates to the überblock. Updates to it are assumed to be
atomic, but if the replication block size is smaller than the überblock then we
can't guarantee that the whole überblock is replicated as an entity. That
could in theory result in a corrupt überblock at the
secondary.
Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS
just use an alternate überblock and rewrite the damaged one transparently?
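To show what I mean, here is my (possibly wrong) mental model of import-time
überblock selection as a toy sketch: each vdev label holds multiple überblock
copies, failed checksums disqualify a copy, and the surviving copy with the
highest txg wins. All names and structures here are illustrative, not the real
on-disk format; I am assuming a torn (partially replicated) copy simply fails
its checksum and gets skipped.

```python
# Toy sketch (my assumption, not the real on-disk layout): ZFS keeps a
# ring of überblock copies; import discards copies whose checksum fails
# and activates the valid copy with the highest transaction group (txg).
import hashlib
from dataclasses import dataclass

@dataclass
class Uberblock:
    txg: int          # transaction group this copy commits
    payload: bytes    # rest of the überblock contents
    checksum: bytes   # checksum stored alongside it

    def valid(self) -> bool:
        # A torn (partially replicated) copy fails this check.
        return hashlib.sha256(self.payload).digest() == self.checksum

def make_ub(txg: int, payload: bytes) -> Uberblock:
    return Uberblock(txg, payload, hashlib.sha256(payload).digest())

def select_active(ring):
    good = [ub for ub in ring if ub.valid()]
    return max(good, key=lambda ub: ub.txg, default=None)

# Simulate a torn write: the newest copy was only partially replicated.
ring = [make_ub(100, b"old-but-consistent"), make_ub(101, b"also-good")]
torn = make_ub(102, b"newest-state")
torn.payload = torn.payload[:6] + b"??????"   # replication tore it mid-block
ring.append(torn)

active = select_active(ring)
print(active.txg)   # prints 101: the newest copy that still checksums
```

If this model is roughly right, a torn überblock at the secondary would just
cost us the latest txg rather than the pool; I'd like confirmation of that,
and of whether the damaged copy is rewritten transparently.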
2) Assuming that the replication maintains write-ordering, the secondary site
will always have valid and self-consistent data, although it may be out-of-date
compared to the primary if the replication is asynchronous, depending on link
latency, buffering, etc.
Normally most replication systems do maintain write ordering, [i]except[/i] for
one specific scenario. If the replication is interrupted, for example
secondary site down or unreachable due to a comms problem, the primary site
will keep a list of changed blocks. When contact between the sites is
re-established there will be a period of 'catch-up' resynchronization. In
most, if not all, cases this is done on a simple block-order basis.
Write-ordering is lost until the two sites are once again in sync and routine
replication restarts.
I can see this as having a major impact on ZFS. It would be possible for
intermediate blocks to be replicated before the data blocks they point to, and
in the worst case an updated überblock could be replicated before the block
chains that it references have been copied. This breaks the assumption that
the on-disk format is always self-consistent.
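To make the concern concrete, here is a toy, vendor-neutral simulation of my
own construction (block addresses and contents are made up): the primary
writes in CoW dependency order (data block, then the indirect block pointing
at it, then the überblock), but catch-up resync replays the changed-block
list in ascending block-address order, so the secondary can transiently hold
a root that references a block which never arrived.

```python
# Toy model (hypothetical, vendor-neutral) of catch-up resynchronization.
# The primary writes in CoW dependency order; resync copies the dirty
# block list in simple block-address order instead.

primary = {}      # block address -> contents
secondary = {}

# CoW transaction at the primary while the replication link is down.
# Dependency order: leaf data -> indirect block -> überblock.
primary[900] = "new data block"          # leaf, high address
primary[500] = "indirect -> blk 900"     # parent, mid address
primary[0]   = "überblock -> blk 500"    # root, low address

dirty = sorted(primary)   # resync walks addresses in order: 0, 500, 900

# Disaster strikes after only the first two blocks have resynchronized.
for addr in dirty[:2]:
    secondary[addr] = primary[addr]

# The secondary now holds an überblock and indirect block referencing a
# data block (900) that never arrived: self-consistency is broken.
print(900 in secondary)   # prints False
print(secondary[0])       # prints "überblock -> blk 500"
```

The real products track dirty extents rather than a dict, of course, but as
far as I know the address-order replay is accurate, and that is the whole
problem.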
If a disaster happened during the 'catch-up', and the partially-resynchronized
LUNs were imported into a zpool at the secondary site, what would/could happen?
Refusal to accept the whole zpool? Rejection just of the files affected? System
panic? How could recovery from this situation be achieved?
Obviously all filesystems can suffer in this scenario, but ones that expect
less from their underlying storage (like UFS) can be fscked, and although data
that was being updated is potentially corrupt, existing data should still be
OK and usable. My concern is that ZFS will handle this scenario less well.
There are ways to mitigate this, of course, the most obvious being to take a
snapshot of the (valid) secondary before starting resync, as a fallback. This
isn't always easy to do, especially since the resync is usually automatic;
there is no clear trigger to use for the snapshot. It may also be difficult to
synchronize the snapshot of all LUNs in a pool. I'd like to better understand
the risks/behaviour of ZFS before starting to work on mitigation strategies.
Thanks
Steve
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss