"Does raidzN actually protect against bitrot?"
That's the somewhat radical, possibly provocative, question
I've found myself asking lately.

Reading up on the theory of RAID5, I grasped the idea of the
write hole (where one sector of the stripe, such as the parity
data, doesn't get written - leading to invalid data upon read).
I think the same effect applies to bitrot of data that was
written successfully and corrupted later: either way, after
reading all sectors of the stripe we don't have a valid result
(in the XOR-parity example, XORing all the bytes does not
produce zero).
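
To make my mental model concrete, here is a toy sketch (plain C,
one-byte "sectors", simple XOR parity, nothing ZFS-specific) of
what I mean: a parity check can tell us *that* the stripe is
inconsistent, but by itself says nothing about *which* sector
rotted.

  #include <stdio.h>
  #include <stddef.h>

  /* Toy RAID5-style stripe: three data "sectors" plus one XOR
   * parity "sector", each just a single byte for brevity. */
  #define NDISKS 4

  /* Returns 0 if the stripe is consistent (all bytes XOR to
   * zero), non-zero otherwise.  Note that this only says the
   * stripe is inconsistent, not which byte went bad. */
  static unsigned char stripe_xor(const unsigned char s[NDISKS])
  {
      unsigned char x = 0;
      for (size_t i = 0; i < NDISKS; i++)
          x ^= s[i];
      return x;
  }

  int main(void)
  {
      /* data, data, data, parity = XOR of the data bytes */
      unsigned char stripe[NDISKS] =
          { 0x12, 0x34, 0x56, 0x12 ^ 0x34 ^ 0x56 };

      printf("clean stripe:  %s\n", stripe_xor(stripe) ? "BAD" : "ok");

      stripe[1] ^= 0x40;    /* silent single-bit rot in one sector */
      printf("rotted stripe: %s\n", stripe_xor(stripe) ? "BAD" : "ok");
      return 0;
  }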

As I understand it, RAID5/6 generally has no mechanism to detect
*WHICH* sector was faulty if all of them were read without error
reports from the disks. It may not even test whether the parity
matches and the bytes zero out, as long as no read errors were
reported. In that sense a dead drive is better than one with
silent corruption: when one sector is known to be invalid or
absent, its contents can be reconstructed from the other sectors
and the parity data.
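
And here is the flip side of the same toy model: when we *know*
which sector is bad or missing (a dead drive, or a reported read
error), XOR-ing all the surviving sectors and the parity
recreates it exactly.

  #include <stdio.h>
  #include <stddef.h>

  #define NDISKS 4

  /* Rebuild the sector at index 'bad' by XOR-ing all the other
   * sectors plus parity.  This only works because we are told
   * *which* index is bad. */
  static unsigned char rebuild(const unsigned char s[NDISKS], size_t bad)
  {
      unsigned char x = 0;
      for (size_t i = 0; i < NDISKS; i++)
          if (i != bad)
              x ^= s[i];
      return x;
  }

  int main(void)
  {
      unsigned char stripe[NDISKS] =
          { 0x12, 0x34, 0x56, 0x12 ^ 0x34 ^ 0x56 };

      stripe[1] = 0xFF;  /* pretend disk 1 died or reported a read error */
      printf("reconstructed sector 1: 0x%02x (expected 0x34)\n",
             rebuild(stripe, 1));
      return 0;
  }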

I've seen statements (do I have to scavenge for prooflinks?)
that raidzN {sometimes or always?} likewise has no means to
detect which drive produced the bad data. In that case the
output of "zpool status" shows zero CKSUM error counts at the
leaf-disk level and non-zero counts at the raidzN level.
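
To illustrate the kind of output I mean (this snippet is mocked
up by hand to show the pattern, not pasted from my actual pool):

  NAME        STATE     READ WRITE CKSUM
  pool        ONLINE       0     0     0
    raidz2-0  ONLINE       0     0     2
      c1t0d0  ONLINE       0     0     0
      c1t1d0  ONLINE       0     0     0
      c1t2d0  ONLINE       0     0     0
      c1t3d0  ONLINE       0     0     0
      c1t4d0  ONLINE       0     0     0
      c1t5d0  ONLINE       0     0     0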

In contrast, on mirrors (which is what all the presentations
use to illustrate ZFS's on-the-fly data repairs), we always
know which copy of the data is faulty and can repair it from
a verifiably good copy, if one is present.
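
My picture of the mirror case, as a rough sketch (toy checksum
and made-up helper logic, not actual ZFS code): each copy is
verified independently against the checksum stored in the parent
block pointer, so the bad side identifies itself and can simply
be rewritten from the good side.

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  #define BLKSZ 8

  /* Toy stand-in for the block-pointer checksum (ZFS really
   * uses fletcher4/sha256 etc.; this is only a placeholder). */
  static uint32_t toy_cksum(const unsigned char *buf, size_t len)
  {
      uint32_t sum = 0;
      for (size_t i = 0; i < len; i++)
          sum = sum * 31 + buf[i];
      return sum;
  }

  int main(void)
  {
      unsigned char copy0[BLKSZ] = { 'm','i','r','r','o','r','e','d' };
      unsigned char copy1[BLKSZ] = { 'm','i','r','r','o','r','e','d' };
      /* The expected checksum lives in the parent block pointer,
       * i.e. outside both copies. */
      uint32_t expected = toy_cksum(copy0, BLKSZ);

      copy1[3] ^= 0x08;   /* silent bitrot on the second mirror side */

      unsigned char *side[2] = { copy0, copy1 };
      int good = -1, bad = -1;
      for (int i = 0; i < 2; i++) {
          if (toy_cksum(side[i], BLKSZ) == expected)
              good = i;
          else
              bad = i;
      }

      if (good >= 0 && bad >= 0) {
          memcpy(side[bad], side[good], BLKSZ);  /* heal the bad side */
          printf("repaired copy %d from copy %d\n", bad, good);
      }
      return 0;
  }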

As a real-life example, on my 6-disk raidz2 pool I see some
irreparable corruptions as well as several detected errors that
were "repaired". So I have a set of questions, outlined below...

  (DISCLAIMER: I haven't finished reading through the on-disk
  format spec in detail, but that PDF document is 5 years old
  anyway, and I've heard some things have changed since.)


1) How does raidzN protect against bit-rot short of the known,
   complete death of a component disk, if it does so at all?
   Or does it only help against "loud corruption", where the
   disk reports a sector-access error or dies completely?

2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
   that belong to a raidzN stripe) have any ZFS checksums of
   their own? That is, can ZFS determine which of the disks
   produced invalid data and reconstruct the whole stripe?
2*) How are the physical on-disk sector ranges addressed by
   ZFS? Are there special block pointers with some sort of
   physical LBA addresses in place of DVAs, and with checksums?
   I think there should be (given the claimed end-to-end
   checksumming), but I wanted to confirm.
2**) Alternatively, how does raidzN get into a situation like
   "I know there is an error somewhere, but don't know where"?
   Does this signal simultaneous failures on different disks
   of one stripe?
   How *do* some things get fixed then - can only dittoed data
   or metadata be salvaged from the second good copies on raidz?
   (A toy sketch of my current guess follows the question list.)

3) Is it true that in recent ZFS the metadata is stored in
   a mirrored layout, even for raidzN pools? That is, does
   the raidzN layout only apply to userdata blocks now?
   If "yes":
3*)  Is such mirroring applied over physical VDEVs or over
   top-level VDEVs? For a given 512/4096 bytes of a metadata
   block, are there two (ditto-mirror) or more (ditto over
   raidz) physical sectors of storage directly involved?
3**) If small blocks, sized one or a few sectors, are laid out
   in incomplete raidz stripes (e.g. 512b parity + 512b data),
   does this actually lead to +100% overhead for small data,
   and double that footprint again for dittoed data (copies=2)?
   Does this apply to metadata in particular? ;)
   Does this large factor apply to ZVOLs whose fixed block size
   is defined as "small" (i.e. down to the minimum 512b/4k
   available for these disks)?
3***) In fact, for the considerations above, what is metadata? :)
   Is it only the tree of blockpointers, or is it all the two
   or three dozen block types except userdata (ZPL file, ZVOL
   block) and unallocated blocks?
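
Regarding question 2**, my current guess (which I'd be glad to
have confirmed or shot down) is that raidzN does not checksum
each leaf sector separately, but can instead retry the
reconstruction while assuming each disk in turn to be the liar,
checking every attempt against the block-pointer checksum; only
when no such assumption yields a matching checksum would it end
up with "error somewhere, location unknown". Presumably raidz2/3
would try combinations of up to two/three disks. A toy
single-parity sketch of that guess (same toy checksum as above):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  #define NDISKS 4              /* 3 data "sectors" + 1 XOR parity */
  #define NDATA  (NDISKS - 1)

  /* Illustrative placeholder for the block-pointer checksum. */
  static uint32_t toy_cksum(const unsigned char *buf, size_t len)
  {
      uint32_t sum = 0;
      for (size_t i = 0; i < len; i++)
          sum = sum * 31 + buf[i];
      return sum;
  }

  /* Rebuild sector 'bad' from the others, then return the
   * checksum of the resulting data portion. */
  static uint32_t try_assuming_bad(const unsigned char s[NDISKS], int bad)
  {
      unsigned char fixed[NDISKS];
      memcpy(fixed, s, NDISKS);

      fixed[bad] = 0;
      for (int i = 0; i < NDISKS; i++)
          if (i != bad)
              fixed[bad] ^= fixed[i];

      return toy_cksum(fixed, NDATA);  /* checksum covers data only */
  }

  int main(void)
  {
      unsigned char stripe[NDISKS] =
          { 0x11, 0x22, 0x33, 0x11 ^ 0x22 ^ 0x33 };
      uint32_t expected = toy_cksum(stripe, NDATA);  /* from the blkptr */

      stripe[2] ^= 0x04;     /* silent rot; no disk reports an error */

      for (int bad = 0; bad < NDISKS; bad++) {
          if (try_assuming_bad(stripe, bad) == expected) {
              printf("assuming disk %d is bad yields a matching "
                     "checksum\n", bad);
              return 0;
          }
      }
      printf("no single-disk assumption matches: error somewhere, "
             "location unknown\n");
      return 0;
  }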


AND ON A SIDE NOTE:

I do hope to see answers from the gurus on the list to these
and other questions I posed recently.

One frequently cited weakness of ZFS is the relatively small
pool of engineering talent knowledgeable enough to hack on ZFS
and develop new features (i.e. the ex-Sun engineers and a very
few other determined individuals): "We might do this, but we
have few resources and already have other, more pressing
priorities".

I think there is a lot more programming talent in the wider
user/hacker community around ZFS, including active askers on
this list, the Linux/BSD porters, and probably many more people
who just occasionally hit upon our discussions here by googling
their questions. I mean programmers ready to dedicate some time
to ZFS who are held back by not fully understanding the
architecture, and so never start developing (so as not to make
matters worse). The knowledge barrier to start coding is quite
high.

I do hope that, instead of spending weeks building a new
feature themselves, the development gurus could spend a day
writing replies to questions like mine (and many others'), and
then someone in the community would come up with a reasonable
POC or finished code for new features and improvements.

It is like education. Take math: many talented mathematicians
have spent thousands of man-years developing and refining the
theory which we now learn over 3 or 6 years at a university.
Maybe we only skim the overhead slides in lectures, but we gain
enough understanding to dig into any more specific subject
ourselves.

Likewise with open source: yes, the code is there. A developer
might read through it and perhaps comprehend some of it in a
year or so. Or he could spend a few days midway (once he knows
enough to pose hard questions not yet answerable by googling
some FAQ) in yes/no question sessions with the more
knowledgeable people, and become ready to work just a few weeks
from the start. Wouldn't that be wonderful for ZFS in general? :)

Thanks in advance,
//Jim Klimov
