Thanks again for answering! :)

2012-01-16 10:08, Richard Elling wrote:
On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:

"Does raidzN actually protect against bitrot?"
That's a somewhat radical, possibly provocative, way of putting
the question that I've arrived at lately.

Simple answer: no. raidz provides data protection. Checksums verify
data is correct. Two different parts of the storage solution.

Meaning: a data-block checksum mismatch lets ZFS detect an error;
afterwards, trying raidz permutations until one matches the checksum
allows it to be fixed (if enough redundancy is available)? Right?

raidz uses an algorithm to try permutations of data and parity to
verify against the checksum. Once the checksum matches, repair
can begin.

Ok, nice to have this statement confirmed so many times now ;)
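
To be sure I picture the mechanism right, here is a toy sketch I put
together (Python, purely illustrative - not ZFS source; the sha256
call merely stands in for the real fletcher4/sha256 checksums, and it
only handles single parity): assume each column in turn returned bad
data, rebuild it from parity, and accept the first candidate whose
reassembled data matches the checksum from the block pointer.

    import hashlib
    from functools import reduce

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def checksum(data):
        return hashlib.sha256(data).digest()   # stand-in for fletcher4/sha256

    def try_reconstruct(parity, data_cols, expected_cksum):
        # First, hope nothing is wrong.
        if checksum(b"".join(data_cols)) == expected_cksum:
            return data_cols, None
        # Otherwise assume each data column in turn returned bad data,
        # rebuild it as parity XOR (all other data columns), re-check.
        for bad in range(len(data_cols)):
            others = [c for i, c in enumerate(data_cols) if i != bad]
            rebuilt = reduce(xor, others, parity)
            candidate = data_cols[:bad] + [rebuilt] + data_cols[bad + 1:]
            if checksum(b"".join(candidate)) == expected_cksum:
                return candidate, bad          # 'bad' is the column to repair
        return None, None                      # more damage than parity covers

    # Example: silently corrupt one column and let the trial find it.
    cols = [b"AAAA", b"BBBB", b"CCCC"]
    par = reduce(xor, cols)
    good = checksum(b"".join(cols))
    cols[1] = b"XXXX"
    fixed, bad_col = try_reconstruct(par, cols, good)
    print(bad_col)                             # -> 1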

How do per-disk cksum errors get counted for raidz, then? For
fixable errors, the permutation that finally matches tells us which
disk:sector returned mismatching data? Likewise, for unfixable errors
we can't know which disk is at fault - unless one had explicitly
returned an I/O error?

So, if my 6-disk raidz2 couldn't fix the error, it either occurred
on three disks' portions of one stripe, or in RAM/CPU (a SPOF) before
the data and checksum were written to disk? In the latter case no
single disk is at fault for returning bad data, so the per-disk cksum
counters stay at zero? ;)
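
(As a rough upper bound for my own curiosity - assuming the
combinatorial pass simply tries every one- and two-column "bad"
guess on a 6-column raidz2 stripe before giving up, which may not
be exactly the set the real code tries:

    from math import comb
    # 6 columns (4 data + 2 parity): all 1- and 2-column combinations
    print(comb(6, 1) + comb(6, 2))   # 21 reconstruction attempts at most

so the search space per block is tiny.)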


2*) How are the sector ranges on the physical disks addressed by
ZFS? Are there special block pointers with some sort of
physical LBA addresses in place of DVAs, and with checksums?
I think there should be (given the claimed end-to-end checksumming),
but I wanted to confirm.

No.

Ok, so basically there is the vdev_raidz_map_alloc() algorithm
that converts DVAs into leaf-vdev addresses, and it is always
the same for every raidz?

For example, this lack of explicit addressing means ZFS could not
relocate one disk's bad media sector to another location - the disk
itself is always expected to do that remapping reliably and successfully?
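
Here is how I picture it, as a toy sketch (Python, deliberately not
the actual ZFS layout - no parity rotation, skip sectors or column
sizing): the point being that the child disk and on-disk offset for
every sector of a raidz block are derived purely arithmetically from
the DVA offset and the vdev geometry, with no per-sector remapping
table anywhere. Please correct me if that mental model is off.

    def toy_raidz_layout(dva_offset, nsectors, ndisks, ashift=9):
        sector = 1 << ashift
        start = dva_offset >> ashift               # starting sector index on the vdev
        layout = []
        for i in range(nsectors):                  # parity + data sectors of one block
            idx = start + i
            child = idx % ndisks                   # which leaf disk
            child_off = (idx // ndisks) * sector   # byte offset on that leaf
            layout.append((child, child_off))
        return layout

    # e.g. a 3-sector allocation (1 data + 2 parity) on a 6-disk vdev
    print(toy_raidz_layout(0x6000, 3, 6))
    # [(0, 4096), (1, 4096), (2, 4096)]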


2**) Alternatively, how does raidzN get into a situation like
"I know there is an error somewhere, but don't know where"?
Does this signal simultaneous failures on different disks
of one stripe?
How *do* some things get fixed then - can only dittoed data
or metadata be salvaged from their second, good copies on raidz?

No. See the seminal blog on raidz
http://blogs.oracle.com/bonwick/entry/raid_z


3) Is it true that in recent ZFS the metadata is stored in
a mirrored layout, even for raidzN pools? That is, does
the raidzN layout only apply to userdata blocks now?
If "yes":

Yes, for Solaris 11. No, for all other implementations, at this time.

Are there plans to do this for illumos, etc.?
I thought that my oi_148a's disks' I/O patterns matched the
idea of mirrored metadata; now I'll have to find another
explanation for those observations ;)


3*) Is such mirroring applied over physical VDEVs or over
top-level VDEVs? For a given 512/4096-byte metadata block,
are there two (ditto-mirror) or more (ditto copies over
raidz) physical sectors of storage directly involved?

It is done in the top-level vdev. For more information see the manual,


      What's New in ZFS? - Oracle Solaris ZFS Administration Guide
      http://docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html

3**) If small blocks, sized one or a few sectors, are fanned out
in incomplete raidz stripes (e.g. 512b parity + 512b data),
does this actually lead to +100% overhead for small data,
and double that (200%) for dittoed data with copies=2?

The term "incomplete" does not apply here. The stripe written is
complete: data + parity.

Just to clarify, I meant variable-width stripes as opposed
to "full-width stripe" writes in other RAIDs. That is, to
update one sector of data on a 6-disk raid6 I'd need to
write a full 6-sector stripe, while on raidz2 I only need to
write three sectors (one data plus two parity).
No extra reply solicited here ;)
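
For my own notes, the raw-space arithmetic I have in mind (my
understanding only, not authoritative - including the rounding of
each raidz allocation up to a multiple of nparity+1 sectors):

    def raidz_alloc_sectors(data_sectors, nparity, copies=1):
        per_copy = data_sectors + nparity            # data plus its parity
        mult = nparity + 1
        per_copy = -(-per_copy // mult) * mult       # round up to a multiple
        return per_copy * copies

    for p in (1, 2):
        used = raidz_alloc_sectors(1, p)
        print(f"raidz{p}: 1 data sector -> {used} sectors allocated "
              f"({(used - 1) * 100}% overhead)")
    # raidz1: 1 data sector -> 2 sectors allocated (100% overhead)
    # raidz2: 1 data sector -> 3 sectors allocated (200% overhead)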

Does this apply to metadata in particular? ;)

lost context here, for non-Solaris 11 implementations, metadata is
no different than data with copies=[23]

The question here was whether writes of metadata (assumed
to be a small number of sectors, down to one per block)
incur writes of parity, of ditto copies, or of both,
increasing storage requirements several times over.

One background thought was that I wanted to make sense of
last year's experience with a zvol whose blocksize was
1 sector (4kb), where the metadata overhead (consumption of
free space) was about the same as the userdata size. At that
time I thought it was because I had a 1-sector metadata
block addressing each 1-sector data block of the volume;
but now I think the overhead would be closer to 400% of
the userdata size...
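
A hedged back-of-the-envelope of where such a figure could come from,
using my own assumed parameters (one small metadata block per 1-sector
data block on a raidz2 vdev, metadata copies=2) rather than anything
measured - mostly to show how much the answer swings depending on
whether parity is charged against the metadata sectors too:

    def raw_per_user_sector(nparity=2, meta_blocks_per_data=1,
                            meta_copies=2, parity_on_meta=True):
        data = 1 + nparity                                 # data block + its parity
        meta_each = 1 + (nparity if parity_on_meta else 0)
        meta = meta_blocks_per_data * meta_copies * meta_each
        return data + meta

    for parity_on_meta in (False, True):
        total = raw_per_user_sector(parity_on_meta=parity_on_meta)
        print(f"parity on metadata: {parity_on_meta} -> {total} raw "
              f"sectors per user sector ({(total - 1) * 100}% overhead)")
    # False -> 5 raw sectors (400% overhead); True -> 9 raw sectors (800%)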


Does this large factor apply to ZVOLs whose fixed block
size is defined as "small" (i.e. down to the minimum 512b/4k
available for these disks)?

NB, there are a few slides in my ZFS tutorials where we talk about this.
http://www.slideshare.net/relling/usenix-lisa11-tutorial-zfs-a

3***) In fact, for the considerations above, what is metadata? :)
Is it only the tree of blockpointers, or is it all of the two
or three dozen block types except userdata (ZPL file data, ZVOL
blocks) and unallocated space?

It is metadata; there is quite a variety. For example, there is the MOS,
zpool history, DSL configuration, etc.

Yes, there's a big table on the DMU page... So metadata
is indeed everything except userdata and empty space? ;)


AND ON A SIDE NOTE:

I do hope to see answers from the gurus on the list to these
and other questions I posed recently.
I think there is a lot more programming talent in the greater...

Agree 110%
-- richard

Thanks for the support; I'll look into the videos and the
slides and blogs you referenced.

Do you have a chance to comment on-list about the ZFS patent and
licensing FUD - how many of the fears have a real-life foundation,
and which can be dismissed? I.e., after the community makes ZFS
even greater, can Oracle or NetApp pull the rug out and claim
it's all theirs? :)

//Jim

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
