
Bill Moore wrote:
Thanks, Chris, for digging into this and sharing your results.  These
seemingly stranded sectors are properly accounted for in terms of space
utilization: they are genuinely unusable if we are to maintain
integrity in the face of a single drive failure.

The way the RAID-Z space accounting works is this:

    1) Take the size of your data block (4k in your example) and figure
       out how much parity you need to protect it.  This turns out to
       be 3 sectors, for a total of 11 (5.5k).  See vdev_raidz_asize()
       for details.
    2) For single-parity RAID-Z, round up to a multiple of 2 sectors,
       and for double-parity RAID-Z, round up to a multiple of 3
       sectors.  This becomes ASIZE (6k in your case).  The reason
       for this is a bit complicated, but without this roundup, you can
       end up with stranded sectors that are unallocated and unusable,
       leading to the question, "I still have free space, why can't I
       write a file?"  We simply account for for these roundup sectors
       as part of the allocation that caused them.
    3) Allocate space for ASIZE bytes from the RAID-Z space map.  With
       the first-fit allocator, this aligns the write to the greatest
       power of 2 that evenly divides ASIZE (2k in this case).
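
Putting those three steps together, here is a rough sketch of the
arithmetic.  This is not the actual vdev_raidz_asize() code; it assumes
512-byte sectors, and raidz_asize_sketch() is just a made-up name.  For
the 4k block on the 4-disk raidz above it works out to ASIZE = 6k with
2k alignment:

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Sketch of the allocation-size arithmetic described above, not
     * the real vdev_raidz_asize().  Assumes 512-byte sectors; ndisks
     * is the number of disks in the RAID-Z group, nparity is 1 for
     * raidz or 2 for raidz2.
     */
    static uint64_t
    raidz_asize_sketch(uint64_t psize, uint64_t ndisks, uint64_t nparity)
    {
        uint64_t sectors = (psize + 511) / 512;   /* step 1: data sectors (8) */
        uint64_t ndata = ndisks - nparity;
        uint64_t mult = nparity + 1;

        /* step 1: add nparity parity sectors per row of data (8 + 3 = 11) */
        sectors += nparity * ((sectors + ndata - 1) / ndata);

        /* step 2: round up to a multiple of (nparity + 1) sectors (11 -> 12) */
        sectors = ((sectors + mult - 1) / mult) * mult;

        return (sectors * 512);                   /* ASIZE in bytes (6k) */
    }

    int
    main(void)
    {
        uint64_t asize = raidz_asize_sketch(4096, 4, 1);

        /* step 3: first-fit aligns to the largest power of 2 dividing ASIZE */
        uint64_t align = asize & -asize;          /* 2k here */

        printf("ASIZE = %llu bytes, alignment = %llu bytes\n",
            (unsigned long long)asize, (unsigned long long)align);
        return (0);
    }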

With all this in mind, what winds up happening is exactly what Chris
surmised.  In this illustration, "A" represents a single sector of data,
"A." indicates its parity, and "X" marks a roundup sector that is
allocated but holds no data.

        Disk   A   B   C   D
        --------------------
     LBA   0   A.  A   A   A
           1   A.  A   A   A
           2   A.  A   A   X
           3   B.  B   B   B
           4   B.  B   B   B
           5   B.  B   B   X

In the interim, does it make sense to offer a simple rule of thumb?
For example, in the above case, I would not have the hole if I did
any of the following:
        1. add one disk
        2. remove one disk
        3. use raidz2 instead of raidz
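
As a quick way to sanity-check these cases against the accounting above
(a rough sketch only, assuming 512-byte sectors; raidz_skip_sectors() is
a hypothetical helper, not anything in the ZFS source), one can count
the roundup sectors a single block would leave behind:

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Roundup ("skip") sectors left behind by one block of psize bytes
     * on a RAID-Z group of ndisks disks with nparity parity sectors
     * per row.  Same arithmetic as Bill's steps 1 and 2.
     */
    static uint64_t
    raidz_skip_sectors(uint64_t psize, uint64_t ndisks, uint64_t nparity)
    {
        uint64_t data = (psize + 511) / 512;
        uint64_t ndata = ndisks - nparity;
        uint64_t used = data + nparity * ((data + ndata - 1) / ndata);
        uint64_t mult = nparity + 1;
        uint64_t alloc = ((used + mult - 1) / mult) * mult;

        return (alloc - used);
    }

    int
    main(void)
    {
        uint64_t d;

        /* 4k blocks on single-parity RAID-Z groups of 3, 4, and 5 disks */
        for (d = 3; d <= 5; d++)
            printf("%llu disks: %llu skip sector(s)\n",
                (unsigned long long)d,
                (unsigned long long)raidz_skip_sectors(4096, d, 1));
        return (0);
    }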

More generally, I could suggest that we use an odd number of vdevs
for raidz and an even number for mirrors and raidz2.
Thoughts?
 -- richard