> On Jun 7, 2019, at 12:15 PM, Mike Gerdts <mike.ger...@joyent.com> wrote:
> On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens <mahr...@delphix.com 
> <mailto:mahr...@delphix.com>> wrote:
> On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts <mike.ger...@joyent.com 
> <mailto:mike.ger...@joyent.com>> wrote:
> I'm motivated to make zfs set refreservation=auto do the right thing in the 
> face of raidz and 4k physical blocks, but have data points that provide 
> inconsistent data.  Experimentation shows raidz2 parity overhead that matches 
> my expectations for raidz1.
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> In the spreadsheet 
> <https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=930519344>
>  from Matt's How I Learned to Stop Worrying and Love RAIDZ 
> <https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz>
>  blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the 
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k 
> or 8k should both end up taking up 30 gig of space.
> That makes sense to me as well.
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`
> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk0
> NAME            PROPERTY           VALUE      SOURCE
> zones/mg/disk0  used               21.4G      -
> zones/mg/disk0  referenced         21.4G      -
> zones/mg/disk0  logicalused        10.0G      -
> zones/mg/disk0  logicalreferenced  10.0G      -
> zones/mg/disk0  volblocksize       8K         default
> zones/mg/disk0  refreservation     10.3G      local
> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk1
> NAME            PROPERTY           VALUE      SOURCE
> zones/mg/disk1  used               21.4G      -
> zones/mg/disk1  referenced         21.4G      -
> zones/mg/disk1  logicalused        10.0G      -
> zones/mg/disk1  logicalreferenced  10.0G      -
> zones/mg/disk1  volblocksize       4K         -
> zones/mg/disk1  refreservation     10.6G      local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
>         NAME                       STATE     READ WRITE CKSUM
>         zones                      ONLINE       0     0     0
>           raidz2-0                 ONLINE       0     0     0
>             c0t55CD2E404C314E1Ed0  ONLINE       0     0     0
>             c0t55CD2E404C314E85d0  ONLINE       0     0     0
>             c0t55CD2E404C315450d0  ONLINE       0     0     0
>             c0t55CD2E404C31554Ad0  ONLINE       0     0     0
>             c0t55CD2E404C315BB6d0  ONLINE       0     0     0
>             c0t55CD2E404C315BCDd0  ONLINE       0     0     0
>             c0t55CD2E404C315BFDd0  ONLINE       0     0     0
>             c0t55CD2E404C317724d0  ONLINE       0     0     0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
>             ashift=000000000000000c
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on. 
> That's right.  And in the case of volblocksize=8K, you have 2 data + 2 parity 
> + 2 pad = 6 sectors = 24K allocated.
> I seem to be only seeing 100% overhead for parity plus a little for metadata 
> and its parity.
> What fundamental concept am I missing?
> The spreadsheet shows how much space will be allocated, which is reflected in 
> the zpool `allocated` property.  However, you are looking at the zfs `used` 
> and `referenced` properties.  These properties (as well as `available` and 
> all other zfs (not zpool) accounting values) take into account the expected 
> RAIDZ overhead, which is calculated assuming 128K logical size blocks.  This 
> means that zfs accounting hides the parity (and padding) overhead when the 
> block size is around 128K.  Other block sizes may see (typically only 
> slightly) more or less space consumed than expected (e.g. if the `recordsize` 
> property has been changed, a 1GB file may have zfs `used` of 0.9G, or 1.1G).
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide RAIDZ2 
> is 41% (which is around what the RAID6 overhead would be, 2/6 = 33%).  This 
> is taken into account in the "RAID-Z deflation ratio" (`vdev_deflate_ratio`). 
>  In other words, `used = allocated / 1.41`.  If we undo that, we get `21.4G * 
> 1.41 = 30.2G`, which is around what we expected.
> Thanks for that - it should give me enough of a clue that I can coax 
> zvol_volsize_to_reservation() to give a more appropriate number.
> Now for the follow-up question:
> How should something like this be made available in public interfaces?  A 
> trivial idea would be to simply update zvol_volsize_to_reservation() to take 
> the maximum of the size required on each non-leaf vdev.  I'm hesitant to 
> think that is the complete solution as it could become inaccurate in many 
> ways.  For instance:
> - zfs send | recv to a pool with a different vdev layout
> - a raidz2-9 vdev is added to a pool made up of raidz2-8 vdevs
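
For reference, the arithmetic quoted above can be sketched in a few lines of
Python. This is a simplified model, not the kernel code: `raidz_asize()` is
modeled on the allocation rule Matt describes (data sectors, plus parity per
stripe row, padded to a multiple of nparity + 1), the deflation ratio is
computed in floating point from a 128K block, and metadata is ignored. The
function names and the `zfs_used()` helper are made up for illustration.

```python
KiB = 1 << 10
GiB = 1 << 30

def raidz_asize(psize, ndisks, nparity, ashift):
    """Bytes allocated on a raidz vdev for one block of psize bytes."""
    sectors = ((psize - 1) >> ashift) + 1          # data sectors
    data_cols = ndisks - nparity
    # nparity parity sectors for every (possibly partial) row of data
    sectors += nparity * ((sectors + data_cols - 1) // data_cols)
    # pad the allocation up to a multiple of nparity + 1 sectors
    unit = nparity + 1
    sectors = ((sectors + unit - 1) // unit) * unit
    return sectors << ashift

def deflate_ratio(ndisks, nparity, ashift):
    """Expected raidz overhead factor, assuming 128K logical blocks."""
    return raidz_asize(128 * KiB, ndisks, nparity, ashift) / (128 * KiB)

def zfs_used(volsize, volblocksize, ndisks, nparity, ashift):
    """Approximate zfs `used` for a fully written zvol (no metadata)."""
    nblocks = volsize // volblocksize
    allocated = nblocks * raidz_asize(volblocksize, ndisks, nparity, ashift)
    return allocated / deflate_ratio(ndisks, nparity, ashift)

if __name__ == "__main__":
    # 8-wide raidz2, ashift=12, as in the pool shown above
    for vbs in (4 * KiB, 8 * KiB):
        asize = raidz_asize(vbs, 8, 2, 12)
        used = zfs_used(10 * GiB, vbs, 8, 2, 12) / GiB
        print(f"volblocksize={vbs} asize={asize} used={used:.2f}G")
```

Under these assumptions both volblocksize=4k and 8k allocate three times their
logical size (12K and 24K per block), the deflation ratio comes out at about
1.41, and a fully written 10G zvol shows roughly 21.3G used, which is in line
with the 21.4G reported above once some metadata is added.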

Long ago, some on my team had a big discussion on how to approach this.
It gets much more complicated because you can legitimately build a pool with
physical block size = 512 for some top-level vdevs and physical block size = 4k
for other top-level vdevs. After dancing around the maypole for a while, we
handled it in the custom control plane we built. It is not clear to me that
retrofitting something onto the generic zfs command will be intuitive enough
to cover the myriad cases.
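
To illustrate why mixed top-level vdevs complicate this, here is a rough
Python sketch of the "take the maximum over each vdev" idea from the quoted
text. It is only a model under stated assumptions: each top-level vdev is a
raidz vdev described by a hypothetical (ndisks, nparity, ashift) tuple, the
allocation rule is the simplified one discussed earlier in the thread,
deflation is assumed to be per vdev and based on 128K blocks, and mirrors,
metadata and `copies` are not modeled. The names `deflated_size()` and
`worst_case_reservation()` are invented for this example.

```python
KiB = 1 << 10
GiB = 1 << 30

def raidz_asize(psize, ndisks, nparity, ashift):
    """Bytes allocated on a raidz vdev for one block of psize bytes."""
    sectors = ((psize - 1) >> ashift) + 1
    data_cols = ndisks - nparity
    sectors += nparity * ((sectors + data_cols - 1) // data_cols)
    unit = nparity + 1
    sectors = ((sectors + unit - 1) // unit) * unit
    return sectors << ashift

def deflated_size(volsize, volblocksize, vdev):
    """Space charged to the dataset if every block lands on this vdev."""
    ndisks, nparity, ashift = vdev
    per_block = raidz_asize(volblocksize, ndisks, nparity, ashift)
    ratio = raidz_asize(128 * KiB, ndisks, nparity, ashift) / (128 * KiB)
    return (volsize // volblocksize) * per_block / ratio

def worst_case_reservation(volsize, volblocksize, vdevs):
    """Largest per-vdev estimate across all top-level vdevs."""
    return max(deflated_size(volsize, volblocksize, v) for v in vdevs)

if __name__ == "__main__":
    # Two 8-wide raidz2 vdevs: one with 512-byte sectors, one with 4k sectors
    vdevs = [(8, 2, 9), (8, 2, 12)]
    for v in vdevs:
        print(v, f"{deflated_size(10 * GiB, 8 * KiB, v) / GiB:.2f}G")
    print("worst case:",
          f"{worst_case_reservation(10 * GiB, 8 * KiB, vdevs) / GiB:.2f}G")
```

In this model the same 10G volume with volblocksize=8k is charged about 11.2G
on the ashift=9 vdev and about 21.3G on the ashift=12 vdev, so a reservation
sized for one layout can be far off for the other.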

> The replication case could lead to a failed receive due to quota problems.  
> Both scenarios would break the way that refreservation is automatically 
> changed when volsize changes.  It seems this is already a problem in the face 
> of changing the value of copies.
> One idea I had was to store a property that could take values "auto", 
> "mirror", "raidz-N", "raidz2-N", or "raidz3-N".  N specifies the total number 
> of disks in a raidz* vdev.  When volsize changes, this new property would be 
> consulted to determine which algorithm to use when changing refreservation.  
> In an environment where refreservations and replication between dissimilar 
> pools are important, the admin could choose the worst-case pool layout to 
> plan for.  Changing to a different refreservation algorithm would recalculate 
> refreservation using that algorithm.  We'd also need to work out what to do 
> about volumes that were sized with "auto" and then a less space-efficient vdev 
> is added.
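
As a small illustration of the proposed values, the following Python sketch
parses such a layout hint into something a sizing algorithm could consume.
No such property exists today; the function name, the returned tuple shape
and the validation rules are all assumptions made for this example, and
"raidz-N" is taken to mean single parity.

```python
import re

def parse_layout_hint(value):
    """Parse a hypothetical layout hint.

    Returns (kind, nparity, ndisks); nparity and ndisks are None for
    "auto" and "mirror".
    """
    if value in ("auto", "mirror"):
        return (value, None, None)
    m = re.fullmatch(r"raidz([23]?)-(\d+)", value)
    if m is None:
        raise ValueError(f"unrecognized layout hint: {value!r}")
    nparity = int(m.group(1) or 1)   # "raidz-N" means single parity
    ndisks = int(m.group(2))         # N is the total number of disks
    if ndisks <= nparity:
        raise ValueError(f"too few disks for {value!r}")
    return ("raidz", nparity, ndisks)

if __name__ == "__main__":
    for hint in ("auto", "mirror", "raidz-5", "raidz2-8", "raidz3-11"):
        print(hint, parse_layout_hint(hint))
```

An admin planning for replication between dissimilar pools would then pick
the hint for the least space-efficient layout, and the parsed (nparity,
ndisks) pair would feed whichever refreservation calculation is chosen.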

It is more complicated for filesystems, too. Volumes are just lucky that all
of their blocks are the same size. If you write a 1.5k file, it will consume
24k of allocated space plus metadata overhead.

> I'm happy to chat on slack as well, if that's easier.

It is, start a thread
 -- richard

> Regards,
> Mike
> openzfs <https://openzfs.topicbox.com/latest> / openzfs-developer / see 
> discussions <https://openzfs.topicbox.com/groups/developer> + participants 
> <https://openzfs.topicbox.com/groups/developer/members> + delivery options 
> <https://openzfs.topicbox.com/groups/developer/subscription>Permalink 
> <https://openzfs.topicbox.com/groups/developer/Tf89af487ee658da3-Mf695dbb2999deb0eeb773758>
