> On Jun 7, 2019, at 12:15 PM, Mike Gerdts <mike.ger...@joyent.com> wrote:
> 
> On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens <mahr...@delphix.com> wrote:
> On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts <mike.ger...@joyent.com> wrote:
> I'm motivated to make zfs set refreservation=auto do the right thing in the
> face of raidz and 4k physical blocks, but I have data points that don't quite
> add up.  Experimentation shows raidz2 parity overhead that matches what I
> would expect from raidz1.
> 
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> 
> In the spreadsheet
> <https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=930519344>
> from Matt's How I Learned to Stop Worrying and Love RAIDZ
> <https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz>
> blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k
> or 8k should end up taking 30 gig of space either way.
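> (Spelled out, a 200% parity-and-padding cost means allocated space is about
> 3x the logical size, so 10 gig of zvol data should allocate roughly 30 gig
> on the raidz2 vdev.)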
> 
> That makes sense to me as well.
>  
> 
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`
> 
> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk0
> NAME            PROPERTY           VALUE      SOURCE
> zones/mg/disk0  used               21.4G      -
> zones/mg/disk0  referenced         21.4G      -
> zones/mg/disk0  logicalused        10.0G      -
> zones/mg/disk0  logicalreferenced  10.0G      -
> zones/mg/disk0  volblocksize       8K         default
> zones/mg/disk0  refreservation     10.3G      local
> $ zfs get used,referenced,logicalused,logicalreferenced,volblocksize,refreservation zones/mg/disk1
> NAME            PROPERTY           VALUE      SOURCE
> zones/mg/disk1  used               21.4G      -
> zones/mg/disk1  referenced         21.4G      -
> zones/mg/disk1  logicalused        10.0G      -
> zones/mg/disk1  logicalreferenced  10.0G      -
> zones/mg/disk1  volblocksize       4K         -
> zones/mg/disk1  refreservation     10.6G      local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
>         NAME                       STATE     READ WRITE CKSUM
>         zones                      ONLINE       0     0     0
>           raidz2-0                 ONLINE       0     0     0
>             c0t55CD2E404C314E1Ed0  ONLINE       0     0     0
>             c0t55CD2E404C314E85d0  ONLINE       0     0     0
>             c0t55CD2E404C315450d0  ONLINE       0     0     0
>             c0t55CD2E404C31554Ad0  ONLINE       0     0     0
>             c0t55CD2E404C315BB6d0  ONLINE       0     0     0
>             c0t55CD2E404C315BCDd0  ONLINE       0     0     0
>             c0t55CD2E404C315BFDd0  ONLINE       0     0     0
>             c0t55CD2E404C317724d0  ONLINE       0     0     0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
>             ashift=000000000000000c
> 
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> 
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on. 
> 
> That's right.  And in the case of volblocksize=8K, you have 2 data + 2 parity 
> + 2 pad = 6 sectors = 24K allocated.
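> 
> For anyone who wants to plug in other block sizes, here is a rough standalone
> model of that allocation math.  This is my own simplification, not the in-tree
> vdev_raidz_asize() code, and it ignores gang blocks and the like:
> 
> #include <stdio.h>
> #include <stdint.h>
> 
> /*
>  * Rough model of raidz allocation for one block: round the block up to
>  * whole sectors, add parity for each stripe row, then pad the total to
>  * a multiple of nparity + 1 sectors.
>  */
> static uint64_t
> raidz_asize(uint64_t psize, uint64_t ashift, uint64_t ndisks, uint64_t nparity)
> {
>         uint64_t ndata = ndisks - nparity;
>         uint64_t asize = ((psize - 1) >> ashift) + 1;           /* data sectors */
> 
>         asize += nparity * ((asize + ndata - 1) / ndata);       /* parity sectors */
>         asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);  /* padding */
>         return (asize << ashift);                               /* bytes */
> }
> 
> int
> main(void)
> {
>         /* 8-wide raidz2, ashift=12 */
>         printf("4K block   -> %llu bytes\n",
>             (unsigned long long)raidz_asize(4096, 12, 8, 2));    /* 12288 */
>         printf("8K block   -> %llu bytes\n",
>             (unsigned long long)raidz_asize(8192, 12, 8, 2));    /* 24576 */
>         printf("128K block -> %llu bytes\n",
>             (unsigned long long)raidz_asize(131072, 12, 8, 2));  /* 184320 */
>         return (0);
> }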
>  
> I seem to be only seeing 100% overhead for parity plus a little for metadata 
> and its parity.
> 
> What fundamental concept am I missing?
> 
> The spreadsheet shows how much space will be allocated, which is reflected in 
> the zpool `allocated` property.  However, you are looking at the zfs `used` 
> and `referenced` properties.  These properties (as well as `available` and 
> all other zfs (not zpool) accounting values) take into account the expected 
> RAIDZ overhead, which is calculated assuming 128K logical size blocks.  This 
> means that zfs accounting hides the parity (and padding) overhead when the 
> block size is around 128K.  Other block sizes may see (typically only 
> slightly) more or less space consumed than expected (e.g. if the `recordsize` 
> property has been changed, a 1GB file may have zfs `used` of 0.9G, or 1.1G).
> 
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide RAIDZ2
> is 41% (which is in the neighborhood of the ideal RAID6 overhead of 2/6 =
> 33%).  This is taken into account in the "RAID-Z deflation ratio"
> (`vdev_deflate_ratio`).
>  In other words, `used = allocated / 1.41`.  If we undo that, we get `21.4G * 
> 1.41 = 30.2G`, which is around what we expected.
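> 
> (Where the 1.41 comes from: a 128K block is 32 data sectors at ashift=12; at
> 6 data sectors per raidz2 row that takes 6 rows, so 12 parity sectors, and
> the 44-sector total rounds up to 45 to stay a multiple of 3.  45/32 = 1.40625,
> the same ratio you get from the 184320 bytes the sketch above prints for a
> 128K block.)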
> 
> Thanks for that - it should give me enough of a clue that I can coax 
> zvol_volsize_to_reservation() to give a more appropriate number.
> 
> Now for the follow-up question:
> 
> How should something like this be made available in public interfaces?  A
> trivial idea would be to update zvol_volsize_to_reservation() to take the
> maximum of the sizes required across the pool's top-level vdevs (rough sketch
> after the list below).  I'm hesitant to call that a complete solution, since
> it could become inaccurate in many ways.  For instance:
> 
> - zfs send | recv to a pool with a different vdev layout
> - a raidz2-9 vdev is added to a pool made up of raidz2-8 vdevs
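> 
> Roughly what I have in mind, as a strawman only: the tlv_desc descriptor and
> the helpers below are invented for illustration (they are not the real libzfs
> or vdev interfaces), they reuse raidz_asize() from the sketch above, and they
> deliberately ignore metadata and indirect-block overhead:
> 
> struct tlv_desc {                /* one top-level vdev (made-up descriptor) */
>         int is_raidz;            /* raidz vs. mirror/plain */
>         uint64_t ndisks;         /* total disks in the vdev */
>         uint64_t nparity;        /* 1, 2, or 3 for raidz */
>         uint64_t ashift;         /* log2(sector size) */
> };
> 
> /*
>  * Estimate what `used` will report for the zvol's data blocks on this
>  * top-level vdev: the raw raidz allocation scaled down by the same
>  * 128K-based deflation that the accounting applies.
>  */
> static uint64_t
> reservation_for_tlv(uint64_t volsize, uint64_t volblocksize,
>     const struct tlv_desc *t)
> {
>         uint64_t nblocks = (volsize + volblocksize - 1) / volblocksize;
>         uint64_t per_block = volblocksize;
> 
>         if (t->is_raidz) {
>                 uint64_t asize = raidz_asize(volblocksize, t->ashift,
>                     t->ndisks, t->nparity);
>                 uint64_t asize128k = raidz_asize(128 * 1024, t->ashift,
>                     t->ndisks, t->nparity);
>                 per_block = asize * (128 * 1024) / asize128k;
>         }
>         return (nblocks * per_block);
> }
> 
> static uint64_t
> volsize_to_reservation_worst(uint64_t volsize, uint64_t volblocksize,
>     const struct tlv_desc *tlvs, int ntlvs)
> {
>         uint64_t worst = 0;
> 
>         for (int i = 0; i < ntlvs; i++) {
>                 uint64_t r = reservation_for_tlv(volsize, volblocksize,
>                     &tlvs[i]);
>                 if (r > worst)
>                         worst = r;
>         }
>         return (worst);
> }
> 
> For the 8-wide raidz2, ashift=12, volblocksize=8k case above this comes out
> to roughly 21.3G for a 10G volume, which is in the right neighborhood of the
> observed 21.4G before metadata.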

Long ago, some on my team had a big discussion on how to approach this.
It gets much more complicated because you can legitimately build a pool with
physical block size = 512 for some top-level vdevs and physical block size = 4k
for other top-level vdevs. After dancing around the maypole for a while, we
solved it in the custom control plane we built. It is not clear to me that
retrofitting something onto the generic zfs command will be intuitive enough
to cover the myriad cases.

> 
> The replication case could lead to a failed receive due to quota problems.  
> Both scenarios would break the way that refreservation is automatically 
> changed when volsize changes.  It seems this is already a problem in the face 
> of changing the value of copies.
> 
> One idea I had was to store a property that could take values "auto", 
> "mirror", "raidz-N", "raidz2-N", or "raidz3-N".  N specifies the total number 
> of disks in a raidz* vdev.  When volsize changes, this new property would be 
> consulted to determine which algorithm to use when changing refreservation.  
> In an environment where refreservations and replication between dissimilar 
> pools are important, the admin could choose the worst-case pool layout to 
> plan for.  Changing to a different refreservation algorithm would recalculate 
> refreservation using that algorithm.  We'd also need to work out what to do 
> about volumes that were sized with "auto" before a less space-efficient vdev
> was added.
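> 
> To make the value syntax concrete (a strawman only -- no such property exists
> today, and the name and format here are invented):
> 
> #include <stdio.h>
> #include <string.h>
> 
> /*
>  * Parse the proposed values: "auto", "mirror", "raidz-N", "raidz2-N",
>  * "raidz3-N".  Fills in nparity/ndisks and returns 0 on success, -1 on a
>  * bad value.  A real version would also check that ndisks > nparity.
>  */
> static int
> parse_resv_layout(const char *val, int *nparity, int *ndisks)
> {
>         if (strcmp(val, "auto") == 0 || strcmp(val, "mirror") == 0) {
>                 *nparity = 0;
>                 *ndisks = 0;             /* no raidz math needed */
>                 return (0);
>         }
>         if (sscanf(val, "raidz3-%d", ndisks) == 1) {
>                 *nparity = 3;
>                 return (0);
>         }
>         if (sscanf(val, "raidz2-%d", ndisks) == 1) {
>                 *nparity = 2;
>                 return (0);
>         }
>         if (sscanf(val, "raidz-%d", ndisks) == 1) {
>                 *nparity = 1;
>                 return (0);
>         }
>         return (-1);
> }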

It is more complicated for filesystems, too. Volumes are just lucky that all
blocks are the same size. If you write a 1.5k file, it will consume 24k of
allocated space plus metadata overhead.

> 
> I'm happy to chat on slack as well, if that's easier.

It is; start a thread.
 -- richard

> 
> Regards,
> Mike
