On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin <seb...@gmail.com> wrote:
>
> Hi!
> I've read the explanation about how much space is available using disks
> with different sizes [1]. I understand the rules, but I see a
> contradiction with the definition of RAID-1 in btrfs:
>
> «A form of RAID which stores two complete copies of each piece of data.
> Each copy is stored on a different device. btrfs requires a minimum of
> two devices to use RAID-1. This is the default for btrfs's metadata on
> more than one device.»
>
> So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.
From the btrfs perspective, this is a 9G file system, with raid1 metadata
and data block groups. The "raidness" happens at the block group level; it
is not at the device level like mdadm raid.

Deep dive: block groups are a logical range of bytes (variable size,
typically 1G). Where, and on what drive, a file extent actually exists is a
function of the block group to chunk mapping. i.e. a 1G data block group
using the raid1 profile physically exists as two 1G chunks, one on each of
two devices. What this means is that internally Btrfs sees everything as
just one copy in a virtual address space, and it's a function of the chunk
tree and the allocator to handle the details of exactly where data is
located physically and how it's replicated. It's normal to not totally grok
this, it's pretty esoteric, but if there's one complicated thing to try to
get about Btrfs, it's this. Because once you get it, all the other
unique/unusual/confusing things start to make sense.

Because the "pool" is 9G, and each 1G of data results in two 1G "mirror"
chunks, each written to a different drive, writes consume double the space:
two copies for raid1. The 'btrfs filesystem usage' command reveals this
reality, whereas 'df' kinda lies to try and make it behave more like what
we've come to expect from a more conventional raid1 implementation. This
lie works OK for an even number of same-size devices. It starts to fall
apart [1] with an odd number of drives and odd-sized devices. So you're
likely to run up against some still remaining issues in 'df' reporting in
this example.

https://carfax.org.uk/btrfs-usage/
Set three disks. On the right side, use the raid1 preset. Go down to Device
sizes and enter 4000, 3000, 2000, and it'll show you what happens.

> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB of copy is written on 3 GB disk, it leaves 0 GB free.

It's more complicated than that, because first it'll be broken up into
three 1GB block groups (possibly more, and smaller, block groups), and then
the allocator tries to maintain equal free space. That means it'll tend to
initially write to the biggest and 2nd biggest drives, but it won't fill
either of them up. It'll start writing to the smallest device once its free
space exceeds what's left free on the middle device. And yep, it can split
up chunks like this, sorta like Tetris.
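If it helps, here's a rough back-of-the-envelope model of that behavior in
Python. It's a toy sketch, not the real allocator code: the function name
simulate_raid1 and the chunk sizes are just made up for illustration, and
the only rule it implements is "put each chunk pair on the two devices with
the most unallocated space." It shows both the write pattern described
above and how the usable total approaches the (4+3+2)/2 = 4.5G figure from
the calculator as the allocation granularity gets finer.

# Toy model of btrfs raid1 chunk allocation (illustration only, not the
# real allocator): every data block group becomes a pair of equal-size
# chunks, and each chunk of the pair goes to one of the two devices with
# the most unallocated space.

def simulate_raid1(device_sizes_gb, chunk_gb=1):
    """Return (usable data in GB, per-device allocation in GB)."""
    free = list(device_sizes_gb)
    used = [0.0] * len(free)
    data = 0.0
    while True:
        # Pick the two devices with the most unallocated space.
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk_gb:        # can't place both copies -> stop
            break
        for dev in (a, b):            # one chunk of the pair on each device
            free[dev] -= chunk_gb
            used[dev] += chunk_gb
        data += chunk_gb              # one chunk's worth of usable data
    return data, used

if __name__ == "__main__":
    disks = [4, 3, 2]                 # GB, as in the example above
    for chunk in (1, 0.125):
        data, used = simulate_raid1(disks, chunk)
        print(f"chunk={chunk}G: ~{data}G of data fits, per-device use {used}")
    # With 1G chunks the toy stores ~4G of data; with finer granularity it
    # approaches the 4.5G that the carfax calculator reports.

Again, just an illustration; the real allocator works in (up to) 1G data
chunks and has other constraints, so don't read exact numbers into it.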
The example size of 9G is perhaps not a great example of real world
allocation for btrfs raid1; I'd bump that to T :) 9G is even below the
threshold of USB sticks you can buy off the shelf these days.

> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
>
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be
> mirrored.
>
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB. Surely,
> I'm missing or mistaking something.

Block groups and chunks. There's lots of reused jargon in btrfs that sounds
familiar but isn't the same as in mdadm or LVM; they're just reused terms.
Another example: raid1 and raid10 on btrfs don't work like you're used to
with mdadm and LVM, i.e. raid10 on btrfs is not a "stripe of mirrored
drives", it is "striped and mirrored block groups".

man mkfs.btrfs has quite concise and important information about such
things, and of course questions are welcome.

So it's worth knowing a bit about how it works differently, so you can
properly assess (a) whether it fits your use case and meets your
expectations, and (b) how to maintain and manage it, in particular disaster
recovery. Because that too is different.

[1] https://github.com/kdave/btrfs-progs/issues/277

--
Chris Murphy