Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
>
> In experimenting I ended up with some questions...
>
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?
> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard
for metadata in the single-device case, where it was made the default.
The exception is SSDs, which default to single-mode metadata on a
single-device filesystem, because the FTL voids any guarantees on
location anyway, and because firmware such as sandforce compresses and
dedups, in which case the hardware/firmware is subverting btrfs' efforts
to do dup. In the single-device case, two copies of data was considered
simply not worth the cost, due both to doubling the size (especially on
SSD where size is money!) and to the speed penalties on spinning rust
due to seeks between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two
different devices, was considered enough superior to make that the
default, since that provided device-loss resiliency for the
all-important metadata, thus enabling recovery of at least /some/ files
even with a device missing (single-mode data where the file's extents
all happened to be on available devices, plus of course raid1, etc,
data). Further, dup-mode metadata on multi-device was considered a
mistake it was better not to even have available as an option, since
loss of a single device would likely kill the filesystem, which made dup
mode little better than single mode, and single mode at least avoids the
doubled size cost. On spinning rust there'd again be the seek penalty,
to little benefit since dup mode provides no guarantees in case of
device loss.

So multi-device defaults to raid1 metadata for safety, but single mode
metadata remains an option (along with raid0) if you really /don't/ care
about losing everything due to loss of a single device.
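As a rough illustration, the defaults and restrictions described so far
can be written down as a small Python sketch. It models only what this
post says about mkfs-time behaviour as of late 2014. The function names
and the is_ssd flag are hypothetical, and this is not how mkfs.btrfs is
actually implemented.

```python
def default_metadata_profile(num_devices: int, is_ssd: bool = False) -> str:
    """Default metadata profile at mkfs time, as described in this post."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    if num_devices == 1:
        # Single device: dup on spinning rust, single on SSD, since the
        # FTL and deduplicating firmware can defeat dup anyway.
        return "single" if is_ssd else "dup"
    # Two or more devices: one copy on each of two different devices.
    return "raid1"


def dup_allowed_at_mkfs(kind: str, num_devices: int, mixed: bool = False) -> bool:
    """Whether dup can be requested at mkfs time, per the restrictions above."""
    if kind not in ("data", "metadata"):
        raise ValueError("kind must be 'data' or 'metadata'")
    if num_devices != 1:
        return False  # dup is refused on multi-device at creation
    # On one device, metadata dup is always available; data dup only
    # becomes available when mixed block groups are in use.
    return kind == "metadata" or mixed
```

For example, default_metadata_profile(1) gives "dup", while
dup_allowed_at_mkfs("data", 1) is False unless mixed=True is passed.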
Single-device simply makes dup-mode available (and the default) for
metadata, as a poor-man's substitute for the safety of raid1.
Single-device metadata is the only case where that
poor-man's-raid1-substitute is worth the (considered extreme) cost. The
option isn't even available on multi-device, as it'd be a near-certain
mistake, certainly at the mkfs level. And dup mode isn't ordinarily
available for data even on single-device, because it's considered not
worth the cost.

As for dup-mode working after device-add, that's simply a necessary bit
in order for device add to work from a default-dup-mode single-device at
all. And it's only the existing metadata chunks on the original device
that will be dup-mode. Once a second device is added, additional
metadata chunks will be written in raid1 mode, forcing the two chunk
copies to different devices since there are multiple devices available
to allow that.

The clear intent and recommendation is to do a rebalance ASAP after a
device add, to spread usage to the new device as appropriate. And of
course that rebalance will use the new raid1 metadata defaults, unless
told otherwise, and I don't believe dup mode is available to tell it
otherwise there, either.

What all that original reasoning fails to account for, however, is the
btrfs data/metadata checksumming and integrity features and the very
high value some users (including me) place on them, a value the original
btrfs mode designers obviously considered extreme. A multi-device
dup-mode-metadata choice at mkfs is arguably still a mistake: it has the
cost of raid1 metadata without the benefit, and near the risk of single
metadata but at double the size. But dup-mode data combined with btrfs
checksumming and data integrity features on a single device has strong
data integrity benefits that some would definitely consider worth it,
even at the additional cost in speed on spinning rust due to seeking,
and in size on expensive SSDs.
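To make the integrity argument concrete, here is a minimal Python sketch
of what a second copy plus a checksum buys on a read. It is a toy model
under stated assumptions, not btrfs code: the read_verified name is
hypothetical, and zlib.crc32 stands in for the CRC32C that btrfs
actually uses.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum; btrfs uses CRC32C, plain CRC32 is an assumption here.
    return zlib.crc32(block)


def read_verified(copies: list, expected: int):
    """Return (index, block) for the first copy whose checksum matches."""
    for index, block in enumerate(copies):
        if checksum(block) == expected:
            return index, block
    # With a single copy, detection is all you get; there is nothing to
    # fall back to once that copy fails verification.
    raise IOError("all copies failed checksum verification")


good = b"important data"
stored_csum = checksum(good)
corrupted = b"importent data"  # simulated bit rot in the first copy

# Two copies (dup or raid1): the bad copy is skipped, the good one returned.
index, block = read_verified([corrupted, good], stored_csum)
```

With two copies the read falls through to the second one. With a single
corrupted copy the same call raises an error, which is the difference
between detecting a problem and being able to recover from it.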

Meanwhile, mixed-bg-mode was an after-thought, added much later (after
my own btrfs journey began) in order to make working with small
filesystems reasonable. Before mixed-bg-mode, people attempting to use
btrfs on sub-GiB devices often found they couldn't use all available
space (often 25-50% wasted!) as the separate data/metadata chunk
allocation was simply too large grained to properly deal with the small
sizes involved.

And small filesystems really _was_ mixed-mode's _entire_ purpose. That
it could additionally be used to allow dup-data, by specifying
mixed-bg-mode even on > 1 GiB filesystems where it wasn't the default,
was *ENTIRELY* an accident, not even considered until a user figured it
out, as confirmed by, I believe, Chris Mason when directly asked at some
point.

But now that mixed-mode is there and can be used to enable dup-mode data
too, for people that want it, and now that we know for sure such people
exist because we see mixed-bg mode being offered as a way to get exactly
that, there's little reason to remove the accidental feature. =:^)

Meanwhile, now that demand is known to exist for dup-mode-data, I think
it probable that at some point code for that, without having to force
mixed-bg-mode to get it, will be made available and tested, much as
other features have been. But there are way more features left to
implement than time to implement them, at least with the current btrfs
developer pool. And given that mixed-bg-mode is available to deliver
dup-mode-data for those /really/ intent on having it, the priority of
coding and testing stand-alone-dup-mode-data is going to be relatively
low, so I'd suggest not expecting it any time soon -- maybe five years
out. I don't see it much sooner unless a dev (or dev sponsor) really
gets that itch and decides to make scratching it a priority.

> (3) why can I make a raid5 out of two devices?
> (4) Same question for raid6 but with three drives instead of the
> mandated four.
>
> (5) If I can make a RAID5 or RAID6 device with one missing element, why
> can't I make a RAID1 out of one drive, e.g. with one missing element?

AFAIK, the ability to mkfs raid56 modes with a missing device is a bug.
I'm not sure if it was known or not, tho I know there has been some
change in minimum number of devices over time and it might have gotten
caught in that. I'd /guess/ that since raid56 isn't yet fully supported,
if the bug /was/ known, it had relatively low priority on the fix-list
compared to various other bugs with currently supported features.

If it is a bug as I believe it to be, that nullifies most of the
secondary questions you had...

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

Currently btrfs raid1 is defined very specifically as exactly two
copies/mirrors, regardless of whether there are two or two hundred
devices in the filesystem. More devices give you more room; the number
of copies remains two. This is covered in the wiki.

The feature known as N-way-mirroring is however on the roadmap -- for
just after raid56, since the planned implementation depends on some of
the same code.

This is actually a bit of a personal sore spot for me, since it has long
been my most-wished-for feature. When I first investigated btrfs, now
years ago, I was running quad-way-mdraid-1, and was very disappointed to
see that btrfs only offered paired-raid1, since I wanted (and still
want) very much to be able to fall back more than once to additional
copies, should the checksum fail on the first N-1 copies. And back then
(kernel 3.5 era IIRC) it was already roadmapped immediately after raid56
modes, which was to be introduced in another kernel cycle or two, so I
figured perhaps 3-4 cycles, maybe a year (~5 cycles) for
N-way-mirroring.
But it seems as far out now as then, if not further, since we know how
long raid56 is taking to complete, and two kernel cycles after that for
N-way-mirroring seems wildly optimistic, now. Maybe a year after... if
it's not too complicated.

But it's definitely on the roadmap, next thing to implement in fact, but
it's still right after raid56, and raid56 has of course been coming
right up since kernel 3.6 or whatever, at least.

But I'm not a dev so I can't help in that regard, tho I do use btrfs in
pair-way raid1 mode now, and try to help on the list where my knowledge
as list regular and sysadmin using btrfs allows it. Someday that feature
will be available to play with... but that doesn't mean I can't enjoy
btrfs for what it has right now, nor does it mean I can't help others
with btrfs while I wait...

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman