Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:

> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
>
>> [snipped]
>
> Thanks for this information. As far as I can see, btrfs-tools v4.1.2
> is now in the experimental Debian repo (but you anyway suggest at
> least 4.2.2, which was released in master git just 10 days ago).
> Kernel image 3.18 is still not there, perhaps because Debian jessie
> was frozen before it was released (2014-12-07).
For userspace, as long as it supports the features you need at runtime 
(where it generally just has to know how to make the call to the kernel, 
which does the actual work), and you're not running into anything really 
hairy that you're trying to offline-recover, which is where the latest 
userspace code becomes critical... running a userspace series behind, or 
even more (as long as it's not /too/ far), isn't all /that/ critical a 
problem.

It generally becomes a problem in one of three ways:

1) You have a bad filesystem and want the best chance at fixing it, in 
which case you really want the latest code, including the absolute latest 
fixups for the most recently discovered problems.

2) You want or need a new feature that simply isn't supported in your old 
userspace.

3) The userspace gets so old that the output from its diagnostic commands 
no longer easily compares with that of current tools, making it hard for 
people on-list to compare the output in your posts to the output they get.

As a very general rule, at least try to keep the userspace version 
comparable to the kernel version you are running. Since the userspace 
version numbering syncs to kernelspace version numbering, and userspace of 
a particular version is normally released shortly after the similarly 
numbered kernel series, with a couple of minor updates before the next 
kernel-synced release, keeping userspace at or above the kernel version 
means you're at least running the userspace release that was made with 
that kernel series in mind. Then, as long as you don't get too far behind 
on kernel version, you'll remain at least /somewhat/ current on userspace 
as well, since you'll be upgrading to near the same userspace (at least) 
when you upgrade the kernel.

Using that loose guideline, since you're aiming for the 3.18 stable 
kernel, you should be running at least a 3.18 btrfs-progs as well.
In that context, btrfs-progs 4.1.2 should be fine, as long as you're not 
trying to fix any problems that a newer version fixed. My recommendation 
of the latest 4.2.2 was in the "fixing problems" context, in which case, 
yes, getting your hands on 4.2.2, even if it means building from sources 
to do so, could be critical, depending of course on the problem you're 
trying to fix. But otherwise 4.1.2, or even back to the last 3.18.whatever 
release since that's the kernel version you're targeting, should be fine.

Just be sure that whenever you do upgrade later, you avoid the known-bad 
mkfs.btrfs in 4.2.0 and/or 4.2.1 -- if you're doing the btrfs-progs 4.2 
series, be sure you get 4.2.2 or later.

As for finding a current 3.18 series kernel released for Debian, I'm not a 
Debian user so my knowledge of the ecosystem around it is limited, but 
I've been very much under the impression that there are various optional 
repos available that you can choose to include and update from as well, 
and I'm quite sure, based on previous discussions with others, that 
there's a well recognized and fairly commonly enabled repo that includes 
Debian kernel updates through the current release, or close to it.

Of course you could also simply run a mainstream Linus kernel and build it 
yourself. It's not too horribly hard to do either, as there are all sorts 
of places with instructions for doing so out there, and back when I 
switched from MS to freedomware Linux in late 2001, I learned the skill, 
at least at the reasonably basic level of mostly taking a working config 
from my distro's kernel and using it as a basis for my mainstream kernel 
config, within about two months of switching. Tho of course just because 
you can doesn't mean you want to, and for many, finding their distro's 
experimental/current kernel repos and simply installing the packages from 
there will be far simpler.
But regardless of the method used, finding or building and keeping current 
with your own copy of at least the latest couple of LTS releases shouldn't 
be /horribly/ difficult. While I've not used them as actual package 
resources in years, I do still know a couple of rpm-based package 
resources from my time back on Mandrake (and still check them in contexts 
like this for others, or to quickly see what files a package I don't have 
installed on Gentoo might include, etc), and would point you at them if 
Debian were an rpm-based distro, but of course it's not, so they won't do 
any good. But I'd guess a google might. =:^)

> If I may ask:
>
> Provided that btrfs allowed to mount a volume in read-only mode -- does
> it mean that all data blocks are present (e.g. it has assured that all
> files / directories can be read)?

I'm not /absolutely/ sure I understand your question, here. But assuming 
it's what I believe it is... here's an answer in typical Duncan fashion, 
answering the question... and rather more! =:^)

In this particular scenario, yes, everything should still be accessible, 
as at least one copy of every raid1 chunk should exist on a still detected 
and included device. This is because of the balance after the loss of the 
first device, making sure there were two copies of each chunk on the 
remaining devices, before the loss of the second device.

But because btrfs device delete missing didn't work, you couldn't remove 
that first device, even tho you now had two copies of each chunk on 
existing devices. So when another device dropped, you had two missing 
devices, but because of the balance in between, you still had at least one 
copy of all chunks.

The reason it's not letting you mount read-write is that btrfs now sees 
two devices missing on a raid1: the one that you actually replaced but 
couldn't device delete, and the new missing one that it didn't detect this 
time.
To btrfs' rather simple way of thinking about it, that means anything with 
one of the only two raid1 copies on each of the two missing devices is now 
entirely gone, and to avoid making changes that would complicate things 
and prevent the return of at least one of those missing devices, it won't 
let you mount writable, even in degraded mode. It doesn't understand that 
there's actually still at least one copy of everything available, as it 
simply sees the two missing devices and gives up without actually 
checking.

And in the situation where btrfs' fears were correct, where chunks existed 
with each of the two copies on one of the now-missing devices, no, not 
everything /would/ be accessible, and btrfs forcing read-only mounting is 
its way of not letting you make the problem even worse, forcing you to 
copy the data you can actually get to off to somewhere else, while you can 
still get to it in read-only mode, at least.

Also, of course, forcing the filesystem read-only when there are two 
devices missing at least in theory preserves a state where a device might 
be able to return, allowing repair of the filesystem, while allowing 
writable mounting could prevent a returning device from healing the 
filesystem.

So in this particular scenario, yes, all your data should be there, 
intact. However, a forced read-only mount normally indicates a serious 
issue, and in other scenarios it could well indicate that some of the data 
is indeed *NOT* accessible.

Which is where AJ's patch comes in. It teaches btrfs to actually check 
each chunk. Once it sees that there's actually at least one copy of each 
chunk available, it'll allow mounting degraded, writable, again, so you 
can fix the problem.
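The check the patch adds can be sketched roughly like this -- a toy model 
in Python, with made-up data structures, not the actual kernel logic:

```python
# Toy model of the per-chunk availability check (the data structures are
# invented for illustration; the real code is kernel C and more involved).
def can_mount_degraded_writable(chunks, present_devices):
    """chunks: one set of device IDs per chunk, listing where its copies
    live.  A writable degraded mount is safe only if every chunk still
    has at least one copy on a device we can actually see."""
    return all(copies & present_devices for copies in chunks)

# raid1 across devices 1 and 2, device 2 now missing, but a balance
# ensured every chunk has a copy on device 1:
print(can_mount_degraded_writable([{1, 2}, {1, 2}, {1}], {1}))   # True

# If some chunk's only copies sit on missing devices, stay read-only:
print(can_mount_degraded_writable([{1, 2}, {2}], {1}))           # False
```

The old behavior amounts to giving up as soon as the missing-device count 
exceeds what the profile tolerates; the patch effectively replaces that 
with a per-chunk test like the one above.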
(Tho the more direct scenario the patch addresses is a bit different: loss 
of one device of a two-device raid1, in which case mounting degraded 
writable will force new chunks to be written in single mode, because 
there's not a second device to write to, so writing raid1 is no longer 
possible. So far, so good. But then on an unmount and attempt to mount 
again, btrfs sees single-mode chunks on a two-device btrfs, and knows that 
single mode normally won't allow a missing device, so it forces read-only, 
thus blocking adding a new device and rebalancing all the single chunks 
back to raid1.

But in actuality, the only single-mode chunks there are the ones written 
when the second device wasn't available, so they HAD to be written to the 
available device, and it's not POSSIBLE for any to be on the missing 
device. Again, the patch teaches btrfs to actually look at what's there 
and see that it can deal with it, thus allowing writable mounting, instead 
of jumping to conclusions and giving up as soon as it sees a situation 
that /could/, in a different situation, mean entirely missing chunks with 
no available copies on remaining devices.)

Again, these patches are in newer kernel versions, so there (assuming no 
further bugs) they "just work". On older kernels, however, you either have 
to cherry-pick the patches yourself, or manually avoid or work around the 
problem they fix. This is why we typically stress new versions so much -- 
they really /do/ fix active bugs and make problems /much/ easier to deal 
with. =:^)

> Do you have any ideas why "btrfs balance" has pulled all data to two
> drives (and not balanced between three)?

Hugo did much better answering that than I would have initially done, as 
most of my btrfs are raid1 here, but they're all exactly two-device, with 
the two devices exactly the same size, so I'm not used to thinking in 
terms of different sizes and didn't actually notice the situation, thus 
leaving me clueless until Hugo pointed it out.
But he's right. Here's my much more detailed way of saying the same thing, 
now that he reminded me of why that would be the deciding factor here.

Given that (1) your devices are different sizes, that (2) btrfs raid1 
means exactly two copies, not one per device, and that (3) the btrfs 
chunk-allocator allocates chunks from the device with the most free space 
left, subject to the restriction that both copies of a raid1 chunk can't 
be allocated to the same device...

A rebalance of raid1 chunks would indeed start filling the two biggest 
devices first, until the space available on the smaller of the two biggest 
devices (thus the second largest) was equal to the space available on the 
third largest device, at which point it would continue allocating from the 
largest for one copy (until it too reached equivalent space available), 
while alternating between the others for the second copy.

Given that the amount of data you had fit a copy each on the two largest 
devices before the space available on either one dwindled to that 
available on the third largest device, only the two largest devices 
actually got chunk allocations, leaving the third device, still with less 
total space than the other two each had remaining available, entirely 
empty.

> Does btrfs have the following optimization for mirrored data: if a
> drive is non-rotational, then prefer reads from it? Or does it simply
> schedule the read to the drive that performs faster (irrespective of
> rotational status)?

Such optimizations have in general not yet been done in btrfs -- not even 
scheduling to the faster drive. In fact, the lack of such optimizations is 
arguably the biggest "objective" proof that the btrfs devs themselves 
don't yet consider btrfs truly stable.
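Backing up for a moment to the allocation behavior Hugo pointed out: the 
greedy rule described above (most free space wins, two copies never on the 
same device) can be modeled in a few lines. This is an invented toy, not 
btrfs code, with made-up device sizes:

```python
def allocate_raid1(free, n_chunks, chunk=1):
    """free: dict of device -> free space.  For each raid1 chunk, place
    the two copies on the two devices with the most free space (the
    allocator never puts both copies on one device)."""
    used = {dev: 0 for dev in free}
    for _ in range(n_chunks):
        # Two distinct devices with the most free space left.
        first, second = sorted(free, key=free.get, reverse=True)[:2]
        for dev in (first, second):
            free[dev] -= chunk
            used[dev] += chunk
    return used

# Three devices of different sizes: until the second-largest drains down
# to the third-largest, every chunk lands on the two biggest devices and
# the smallest stays entirely empty.
print(allocate_raid1({'sda': 100, 'sdb': 80, 'sdc': 30}, 40))
# -> {'sda': 40, 'sdb': 40, 'sdc': 0}
```

Run with enough chunks that sdb drains below sdc's free space and the 
third device finally starts receiving copies -- which is exactly the 
crossover point described above.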
As any good dev knows, there's a real danger to "premature optimization", 
with that danger appearing in one or both of two forms:

(a) We've now severely limited the alternative code paths we can take, 
because implementing things differently would force throwing away all that 
optimization work we did, as it won't work with what would otherwise be 
the better alternative, and

(b) We're now throwing away all that optimization work we did, making it a 
waste, because the previous implementation didn't work and the new one 
does, but it doesn't work with the current optimization code, so that work 
must now be redone as well.

Thus, good devs tend to leave moderate to complex optimization code until 
they know the implementation is stable and won't be changing out from 
under the optimization. To do differently is "premature optimization", and 
devs tend to be well aware of the problem, often because of the number of 
times they did it themselves earlier in their careers.

It follows that looking at whether devs (assuming you consider them good 
enough to be aware of the dangers of premature optimization, which, if 
they're writing the code that runs your filesystem, you'd better HOPE 
they're at least that good, or you and your data are in serious trouble!) 
have actually /done/ that sort of optimization ends up being a pretty good 
indicator of whether they consider the code stable enough to avoid those 
dangers, or not. In this case, definitely not, since these sorts of 
optimizations in general remain to be done.

Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple to 
code up and pretty simple to arrange tests for that exercise either one 
copy or the other, but not both, or that are well balanced across both. 
However, it's pretty poor in terms of optimized real-world read-
scheduling.

What it does is simply this. Remember, btrfs raid1 is specifically two 
copies.
It chooses which copy of the two will be read very simply, based on the 
PID making the request. Odd PIDs get assigned one copy, even PIDs the 
other. As I said, simple to code, great for ensuring testing of one copy 
or the other or both, but not really optimized at all for real-world 
usage. If your workload happens to be a bunch of all-odd or all-even PIDs, 
well, enjoy your testing-grade read-scheduler, bottlenecking everything 
reading one copy while the other sits entirely idle.

(Of course on fast SSDs with their zero seek time, which is what I'm using 
for my own btrfs, that's not the issue it'd be on spinning rust. I'm still 
using my former reiserfs standard for spinning rust, which I use for 
backups and media files. But normal operations are on btrfs on ssd, and 
despite btrfs' lack of optimization, on ssd it's fast /enough/ for my 
usage, and I particularly like the data integrity features of btrfs raid1 
mode, so...)

> No, it was my particular decision to use btrfs, for various reasons.
> First of all, I am using raid1 on all data. Second, I benefit from
> transparent compression. Third, I need CRC consistency: some of the
> drives (like /dev/sdd in my case) seem to fail, and once I had a buggy
> DIMM, so btrfs helps me not to lose the data "silently". Anyway, it's
> much better than md-raid.

The fact that, despite it being available, mdraid couldn't be configured 
to runtime-verify integrity using either parity or redundancy, nor 
checksums (which weren't available), was a strong disappointment for me. 
To me, the fact that btrfs /does/ do runtime checksumming on write and 
data integrity checking on read, and in raid1/10 mode will actually fall 
back to the second copy if the first one fails checksum verification, is 
one of its best features, and why I use btrfs raid1 (or, on a couple of 
single-device btrfs, mixed-bg mode dup).
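For the curious, the odd/even-PID routing described above amounts to 
something like this -- a toy sketch, not the kernel implementation:

```python
import os

def pick_mirror(pid=None):
    """Route a read to one of two raid1 copies by PID parity -- a toy
    model of the simple scheduler described above, not btrfs code."""
    if pid is None:
        pid = os.getpid()
    return pid % 2          # 0 -> first copy, 1 -> second copy

# A workload whose PIDs all share parity hammers one mirror exclusively
# while the other sits idle:
print([pick_mirror(p) for p in (1000, 1002, 1004)])   # [0, 0, 0]
print([pick_mirror(p) for p in (1001, 1003, 1005)])   # [1, 1, 1]
```

Easy to code and handy for deliberately testing one copy or the other, 
but, as noted, nothing like a real load-balancing read scheduler.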
=:^)

That's also why my personal most hotly anticipated feature is N-way-
mirroring, with 3-way being my ideal balance, since that will give me a 
fallback to the fallback if both the first read copy and the first 
fallback copy fail verification. Four-way would be too much, but I just 
don't rest quite as easily as I otherwise could, because I know that if 
both the primary-read copy and the fallback happen to be bad, same logical 
place at the same time, there's no third copy to fall back on!

It seems as much of a shame not to have that on btrfs, with its data 
integrity, as it did to have mdraid with N-way-mirroring but no runtime 
data integrity. But at least btrfs does have N-way-mirroring on the 
roadmap, actually for after raid56, which is now done, so N-way-mirroring 
should be coming up rather soon (even if on btrfs, "soon" is relative), 
while AFAIK mdraid has no plans to implement runtime data integrity 
checking.

> And dynamic assignment is not a problem since udev was introduced (so
> one can add extra persistent symlinks):
>
> https://wiki.debian.org/Persistent_disk_names

FWIW, I actually use labels as my own form of "human-readable" UUID here. 
I came up with the scheme back when I was on reiserfs, with its 
15-character label limit, so that's what mine are. Using this scheme, I 
encode the purpose of the filesystem (root/home/media/whatever), the size 
and brand of the media, the sequence number of the media (since I often 
have more than one of the same brand and size), the machine the media is 
targeted at, the date I did the formatting, and the sequence number of the 
partition (root-working, root-backup1, root-backup2, etc).

hm0238gcnx+35l0

home, on a 238 gig corsair neutron, #x (the filesystem is multi-device, 
across #0 and #1), targeted at + (the workstation), originally partitioned 
in (201)3, on May (5) 21 (l), working copy (0).

I use GPT partitioning, which takes partition labels (aka names) as well.
The two partitions hosting that filesystem are on identically partitioned 
corsair neutrons, 256 GB = 238 GiB. The gpt labels on those two partitions 
are identical to the above, except one has a 0 replacing the x, while the 
other has a 1, as they are my first and second media of that size and 
brand:

hm0238gcn0+35l0
hm0238gcn1+35l0

The primary backup of home, on a different pair of partitions on the same 
physical devices, is labeled identically, except the partition number is 
one:

hm0238gcnx+35l1

... and its partitions:

hm0238gcn0+35l1
hm0238gcn1+35l1

The secondary backup is on a reiserfs, on spinning rust:

hm0465gsg0+47f0

In that case the partition label and filesystem label are the same, since 
the partition and its filesystem correspond 1:1. It's home on the 465 GiB 
(aka 500 GB) seagate #0, targeted at the workstation, first formatted in 
(201)4, on July (7) 15 (f), first (0) copy there. (I could make it #3 
instead of #0, indicating second backup, but didn't, as I know that 
0465gsg0+ is the media and backup spinning-rust device for the 
workstation.)

Both my internal and USB-attached devices use the same labeling scheme: 
media identified by size, brand, media sequence number and what it's 
targeting; partition/filesystem identified by purpose, original partition/
format date, and partition sequence number. As I said, it's effectively 
human-readable GUID, my own scheme for my own devices.

And I use LABEL= in fstab as well, running gdisk -l to get a listing of 
partitions with their gpt labels when I need to associate the actual sdN 
mapping to specific partitions (if I don't already have the mapping from 
mount or whatever). Which makes it nice that btrfs fi show outputs the 
filesystem label as well. =:^)

The actual GUID is simply machine-readable but, to me, "noise" the human 
needn't deal with, as the label (of either the gpt partition or the 
filesystem it hosts) gives me *FAR* more useful information, while being 
entirely unique within my ID system.
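To make the field layout concrete, here's a little decoder sketch in 
Python. The field widths are my inference from the examples above, so 
treat it as illustrative, not authoritative (in particular, the digit-or-
letter date encoding beyond 9 is my reading of the 'l' = 21 and 'f' = 15 
examples):

```python
def _date_char(c):
    """Decode a one-char date field: '1'-'9' literally, then 'a'=10,
    'b'=11, ... (inferred from 'l'=21 for day 21 and 'f'=15 for day 15)."""
    return int(c) if c.isdigit() else ord(c) - ord('a') + 10

def decode_label(label):
    """Parse a 15-char label in the scheme described above.  Field widths
    are inferred from the worked examples, e.g. 'hm0238gcnx+35l0'."""
    assert len(label) == 15
    return {
        'purpose':   label[0:2],        # e.g. 'hm' = home
        'size_gib':  int(label[2:6]),   # e.g. 0238 = 238 GiB
        'brand':     label[6:9],        # e.g. 'gcn' = corsair neutron
        'media_seq': label[9],          # 0, 1, or 'x' for multi-device
        'machine':   label[10],         # '+' = the workstation
        'year':      2010 + int(label[11]),
        'month':     _date_char(label[12]),
        'day':       _date_char(label[13]),
        'copy':      int(label[14]),    # 0 = working, 1 = backup1, ...
    }

print(decode_label('hm0238gcnx+35l0'))
# home, 238 GiB corsair neutron, multi-device, workstation,
# 2013-05-21, working copy
```

Running it on the spinning-rust label hm0465gsg0+47f0 should likewise 
yield 465 GiB, seagate #0, 2014-07-15, copy 0.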
> If "btrfs device scan" is user-space, then I think doing some output is
> better than outputting nothing :) (perhaps with a "-v" flag). If it is
> kernel-space, then I agree that logging to dmesg is not very evident
> (from the perspective that the user should remember where to look),
> but I think it has value.

Well, btrfs is a userspace tool, but in this case btrfs device scan's use 
is purely to make a particular kernel call, which triggers the btrfs 
module to do a device rescan to update its own records, *not* for human 
consumption. -v to force output could work if it had been designed that 
way, but getting that output is precisely what btrfs filesystem show is 
for, printing for both mounted and unmounted filesystems unless told 
otherwise.

Put it this way. If neither your initr* nor some service started before 
whatever mounts local filesystems does a btrfs device scan, then 
attempting to mount a multi-device btrfs will fail, unless all its 
component devices have been fed in using device= options.

Why? Because mount takes exactly one device to mount. With traditional 
filesystems that's enough, since they consist of a single device, and with 
single-device btrfs it's enough as well. But with a multi-device btrfs, 
something has to supply the other devices to btrfs, along with the one 
that mount tells it about.

It is possible to list all those component devices in device= options, but 
those take /dev/sd* style device nodes, and those may change from boot to 
boot, so that's not very reliable. Which is where btrfs device scan comes 
in. It tells the btrfs module to do a general scan and map out internally 
which devices belong to which filesystems, after which a mount supplying 
just one of them can work, since this internal map, the generation or 
refresh of which is triggered by btrfs device scan, supplies the others.
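Conceptually, the mapping works something like this sketch. The names and 
structures here are invented for illustration; the real thing lives inside 
the kernel module, keyed by the filesystem UUID each device's superblock 
carries:

```python
# Toy model of the scan-built map: filesystem UUID -> member devices.
fs_map = {}

def scan(probed):
    """Record which filesystem each probed device belongs to (in reality
    this comes from reading each device's btrfs superblock)."""
    for dev, fsid in probed:
        fs_map.setdefault(fsid, set()).add(dev)

def mount_one(dev):
    """Mount names exactly one device; the map supplies the rest."""
    for fsid, members in fs_map.items():
        if dev in members:
            return fsid, sorted(members)
    raise OSError('%s not in any scanned filesystem' % dev)

scan([('/dev/sda', 'fs-A'), ('/dev/sdb', 'fs-A'), ('/dev/sdc', 'fs-B')])
print(mount_one('/dev/sdb'))   # ('fs-A', ['/dev/sda', '/dev/sdb'])
```

Without the scan step (or explicit device= options standing in for it), 
mount_one on a member of a multi-device filesystem has nothing to consult, 
which is exactly the failure described above.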
IOW, btrfs device scan needs no output, because all the userspace command 
does is call a kernel function, which triggers the mapping internal to the 
btrfs kernel module, so it can then handle mounts with just one of the 
possibly many devices handed to it from mount. Outputting that mapping is 
an entirely different function, with the userspace side of that being 
btrfs filesystem show, which calls a kernel function that generates output 
back to the btrfs userspace app, which then further formats it for output 
back to the user.

> Thanks. I have carefully read the changelog wiki page and found that:
>
> btrfs-progs 4.2.2:
> scrub: report status 'running' until all devices are finished

Thanks. As I said, I had seen the patch on the list, and /thought/ it was 
now in, but had lost track of specifically when it went in, or indeed 
/whether/ it had gone in. Now I know it's in 4.2.2, without having to 
actually go look it up in the git log again myself.

> An idea concerning balance is listed on the wiki page "Project ideas":
>
> balance: allow to run it in background (fork) and report status
> periodically

FWIW, it sort of does that today, except that btrfs bal start doesn't 
actually return to the command prompt. But again, what it actually does is 
call a kernel function to initiate the balance, and then it simply waits. 
On my relatively small btrfs on partitioned ssd, the return is often 
within a minute or two anyway, but on multi-TB spinning rust...

In any case, once the kernel function has triggered the balance, ctrl-C 
should, I believe, terminate the userspace side and get you back to the 
prompt without terminating the balance, as that continues on in kernel 
space. But it would still be useful to have balance start actually return 
quickly, instead of having to ctrl-C it.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html