On 16/10/2015 10:18, Duncan wrote:

> Dmitry Katsubo posted on Thu, 15 Oct 2015 16:10:13 +0200 as excerpted:
>
>> On 15 October 2015 at 02:48, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>>> [snipped]
>>
>> Thanks for this information. As far as I can see, btrfs-tools v4.1.2 is now in the experimental Debian repo (but you anyway suggest at least 4.2.2, which was released in master git just 10 days ago). Kernel image 3.18 is still not there, perhaps because Debian jessie was frozen before it was released (2014-12-07).
>
> For userspace, as long as it's supporting the features you need at runtime (where it generally simply has to know how to make the call to the kernel, which does the actual work), and you're not running into anything really hairy that you're trying to offline-recover, which is where the latest userspace code becomes critical...
>
> Running a userspace series behind, or even more (as long as it's not /too/ far), isn't all /that/ critical a problem.
>
> It generally becomes a problem in one of three ways: 1) You have a bad filesystem and want the best chance at fixing it, in which case you really want the latest code, including the absolute latest fixups for the most recently discovered possible problems. 2) You want/need a new feature that's simply not supported in your old userspace. 3) The userspace gets so old that the output from its diagnostics commands no longer easily compares with that of current tools, giving people on-list difficulties when trying to compare the output in your posts to the output they get.
>
> As a very general rule, at least try to keep the userspace version comparable to the kernel version you are running. Since the userspace version numbering syncs to kernelspace version numbering, and userspace of a particular version is normally released shortly after the similarly numbered kernel series, with a couple of minor updates before the next kernel-synced release, keeping userspace at or above the kernelspace version means you're at least running the userspace release that was made with that kernel series in mind.
>
> Then, as long as you don't get too far behind on kernel version, you should remain at least /somewhat/ current on userspace as well, since you'll be upgrading to near the same userspace (at least) when you upgrade the kernel.
>
> Using that loose guideline, since you're aiming for the 3.18 stable kernel, you should be running at least a 3.18 btrfs-progs as well.
>
> In that context, btrfs-progs 4.1.2 should be fine, as long as you're not trying to fix any problems that a newer version fixed. My recommendation of the latest 4.2.2 was in the "fixing problems" context, in which case, yes, getting your hands on 4.2.2, even if it means building from sources to do so, could be critical, depending of course on the problem you're trying to fix. But otherwise, 4.1.2, or even back to the last 3.18.whatever release since that's the kernel version you're targeting, should be fine.
>
> Just be sure that whenever you do upgrade later, you avoid the known-bad mkfs.btrfs in 4.2.0 and/or 4.2.1 -- if you go with the btrfs-progs 4.2 series, be sure you get 4.2.2 or later.
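Noted. Before I touch anything I will double-check what I actually have installed, something along these lines (the output on my box will of course differ):

  uname -r          # running kernel version
  btrfs --version   # btrfs-progs (userspace) version

And if I end up taking btrfs-progs from the 4.2 series, I will make sure it is 4.2.2 or later before running mkfs.btrfs.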
> As for finding a current 3.18 series kernel released for Debian, I'm not a Debian user so my knowledge of the ecosystem around it is limited, but I've been very much under the impression that there are various optional repos available that you can choose to include and update from as well, and I'm quite sure, based on previous discussions with others, that there's a well recognized and fairly commonly enabled repo that includes Debian kernel updates thru the current release, or close to it.
>
> Of course you could also simply run a mainstream Linus kernel and build it yourself, and it's not too horribly hard to do either, as there are all sorts of places with instructions for doing so out there. Back when I switched from MS to freedomware Linux in late 2001, I learned the skill, at least at the reasonably basic level of mostly taking a working config from my distro's kernel and using it as a basis for my mainstream kernel config as well, within about two months of switching.
>
> Tho of course just because you can doesn't mean you want to, and for many, finding their distro's experimental/current kernel repos and simply installing the packages from there will be far simpler.
>
> But regardless of the method used, finding or building, and keeping current with, your own copy of at least the latest couple of LTS releases shouldn't be /horribly/ difficult. While I've not used them as actual package resources in years, I do still know a couple of rpm-based package resources from my time back on Mandrake (and do still check them in contexts like this for others, or to quickly see what files a package I don't have installed on gentoo might include, etc), and would point you at them if Debian were an rpm-based distro, but of course it's not, so they won't do any good. But I'd guess a google might. =:^)
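>
> (If you do build it yourself, the basic sequence, starting from the distro's working config as mentioned above, is roughly the following -- treat it as a sketch, since package names and bootloader handling vary per distro:
>
>   # in an unpacked mainline kernel source tree
>   cp /boot/config-$(uname -r) .config   # start from the distro kernel's config
>   make olddefconfig                     # accept defaults for options new to this kernel
>   make -j$(nproc)                       # build kernel and modules
>   sudo make modules_install install     # install; the bootloader may still need updating
>
> ... and then reboot into the new kernel.)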
Thanks, Duncan. The information you give is of the greatest value to me. I have finally decided not to tempt fate: I will copy the data off, re-create the btrfs filesystem and copy the data back. That is a good exercise anyway.

>> If I may ask: provided that btrfs allowed a volume to be mounted in read-only mode – does that mean that all data blocks are present (i.e. has it assured that all files / directories can be read)?
>
> I'm not /absolutely/ sure I understand your question, here. But assuming it's what I believe it is... here's an answer in typical Duncan fashion, answering the question... and rather more! =:^)
>
> In this particular scenario, yes, everything should still be accessible, as at least one copy of every raid1 chunk should exist on a still detected and included device. This is because of the balance after the loss of the first device, making sure there were two copies of each chunk on the remaining devices, before the loss of the second device. But because btrfs device delete missing didn't work, you couldn't remove that first device, even tho you now had two copies of each chunk on existing devices. So when another device dropped, you had two missing devices, but because of the balance in between, you still had at least one copy of all chunks.
>
> The reason it's not letting you mount read-write is that btrfs now sees two devices missing on a raid1: the one that you actually replaced but couldn't device delete, and the new missing one that it didn't detect this time. To btrfs' rather simple way of thinking about it, that means anything with one of the only two raid1 copies on each of the two missing devices is now entirely gone, and to avoid making changes that would complicate things and prevent the return of at least one of those missing devices, it won't let you mount writable, even in degraded mode. It doesn't understand that there's actually still at least one copy of everything available, as it simply sees the two missing devices and gives up without actually checking.
>
> And in the situation where btrfs' fears were correct, where chunks existed with each of the two copies on one of the now missing devices, no, not everything /would/ be accessible, and btrfs forcing read-only mounting is its way of not letting you make the problem even worse, forcing you to copy the data you can actually get to off to somewhere else, while you can still get to it in read-only mode, at least. Also, of course, forcing the filesystem read-only when there are two devices missing at least in theory preserves a state where a device might be able to return, allowing repair of the filesystem, while allowing writes could prevent a returning device from healing the filesystem.
>
> So in this particular scenario, yes, all your data should be there, intact. However, a forced read-only mount normally indicates a serious issue, and in other scenarios it could well indicate that some of the data is now indeed *NOT* accessible.
>
> Which is where AJ's patch comes in. That teaches btrfs to actually check each chunk. Once it sees that there's actually at least one copy of each chunk available, it'll allow mounting degraded, writable, again, so you can fix the problem.
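>
> (Without that patch, the practical way out is the one you've already settled on: mount read-only degraded, copy everything off, recreate, copy back. Roughly, with the device names and paths below being nothing more than placeholders for your actual ones:
>
>   mount -o ro,degraded /dev/sdb /mnt               # read-only degraded mount, as btrfs forces anyway
>   cp -a /mnt/. /path/to/other/storage/             # get the data off while it's still readable
>   umount /mnt
>   mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc   # recreate raid1 on the known-good devices
>   mount /dev/sdb /mnt
>   cp -a /path/to/other/storage/. /mnt/             # and copy it back
>
> ... adjusted to your actual devices and mount points, of course.)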
> (Tho the more direct scenario that the patch addresses is a bit different: loss of one device of a two-device raid1, in which case mounting degraded writable will force new chunks to be written in single mode, because there's no second device to write to, so writing raid1 is no longer possible. So far, so good. But then on an unmount and an attempt to mount again, btrfs sees single mode chunks on a two-device btrfs, and knows that single mode normally won't allow a missing device, so it forces read-only, thus blocking adding a new device and rebalancing all the single chunks back to raid1. But in actuality, the only single mode chunks there are the ones written when the second device wasn't available, so they HAD to be written to the available device, and it's not POSSIBLE for any to be on the missing device. Again, the patch teaches btrfs to actually look at what's there and see that it can actually deal with it, thus allowing writable mounting, instead of jumping to conclusions and giving up as soon as it sees a situation that /could/, in a different situation, mean entirely missing chunks with no available copies on remaining devices.)
>
> Again, these patches are in newer kernel versions, so there (assuming no further bugs) they "just work". On older kernels, however, you either have to cherry-pick the patches yourself, or manually avoid or work around the problem they fix. This is why we typically stress new versions so much -- they really /do/ fix active bugs and make problems /much/ easier to deal with. =:^)

Thanks for the explanation. You understood the question correctly: basically I wondered whether btrfs checks that all data can be read before allowing a read-only mount. In my case I was lucky, and I just copied the data from the mounted volume to another place and then copied it back.

>> Do you have any idea why "btrfs balance" has pulled all data to two drives (and not balanced between three)?
>
> Hugo did much better answering that than I would have initially done, as most of my btrfs are raid1 here, but they're all exactly two-device, with the two devices exactly the same size, so I'm not used to thinking in terms of different sizes and didn't actually notice the situation, thus leaving me clueless until Hugo pointed it out.
>
> But he's right. Here's my much more detailed way of saying the same thing, now that he reminded me of why that would be the deciding factor here.
>
> Given that (1) your devices are different sizes, that (2) btrfs raid1 means exactly two copies, not one per device, and that (3) the btrfs chunk-allocator allocates chunks from the device with the most free space left, subject to the restriction that both copies of a raid1 chunk can't be allocated to the same device...
>
> A rebalance of raid1 chunks would indeed start filling the two biggest devices first, until the space available on the smaller of the two biggest devices (thus the second largest) was equal to the space available on the third largest device, at which point it would continue allocating from the largest for one copy (until it too reached equivalent space available), while alternating between the others for the second copy.
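>
> (To put rough numbers on that, purely as an illustration since I don't know your actual device sizes: with, say, 500 GB + 250 GB + 120 GB devices, every raid1 chunk initially gets one copy on the 500 and one on the 250, since those two always have the most free space. Only once roughly 130 GB of chunks has been allocated on each does the 250's free space drop to the 120's level, and only from that point on does the third device start receiving copies.)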
> Given that the amount of data you had fit a copy each on the two largest devices before the space available on either one dwindled to that available on the third largest device, only the two largest devices actually received chunk allocations, leaving the third device, still with less total space than the other two each had remaining available, entirely empty.

I think the mentioned strategy (fill the device with the most free space) is not the most effective one. If the data were spread equally, the read performance would be higher (reading from 3 disks instead of 2). In my case this is even crucial, because the smallest drive is an SSD (and it is not loaded at all). Maybe I am simply missing the benefit of the currently implemented strategy (besides the fact that it is robust and well-tested)?

>> Does btrfs have the following optimization for mirrored data: if a drive is non-rotational, then prefer reads from it? Or does it simply schedule the read to the drive that performs faster (irrespective of rotational status)?
>
> Such optimizations have in general not yet been done to btrfs -- not even scheduling to the faster drive. In fact, the lack of such optimizations is arguably the biggest "objective" proof that the btrfs devs themselves don't yet consider btrfs truly stable.
>
> As any good dev knows, there's a real danger to "premature optimization", with that danger appearing in one or both of two forms: (a) we've now severely limited the alternative code paths we can take, because implementing things differently would force throwing away all that optimization work we did, as it won't work with what would otherwise be the better alternative, and (b) we're now throwing away all that optimization work we did, making it a waste, because the previous implementation didn't work, the new one does, but the new one doesn't work with the current optimization code, so that work must now be redone as well.
>
> Thus, good devs tend to leave moderate to complex optimization code until they know the implementation is stable and won't be changing out from under the optimization. To do otherwise is "premature optimization", and devs tend to be well aware of the problem, often because of the number of times they did it themselves earlier in their careers.
>
> It follows that looking at whether the devs (assuming you consider them good enough to be aware of the dangers of premature optimization, which, if they're writing the code that runs your filesystem, you'd better HOPE they're at least that good, or you and your data are in serious trouble!) have actually /done/ that sort of optimization ends up being a pretty good indicator of whether they consider the code stable enough to avoid the dangers of premature optimization, or not.
>
> In this case, definitely not, since these sorts of optimizations in general remain to be done.
>
> Meanwhile, the present btrfs raid1 read-scheduler is both pretty simple to code up and pretty simple to arrange tests for that exercise either one copy or the other, but not both, or that are well balanced across both. However, it's pretty poor in terms of ensuring optimized real-world deployment read-scheduling.
>
> What it does is simply this. Remember, btrfs raid1 is specifically two copies. It chooses which copy of the two will be read very simply, based on the PID making the request: odd PIDs get assigned one copy, even PIDs the other.
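>
> (In other words, the copy a given process reads boils down to its PID modulo the number of copies, so modulo two for raid1. You can see which way any particular process would fall with nothing fancier than shell arithmetic -- the real decision is of course made per request inside the kernel, this just illustrates the parity idea:
>
>   echo "pid $$ would be served from copy $(( $$ % 2 ))"
>
> A single long-running process therefore always hammers the same copy.)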
> As I said, simple to code, great for ensuring testing of one copy, or the other, or both, but not really optimized at all for real-world usage.
>
> If your workload happens to be a bunch of all-odd or all-even PIDs, well, enjoy your testing-grade read-scheduler, bottlenecking everything on one copy while the other sits entirely idle.
>
> (Of course on fast SSDs with their zero seek-time, which is what I'm using for my own btrfs, that's not the issue it'd be on spinning rust. I'm still using my former reiserfs standard for spinning rust, which I use for backups and media files. But normal operations are on btrfs on ssd, and despite btrfs' lack of optimization, on ssd it's fast /enough/ for my usage, and I particularly like the data integrity features of btrfs raid1 mode, so...)

I think the PID-based solution is not the best one. Why not simply pick a random device? Then at least all drives in the volume are equally loaded (on average).

From what you said I gather that certain servers will not benefit from btrfs, e.g. a dedicated server that runs only one "fat" Java process, or one "huge" MySQL database. In general I think that btrfs should not check the rotational flag, as even SATA-III is two times faster than SATA-II. So an ideal scheduler should simply assign read requests to the drive that copes with reads faster :) If the SSD can read 10 blocks in the time a normal HDD reads one, let it do so. Maybe my case is a corner one, as I am mixing "fast" and "slow" drives in one volume; moreover, the faster drive is the smallest. If I had drives of the same performance, the strategy I suggest would not matter.

>> No, it was my own decision to use btrfs, for various reasons. First of all, I am using raid1 on all data. Second, I benefit from transparent compression. Third, I need CRC consistency: some of the drives (like /dev/sdd in my case) seem to be failing, and I also once had a buggy DIMM, so btrfs helps me not to lose data "silently". Anyway, it is much better than md-raid.
>
> The fact that, despite it being available, mdraid couldn't be configured to runtime-verify integrity using either parity or redundancy, nor checksums (which weren't available), was a very strong disappointment for me.
>
> To me, the fact that btrfs /does/ do runtime checksumming on write and data integrity checking on read, and in raid1/10 mode will actually fall back to the second copy if the first one fails checksum verification, is one of its best features, and why I use btrfs raid1 (or, on a couple of single-device btrfs, mixed-bg mode dup). =:^)
>
> That's also why my personally most hotly anticipated feature is N-way-mirroring, with 3-way being my ideal balance, since that will give me a fallback to the fallback, if both the first read copy and the first fallback copy fail verification. Four-way would be too much, but I just don't quite rest as easy as I otherwise could, because I know that if both the primary-read copy and the fallback happen to be bad, in the same logical place at the same time, there's no third copy to fall back on! It seems as much of a shame not to have that on btrfs with its data integrity, as it did to have mdraid with N-way-mirroring but no runtime data integrity.
> But at least btrfs does have N-way-mirroring on the roadmap, actually for after raid56, which is now done, so N-way-mirroring should be coming up rather soon (even if on btrfs, "soon" is relative), while AFAIK, mdraid has no plans to implement runtime data integrity checking.
>
>> And dynamic assignment is not a problem since udev was introduced (so one can add extra persistent symlinks):
>>
>> https://wiki.debian.org/Persistent_disk_names
>
> FWIW, I actually use labels as my own form of "human-readable" UUID, here. I came up with the scheme back when I was on reiserfs, with its 15-character label limit, so that's what mine are. Using this scheme, I encode the purpose of the filesystem (root/home/media/whatever), the size and brand of the media, the sequence number of the media (since I often have more than one of the same brand and size), the machine the media is targeted at, the date I did the formatting, and the sequence number of the partition (root-working, root-backup1, root-backup2, etc).
>
> hm0238gcnx+35l0
>
> home, on a 238 gig corsair neutron, #x (the filesystem is multi-device, across #0 and #1), targeted at + (the workstation), originally partitioned in (201)3, on May (5) 21 (l), working copy (0).
>
> I use GPT partitioning, which takes partition labels (aka names) as well. The two partitions hosting that filesystem are on identically partitioned corsair neutrons, 256 GB = 238 GiB. The GPT labels on those two partitions are identical to the above, except one has a 0 replacing the x, while the other has a 1, as they are my first and second media of that size and brand:
>
> hm0238gcn0+35l0
> hm0238gcn1+35l0
>
> The primary backup of home, on a different pair of partitions on the same physical devices, is labeled identically, except the partition sequence number is one:
>
> hm0238gcnx+35l1
>
> ... and its partitions:
>
> hm0238gcn0+35l1
> hm0238gcn1+35l1
>
> The secondary backup is on a reiserfs, on spinning rust:
>
> hm0465gsg0+47f0
>
> In that case the partition label and filesystem label are the same, since the partition and its filesystem correspond 1:1. It's home on the 465 GiB (aka 500 GB) seagate #0, targeted at the workstation, first formatted in (201)4, on July (7) 15 (f), first (0) copy there. (I could make it #3 instead of #0, indicating second backup, but didn't, as I know that 0465gsg0+ is the media and backups spinning-rust device for the workstation.)
>
> Both my internal and USB-attached devices use the same labeling scheme: media identified by size, brand, media sequence number and what it's targeting; partition/filesystem identified by purpose, original partition/format date, and partition sequence number.
>
> As I said, it's effectively a human-readable GUID, my own scheme for my own devices.
>
> And I use LABEL= in fstab as well, running gdisk -l to get a listing of partitions with their GPT labels when I need to associate the actual sdN mapping with specific partitions (if I don't already have the mapping from mount or whatever).
>
> Which makes it nice when btrfs fi show outputs the filesystem label as well. =:^)
>
> The actual GUID is, to me, simply machine-readable "noise" that the human shouldn't need to deal with, as the label (of either the gpt partition or the filesystem it hosts) gives me *FAR* more, and more useful, information, while being entirely unique within my ID system.
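That is a nice scheme. When I re-create my filesystem I will probably do something similar, so that btrfs fi show and fstab stay readable. Roughly (the label here is just an example, and the device names are placeholders for my actual ones):

  mkfs.btrfs -L myhost_home_raid1 -m raid1 -d raid1 /dev/sdb /dev/sdc /dev/sdd

and then in /etc/fstab:

  LABEL=myhost_home_raid1  /home  btrfs  defaults  0  0

with gdisk -l /dev/sdb to map GPT partition names back to the current sdN assignment when needed.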
>> If "btrfs device scan" is user-space, then I think doing some output is better than outputting nothing :) (perhaps with a "-v" flag). If it is kernel-space, then I agree that logging to dmesg is not very evident (from the perspective that the user should remember where to look), but I think it has a value.
>
> Well, btrfs is a userspace tool, but in this case, btrfs device scan's use is purely to make a particular kernel call, which triggers the btrfs module to do a device rescan to update its own records, *not* for human consumption. -v to force output could work if it had been designed that way, but getting that output is precisely what btrfs filesystem show is for, printing for both mounted and unmounted filesystems unless told otherwise.
>
> Put it this way. If neither your initr* nor some service started before whatever mounts local filesystems does a btrfs device scan, then attempting to mount a multi-device btrfs will fail, unless all its component devices have been fed in using device= options. Why? Because mount takes exactly one device to mount. With traditional filesystems, that's enough, since they only consist of a single device. And with single-device btrfs, it's enough as well. But with a multi-device btrfs, something has to supply the other devices to btrfs, along with the one that mount tells it about. It is possible to list all those component devices in device= options, but those take /dev/sd* style device nodes, and those may change from boot to boot, so that's not very reliable. Which is where btrfs device scan comes in. It tells the btrfs module to do a general scan and map out internally which devices belong to which filesystems, after which a mount supplying just one of them can work, since this internal map, the generation or refresh of which is triggered by btrfs device scan, supplies the others.
>
> IOW, btrfs device scan needs no output, because all the userspace command does is call a kernel function, which triggers the mapping internal to the btrfs kernel module, so it can then handle mounts with just one of the possibly many devices handed to it from mount.
>
> Outputting that mapping is an entirely different function, with the userspace side of that being btrfs filesystem show, which calls a kernel function that generates output back to the btrfs userspace app, which then further formats it for output back to the user.

I understand that. If btrfs can show the mapping for an *unmounted* volume (e.g. "btrfs fi show /dev/sdb"), that would be great. Also I think that the btrfs kernel-space could be smart enough to perform a scan itself if a mount is attempted without a prior scan. Then one would be able to mount (provided that all devices are present) without any hassle.

>> Thanks. I have carefully read the changelog wiki page and found that:
>>
>> btrfs-progs 4.2.2: scrub: report status 'running' until all devices are finished
>
> Thanks. As I said, I had seen the patch on the list and /thought/ it was now in, but had lost track of specifically when it went in, or indeed /whether/ it had gone in.
>
> Now I know it's in 4.2.2, without having to actually go look it up in the git log again myself.
>
>> An idea concerning balance is listed on the wiki page "Project ideas":
>>
>> balance: allow to run it in background (fork) and report status periodically
>
> FWIW, it sort of does that today, except that btrfs bal start doesn't actually return to the command prompt.
> But again, what it actually does is call a kernel function to initiate the balance, and then it simply waits. On my relatively small btrfs on partitioned ssd, the return is often within a minute or two anyway, but on multi-TB spinning rust...
>
> In any case, once the kernel function has triggered the balance, ctrl-C should, I believe, terminate the userspace side and get you back to the prompt, without terminating the balance, as that continues on in kernel space.
>
> But it would still be useful to have balance start actually return quickly, instead of having to ctrl-C it.

Thanks for expressing your thoughts. I will keep an eye on the development of new features.

-- 
With best regards,
Dmitry