Christoph Anton Mitterer posted on Wed, 09 Dec 2015 06:43:01 +0100 as excerpted:
> Hey Hugo,
>
>
> On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
>> The issue is that nodatacow bypasses the transactional nature of
>> the FS, making changes to live data immediately. This then means that
>> if you modify a nodatacow file, the csum for that modified section is
>> out of date, and won't be back in sync again until the latest
>> transaction is committed. So you can end up with an inconsistent
>> filesystem if there's a crash between the two events.

> Sure,... (and btw: is there some kind of journal planned for
> nodatacow'ed files?),... but why not simply try to write an updated
> checksum after the modified section has been flushed to disk... of
> course there's no guarantee that both are consistent in case of a crash
> (but that's also the case without any checksum)... but at least one
> would have the csum protection against everything else (block errors
> and the like) in case no crash occurs?

Answering the BTW first: not to my knowledge, and I'd be skeptical. In
general, btrfs is cowed, and that's the focus. To the extent that nocow
is necessary for fragmentation/performance reasons, etc, the idea is to
try to make cow work better in those cases, for example by working on
autodefrag to make it better at handling large files without the scaling
issues it currently has above half a gig or so, and thus to confine
nocow to a smaller and smaller niche use-case, rather than focusing on
making nocow better.

Of course it remains to be seen how much better they can do with
autodefrag, etc, but at this point there are far more project
possibilities than people to develop them, so even if they do find they
can't make cow work much better for these cases, actually working on
nocow would still be rather far down the list, because there are so many
other improvement and feature opportunities that will get the focus
first.
Which in practice probably puts it in "it'd be nice, but it's low enough
priority that we're talking five years out or more, unless of course
someone else qualified steps up and it's a personal itch they want to
scratch" territory.

As for writing an updated checksum after modification, the problem is
that in the meantime the checksum wouldn't verify, and while btrfs could
of course keep status in memory during normal operations, that's not the
problem; the problem is what happens if there's a crash and the
in-memory state vaporizes. In that case, when btrfs remounted, it'd have
no way of knowing why the checksum didn't match, just that it didn't,
and would then refuse access to that block in the file, because for all
it knows, it /is/ a block error. And there's already a mechanism for
telling btrfs to ignore checksums, and nocow already activates it, so...
there's really nothing more to be done.

>> > For me the checksumming is actually the most important part of btrfs
>> > (not that I wouldn't like its other features as well)... so turning
>> > it off is something I really would want to avoid.

Same here. In fact, my most anticipated feature is N-way-mirroring,
since that will allow three copies (or more, but three is my sweet-spot
balance between the space and reliability factors) instead of the
current limit of two. It just disturbs me that in the event of one copy
being bad, the other copy /better/ be good, because there's no further
fallback! With a third copy, there'd be that one further fallback, and
the chances of all three copies failing checksum verification are remote
enough that I'm willing to risk it, given the incremental cost of
additional copies.

>> > Plus it opens questions like: When there are no checksums, how can it
>> > (in the RAID cases) decide which block is the good one in case of
>> > corruptions?
>> It doesn't decide -- both copies look equally good, because
>> there's no checksum, so if you read the data, the FS will return
>> whatever data was on the copy it happened to pick.

> Hmm I see... so one gets basically the behaviour of RAID.
> Isn't that kind of a big loss? I always considered the guarantee
> against block errors and the like one of the big and basic features of
> btrfs.

It is a big and basic feature, but turning it off isn't the end of the
world, because you still get the same level of reliability that other
solutions such as raid generally provide. And the choice to turn it off
is just that, a choice, tho it's currently the recommended one in some
cases, such as with large VM images, etc.

But as it happens, both VM image management and databases tend to come
with their own integrity management, in part precisely because the
filesystem could never provide that sort of service. So to the extent
that btrfs must turn off its integrity management features when dealing
with that sort of file, it's no bigger deal than it would be on any
other filesystem; it's simply returning what's normally a huge bonus
compared to other filesystems to the status quo, for the specific
situations it otherwise doesn't deal so well with. And if the status quo
was good enough before, and in the absence of btrfs would of necessity
be good enough still, then where it's necessary with btrfs, it's good
enough there as well. IOW, there's only upside, no downside. If the
upside doesn't apply, it's still no worse than it was before; no
downside.

> It seems that for certain (not too unimportant) cases (DBs, VMs) one
> has to choose between two evils: losing the guaranteed consistency via
> checksums... or basically running into severe trouble (like Mitch's
> reported fragmentation issues).
>
>
>> > 3) When I would actually disable datacow for e.g. a subvolume that
>> > holds VMs or DBs... what are all the implications?
>> > Obviously no checksumming, but what happens if I snapshot such a
>> > subvolume or if I send/receive it?
>>
>> After snapshotting, modifications are CoWed precisely once, and
>> then it reverts to nodatacow again. This means that making a snapshot
>> of a nodatacow object will cause it to fragment as writes are made to
>> it.

> I see... something that should possibly go into some advanced admin
> documentation (if not already there).
> It basically means that one must ensure that any such files (VM
> images, DB data dirs) are already created with nodatacow (perhaps on a
> subvolume which is mounted as such).
>
>
>> > 4) Duncan mentioned that defrag (and I guess that also applies to
>> > auto-defrag) isn't ref-link aware...
>> > Isn't that somehow a complete showstopper?
>>
>> It is, but the one attempt at dealing with it caused massive data
>> corruption, and it was turned off again.

IIRC, it wasn't data corruption so much as massive scaling issues, to
the point where defrag was entirely useless, as it could take a week or
more for just one file. So the decision was made that a
non-reflink-aware defrag that actually worked in something like
reasonable time, even if it did break reflinks and thus increase space
usage, was of more use than a defrag that basically didn't work at all,
because it effectively took an eternity. After all, you can always
decide not to run it if you're worried about the space effects it's
going to have, but if it's going to take a week or more for just one
file, you effectively don't have the choice to run it at all.

> So... does this mean that it's still planned to be implemented some
> day or has it been given up forever?

AFAIK it's still on the list. And the scaling issues are better, but one
big thing holding it up now is quota management.
Quotas have never worked correctly, but they were a big part (close to
half, IIRC) of the original snapshot-aware-defrag scaling issues, and
thus must be reliably working and in a generally stable state before a
snapshot-aware defrag can be coded to work with them. And without that,
it's only half a solution that would have to be redone once quotas
stabilized anyway, so really, the quota code /must/ be stabilized to the
point that it's not a moving target before reimplementing
snapshot-aware defrag makes any sense at all.

But even at that point, while snapshot-aware defrag is still on the
list, I'm not sure it's ever going to be actually viable. It may be that
the scaling issues are just too big, and it simply can't be made to work
both correctly and in anything approaching practical time. Time will
tell, of course, but until then...

> Given that you (or Duncan?,... sorry, I sometimes mix up which of you
> said exactly what, since both of you are notoriously helpful :-) )
> mentioned that autodefrag basically fails with larger files,... and
> given that it seems to be quite important for btrfs not to be
> fragmented too heavily, it sounds a bit as if anything that uses
> (multiple) reflinks (e.g. snapshots) cannot really be used very well.

That might have been either of us, as I think we've both said
effectively that, over time.

As for reflink/snapshot usefulness, it really depends on your use-case.
If both modifications and snapshots are seldom, it shouldn't be a big
deal. For use-cases where snapshots are temporary, as can be the case
for most snapshots anyway in most send/receive usage scenarios, again,
the problem is quite limited. The biggest problem is with large
random-rewrite-pattern files, where both rewrites and snapshots occur
frequently. That's really a worst-case for copy-on-write in general, and
btrfs is no exception.
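For completeness, the usual way to arrange what was said earlier about
VM images and DB data dirs (they must be *created* nodatacow, setting it
later on a populated file isn't reliable) is the C file attribute on
their directory. A rough sketch, with a hypothetical path, needing root
and a btrfs filesystem:

```shell
# Hypothetical VM-image directory on a btrfs filesystem.
# The C (nocow) attribute only takes reliable effect on files created
# *after* it is set, so set it on the empty directory first; files
# created inside then inherit nodatacow.
mkdir -p /srv/vm-images
chattr +C /srv/vm-images

# Verify: the directory's attribute listing should include the 'C' flag.
lsattr -d /srv/vm-images
```

The alternative is mounting the whole filesystem (or the subvolume's
filesystem) with -o nodatacow, but that's all-or-nothing, so the
per-directory attribute is usually the better fit.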
But there are still workarounds that can help keep the situation under
control, and if it comes to it, one can always use other filesystems and
accept their limitations, where btrfs isn't a particularly useful choice
due to these sorts of limitations. Which again emphasizes my point:
while there are cases where btrfs' features run into limits, it's all
upside, no downside. Worst-case, you set nocow and turn off
snapshotting, but that's exactly the situation you're in anyway with
other filesystems, so you're no worse off than if you were using them.
Meanwhile, where those btrfs features *can* be used, which is on /most/
files, with only limited exceptions, it's all upside! =:^)

>> autodefrag, however, has
>> always been snapshot aware and snapshot safe, and would be the
>> recommended approach here.

> Ahhh... so autodefrag *is* snapshot aware, and that's basically why
> the suggestion is (AFAIU) that it be turned on, right?

FWIW, I've seen it asserted that autodefrag is snapshot aware a few
times now, but I'm not personally sure that is the case, and I don't see
any immediately obvious reason it would be when (manual) defrag isn't,
so I've refrained from making that claim myself. If I were to see
multiple devs make that assertion, I'd be more confident, but I believe
I've only seen it from Hugo, and while I trust him in general because in
general what he says makes sense, here, as I said, it just doesn't make
immediate sense to me that the two would be so different, and without
that explained, and lacking further/other confirmation... I just remain
personally unsure and thus refrain from making that assertion myself.
Which is why you've not seen me mention it...

Tho I can and _do_ say I've been happy with autodefrag here, and ensure
it's enabled on everything, generally on first mount.
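Enabling autodefrag is just a mount option, and it can be toggled on a
live filesystem via remount. A minimal sketch, with a hypothetical
mountpoint, needing root:

```shell
# Turn on autodefrag for an already-mounted btrfs filesystem; the
# option applies filesystem-wide, not per subvolume.
mount -o remount,autodefrag /mnt/data

# Confirm the option took effect:
grep autodefrag /proc/mounts
```

Of course, as discussed below, turning it on mid-life is second-best;
having it in the mount options from the very first mount is the ideal.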
But again, my particular use-case doesn't deal with snapshots or
reflinking in general, nor does it have these large
random-rewrite-pattern files, so I'd be unlikely to see the effects of
reflink-awareness, or the lack thereof, in my own autodefrag usage,
however much I might otherwise endorse it in general.

> So, I'm afraid O:-), that triggers a follow-up question:
> Why isn't it the default? Or in other words, what are its drawbacks
> (e.g. other cases where ref-links would be broken up,... or issues
> with compression)?

The biggest downside of autodefrag is its performance on large
(generally noticeable between half a gig and a gig)
random-rewrite-pattern files in actively-being-rewritten use. For all
other cases it's generally recommended, but that's why it's not the
default. And the problem there is simply that at some point the files
get large enough that the defragging rewrites take longer than the time
between the random updates, so the defragging rewrites become the
bottleneck. As long as that's not occurring, either because the file is
small enough, or because the backing device is SSD and/or simply fast
enough, or because the updates come in slowly enough to allow the file
to be rewritten between them (the VM or DB using the file isn't in heavy
enough use to trigger the problem), autodefrag works fine.

Meanwhile, there remain some tweaks they think they can make to
autodefrag that in theory should eliminate this issue, or at least move
the bottleneck to, say, 10 gig instead of 1 gig, but again, there are
far more improvements to be made at this point than devs working on
making them, so this improvement, like many others, simply has to wait
its turn. However, this one's at least intermediate priority, so I'd put
it anywhere from two months to perhaps three years out. It's unlikely to
be beyond the 5-year mark, as some features on the wishlist almost
certainly are.
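If you want to see whether a given file is actually hitting this kind of
fragmentation, filefrag (from e2fsprogs) reports its extent count; the
path below is hypothetical, and note that btrfs compression inflates the
count (each ~128 KiB compressed chunk shows as its own extent), so treat
it as a rough indicator:

```shell
# Report the number of extents in a file; thousands of extents on a
# multi-gig image is the heavy fragmentation discussed above.
filefrag /srv/vm-images/guest.img

# -v lists each extent individually, for a closer look at the layout:
filefrag -v /srv/vm-images/guest.img | head
```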
> And also, when I now activate it on an already populated fs, will it
> also defrag old files (even if they're not rewritten or so)?
> I tried to have a look for some general (rather "for dummies" than for
> core developers) description of how defrag and autodefrag work... but
> couldn't find anything in the usual places... :-(

AFAIK autodefrag only queues up a defrag when it detects fragmentation
beyond some threshold, and it only checks, and thus only detects, at
file (re)write. Additionally, on a filesystem that hasn't had autodefrag
on from the beginning, fragmentation is likely to be high enough that
defrag, either auto or manual, won't be able to defrag to ideal levels,
and fragmentation is thus likely to remain high for some time. Further,
when a filesystem is highly fragmented and autodefrag is first turned
on, it often actually affects performance rather negatively for a few
days, because so many files are so fragmented that it's queuing up
defrags for nearly everything written.

So really, the ideal is having autodefrag on from the beginning, which
is why I generally ensure it's on from the very first mount, or at least
before I actually start putting files in the filesystem, here. (Normally
I'll create the filesystem, including the label, and create the fstab
entry for it, referencing that label and including autodefrag, at very
nearly the same time, sometimes creating the fstab entry first, since I
use the label, not the UUID. Then I mount it using that fstab entry, so
yes, it /does/ have autodefrag enabled from the very first mount. =:^)

Of course this might be reason enough to verify your backups one more
time, blow away the filesystem with a brand new mkfs.btrfs, create that
fstab entry with autodefrag included, mount, and restore from backups.
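That from-scratch workflow might look roughly like this; the device,
label, and mountpoint are hypothetical, it needs root, and it of course
destroys anything already on the device:

```shell
# Make the filesystem with a label; -n 16k sets the 16 KiB node size
# explicitly (it's the default on current btrfs-progs anyway).
mkfs.btrfs -L mydata -n 16k /dev/sdb1

# fstab entry referencing the label, with autodefrag in the options so
# it's active from the very first mount:
echo 'LABEL=mydata /mnt/mydata btrfs defaults,autodefrag 0 0' >> /etc/fstab

# Mounting via the fstab entry picks up autodefrag immediately:
mkdir -p /mnt/mydata
mount /mnt/mydata
```

Then restore from (verified!) backups into the freshly mounted
filesystem.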
This even gives you a chance to activate newer btrfs features, like the
now-default 16 KiB node size, if your filesystem is old enough to have
been created before they were available, or before they were the
default. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html