(consider that question being asked with that face on: http://goo.gl/LQaOuA)
Hey.

I've had some discussions on the list these days about not having checksumming with nodatacow (mostly with Hugo and Duncan). They both basically told me that checksumming wouldn't be straightforwardly possible without CoW, and Duncan thinks it may not be that necessary, but neither of them could give me really hard arguments why it cannot work (or perhaps I was just too stupid to understand them ^^)... while at the same time I think that having checksumming would generally be of utmost importance (real-world examples below).

Also, I remember that in 2014 Ted Ts'o told me that there were plans to get data checksumming into ext4, with possibly even some guy at RH actually doing it sooner or later.

Since those threads were rather admin-work-centric, developers may have skipped them; therefore I decided to write down some thoughts & ideas, label them with a more attractive subject, and give the whole thing some bigger attention. O:-)

1) Motivation: why it makes sense to have checksumming (especially also in the nodatacow case)

Of all the major btrfs features I know of (apart from CoW itself and things like reflinks), checksumming is perhaps the one that distinguishes it most from traditional filesystems. Sure, we have snapshots, multi-device support and compression - but we could have had those as well with LVM and software/hardware RAID... (and NTFS supported compression, IIRC ;) ). Of course btrfs does all that in a much smarter way, I know, but it's nothing fundamentally new. *Data* checksumming at the filesystem level, to my knowledge, is. Especially that it's always verified. Awesome. :-)

When one gets a bit deeper into btrfs (from the admin/end-user side), one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.; a small sketch of how that's typically set per file follows below), the reason, AFAIU, being the inherent fragmentation that comes with CoW, which is especially noticeable for files with lots of random internal writes.

Now Duncan implied that this could improve in the future, with auto-defragmentation getting (even) better, defrag becoming usable again for those who take snapshots or reflinked copies, and btrfs itself generally maturing more and more. But I kinda wonder to what extent one will really be able to solve what seems to me a CoW-inherent "problem"... Even *if* the auto-defrag can be made much smarter, it would still mean that such files (big DBs, VMs, or scientific datasets that are internally rewritten) may get more or less constantly defragmented. That may be quite undesired...
a) for performance reasons (when I consider our research software, which often has IO as the limiting factor and where we want as much IO as possible to be used by the actual programs)...
b) for SSDs... Not really sure about that one; btrfs seems to enable autodefrag even when an SSD is detected... what is it doing there? Placing the blocks in a smart way on different chips so that accesses can be better parallelised by the controller?
Anyway, (a) alone could already be argument enough not to solve the problem via a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great, and not just a workaround until everything else gets good enough to handle these cases. Thus checksumming, which is such a vital feature, should also be possible for it.

Duncan also mentioned that in some of those cases the integrity is already protected at the application layer, making it less important to have it at the fs layer.
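(Aside, as promised above, for those who haven't met it yet: per-file nodatacow is normally set with chattr +C on a still-empty file or directory, which maps to the FS_NOCOW_FL inode flag. Purely as an illustrative sketch, nothing btrfs-specific beyond that flag, doing the same via ioctl() could look roughly like this:)

  /* Illustrative sketch only: mark a still-empty file as nodatacow,
   * i.e. what chattr +C does, by setting the FS_NOCOW_FL inode flag. */
  #include <fcntl.h>
  #include <linux/fs.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      int fd, flags;

      if (argc != 2) {
          fprintf(stderr, "usage: %s <empty file>\n", argv[0]);
          return 1;
      }
      fd = open(argv[1], O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
          perror("FS_IOC_GETFLAGS");
          return 1;
      }
      flags |= FS_NOCOW_FL;  /* only effective while the file has no data yet */
      if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
          perror("FS_IOC_SETFLAGS");
          return 1;
      }
      close(fd);
      return 0;
  }

The catch being, of course, that as of today setting NOCOW also disables checksumming for that file, which is exactly what this mail is about.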
Coming back to the application-layer argument: this may be true for file-sharing protocols, but I'm not aware that relational DBs really do checksumming of their data. They have journals, of course, but those protect against crashes, not against silent block errors and the like. And I'm not aware that VM hypervisors do checksumming either (but perhaps I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for LHC at work/university. We have large amounts of storage (perhaps not as large as what Google and Facebook have, or what the NSA stores about us)... but it's still some ~2 PiB, or a bit more. That's managed with special storage management software called dCache. dCache even stores checksums, but per file, which means they cannot be verified on normal reads (well, technically it's supported, but with our usual file sizes it doesn't work out), so what remains are scrubs.

For those two PiB we have roughly 50-60 nodes, each with something between 12 and 24 disks, usually in either one or two RAID6 volumes, with all different kinds of hard disks. And we run these scrubs quite rarely, since they cost IO that could be used for actual computing jobs (a problem that wouldn't exist with the way btrfs verifies the checksums on read: the data is being read anyway)... so likely there are even more errors that are simply never noticed, because the datasets are removed again before ever being scrubbed.

Long story short, it does happen every now and then that a scrub shows file errors where neither the RAID was broken, nor were there any block errors reported by the disks, nor anything suspicious in SMART. In other words: silent block corruption.

One may rely on the applications to do integrity protection, but I think that's not realistic, and perhaps it shouldn't be their task anyway (at least not when it comes to storage-device block errors and the like). I don't think it's on the horizon that things like DBs or large scientific data files will do their own integrity protection (i.e. one that protects against bad blocks, not just journalling that preserves consistency in case of crashes). And handling it at the fs level is quite nice anyway, I think: countless applications don't each need to implement it on the application layer, make it configurable whether it should be enabled (for integrity protection) or disabled (for more speed), and write a lot of code for that. If we can control it at the fs layer, by setting datasum/nodatasum, everything needed is already there - except that, as of now, nodatacow'ed stuff is excluded in btrfs.

2) Technical

Okay, the following is obviously based on my naive view of how things could work, which may not necessarily match how an actual fs developer sees things ;-)

As said in the introduction, I can't quite believe that data checksumming should in principle be possible for ext4, but not for the non-CoWed parts of btrfs. Duncan & Hugo said the reason is basically that it cannot do checksums with no-CoW, because there's no guarantee that the fs doesn't end up in an inconsistent state... But, AFAIU, not doing CoW while also not having a journal (or does it have one for these cases???) almost certainly means that the data (not necessarily the fs) will be inconsistent anyway in case of a crash during a non-CoWed write, right? Wouldn't it be basically like ext2?

Or take the multi-device case, e.g. RAID1: multiple copies of the same blocks, and a crash has happened while writing such a (non-CoWed and non-checksummed) block...
Again, it's almost certain that at least one (maybe even both) of the copies contains garbage, and it's likely (at least a 50% chance) that we get exactly that one when the actual read happens later (I was told btrfs would behave in these cases like e.g. MD RAID does: deliver whatever the first readable copy says).

If btrfs were to calculate checksums and write them e.g. before or after the actual data... what would be the worst that could happen (in my naive understanding, of course ;-) ) at a crash?
- Either one is lucky, and checksum and data match. Yay.
- Or they don't match, which boils down to the following two cases:
  - The data wasn't written out correctly and is actually garbage => then we can be happy that the checksum doesn't match and we get an error.
  - The data was written out correctly, but the system crashed before the csum was written, so the csum now tells us that the block is bad while in reality it isn't. Or the other way round: the csum was written out (completely), and no data was written at all before the system crashed (so the old block would still be completely there). => In both cases: so what? That particular case is probably far less likely than csumming actually detecting a bad block, or detecting not-completely-written data after a crash. (Not to mention all the cases where nothing crashes, and where we simply want to detect block errors, bus errors, etc.)

=> Of course it wouldn't be as nice as with CoW, where btrfs can simply take the most recent consistent state of that block, but it would still be way better than:
- delivering bogus data to the application in the n other cases
- not being able to decide which of the m block copies is valid when a RAID is scrubbed

And as said before, AFAIU, nodatacow'ed files have no journal in btrfs as in ext3/4, so such files may end up in basically any state anyway when a crash happens during a write, right? Which makes not having a csum sound even worse, since nothing tells us that the file is possibly bad.

Not having checksumming seems especially bad in the multi-device case... what happens when one runs a scrub? AFAIU, it simply does what e.g. MD does: take the first readable copy and write it over the others, thereby possibly destroying the actually good one?

Not sure whether the following would make any practical sense: if data checksumming worked for nodatacow, then maybe some people would even choose to run btrfs in a "CoW1" mode... they'd still have most of the fancy btrfs features (checksumming, snapshots, perhaps even refcopy?), but unless snapshots or refcopies are explicitly made, btrfs wouldn't do CoW.
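Just to illustrate what I mean by "deciding which copy is valid" (this is purely a naive sketch with made-up names, not a claim about how the actual btrfs read path works; IIRC btrfs uses crc32c for its checksums, so I use that here), picking a copy to serve, on read or on scrub, could conceptually look like this:

  #include <stddef.h>
  #include <stdint.h>

  /* Plain bitwise CRC-32C (Castagnoli), just for the sake of the example. */
  uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
  {
      const uint8_t *p = buf;

      crc = ~crc;
      while (len--) {
          crc ^= *p++;
          for (int i = 0; i < 8; i++)
              crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1));
      }
      return ~crc;
  }

  /* One stored copy of a data block, e.g. one RAID1 mirror. */
  struct block_copy {
      const uint8_t *data;
      size_t len;
  };

  /* Return the first copy whose checksum matches, or NULL if none does.
   * The point: with a csum we can pick the *good* copy (and repair the
   * bad one from it), instead of blindly serving whichever copy happens
   * to be read first. If no copy matches, better an error than garbage. */
  const struct block_copy *
  pick_valid_copy(const struct block_copy *copies, int n, uint32_t stored_csum)
  {
      for (int i = 0; i < n; i++)
          if (crc32c(0, copies[i].data, copies[i].len) == stored_csum)
              return &copies[i];
      return NULL;
  }

And the same decision would apply to scrub: only copies whose csum matches get propagated to the others, rather than "first readable wins".

Well, thanks for spending (hopefully not wasting ;-) ) your time on reading my X-Mas wish ;)

Cheers, Chris.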