Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as excerpted:

> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>> nodatacow only [avoids fragmentation] if the file is pre-allocated, if it isn't, then it still ends up fragmented.
>> Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.
> If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated.

Does the 30-second-by-default commit window (and the similarly 30-second-default dirty-flush time at the VFS level) modify this at all?

It has been my assumption that same-file writes accumulated during this time should merge, increasing efficiency and decreasing fragmentation (both with and without nocow), tho of course further writes outside this 30-second window will likely trigger it, if other files have been written in parallel or in the meantime.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: dear developers, can we have notdatacow + checksumming, plz?
On 2015-12-22 04:12, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as excerpted:
>> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>>> nodatacow only [avoids fragmentation] if the file is pre-allocated, if it isn't, then it still ends up fragmented.
>>> Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.
>> If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated.
> Does the 30-second-by-default commit window (and the similarly 30-second-default dirty-flush time at the VFS level) modify this at all?
> It has been my assumption that same-file writes accumulated during this time should merge, increasing efficiency and decreasing fragmentation (both with and without nocow), tho of course further writes outside this 30-second window will likely trigger it, if other files have been written in parallel or in the meantime.

I think it does, but not much, and it depends on the workload. I do notice less fragmentation on the filesystems I increase the commit window on, and more on the ones I decrease it on, but the difference is pretty small as long as you use something reasonable (I've never tested anything higher than 300, and I rarely go above 60).

My guess, based on what the commit window is for (namely, it's the amount of time the log tree gets updated before forcing a transaction to be committed), would be that it has less effect if stuff is regularly calling fsync().
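If it helps to make the fsync() point concrete, here's a toy C program (not from any real workload; the file name and sizes are made up) that does the same stream of small appends either with or without an fsync() after each write. Without the per-write fsync(), the dirty pages can sit in the page cache until the VFS flush / btrfs commit interval and get written out together; with it, each append gets pushed out on its own:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* pass --fsync to force every append out individually */
    int sync_each = (argc > 1 && strcmp(argv[1], "--fsync") == 0);

    int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 1024; i++) {            /* 4 MiB in 4 KiB appends */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
        if (sync_each && fsync(fd) != 0)        /* defeats write batching */
            perror("fsync");
    }
    close(fd);
    return 0;
}

Comparing filefrag output on the resulting file after each run should give a rough idea of how much the commit window is (or isn't) merging things on a particular setup.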
Re: dear developers, can we have notdatacow + checksumming, plz?
On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>>> Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-) But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, and be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution.
>> nodatacow only works for that if the file is pre-allocated, if it isn't, then it still ends up fragmented.
> Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.

If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated.

>>>> The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk.
>>> Sure... but that's just the same for the nodatacow writes of data. (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.)
>> Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency.
> So then it's no reason why it shouldn't work. The meta-data is CoWed; any incomplete writes of checksum data in that (be it for CoWed data or no-CoWed data, should the latter be implemented) would be protected at that level. Currently, the no-CoWed data is, AFAIU, completely at risk of being corrupted (no checksums, no journal). Checksums on no-CoWed data would just improve that.

Except that without COW semantics on the data blocks, you can't be sure whether the checksum is for the data that is there, the data that was going to be written there, or data that had been there previously. This will significantly increase the chances of having false positives, which really isn't a viable tradeoff.

>>> What about VMs? At least a quick google search didn't give me any results on whether there would be e.g. checksumming support for qcow2. For raw images there surely is not.
>> I don't mean that the VMM does checksumming, I mean that the guest OS should be the one to handle the corruption. No sane OS doesn't run at least some form of consistency checks when mounting a filesystem.
> Well, but we're not talking about having a filesystem that "looks clear" here. For this alone we wouldn't need any checksumming at all. We talk about data integrity protection, i.e. all files and their contents. Nothing which a fsck inside a guest VM would ever notice, if there are just some bit flips or things like that.

That really depends on what is being done inside the VM. If you're using BTRFS or even dm-verity, you should have no issues detecting the corruption.

>>> And even if DBs do some checksumming now, it may be just a consequence of that missing in the filesystems. As I've written somewhere else in the previous mail: it's IMHO much better if one system takes care of this, where the code is well tested, than each application doing its own thing.
>> That's really a subjective opinion. The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can.
> Hmm, I don't see that. When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid. What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it.

Any number of things. As of right now, there are no local filesystems on Linux that provide:
1. Cryptographic verification of the file data (technically possible with IMA and EVM, or with DM-Verity (if the data is supposed to be read-only), but those require extra setup, and aren't part of the FS).
2. Erasure coding other than what is provided by RAID5/6 (at least one distributed cluster filesystem provides this (Ceph), but running such a FS on a single node is impractical).
3. Efficient transactional logging (for example, the type that is needed by most RDBMS software).
4. Easy selective protections (some applications need only part of their data protected).

Item 1 can't really be provided by BTRFS under its current design; it would require at least implementing support for cryptographically secure hashes in place of CRC32c (and each attempt to do that has been pretty much shot down). Item 2 is possible, and is something I would love to see support for, but would require a significant amount of coding, and almost certainly wouldn't be anywhere near as flexible as letting
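As a rough illustration of what I mean by item 4 (this is just a toy sketch, not code from any real application -- the record layout and function names are made up, with zlib's crc32() standing in for whatever checksum the application actually wants), an application can protect exactly the records it cares about and nothing else:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

struct record {
    uint32_t crc;                  /* crc32 of the payload */
    uint32_t len;
    unsigned char payload[4096];
};

static void record_seal(struct record *r)
{
    r->crc = crc32(0L, r->payload, r->len);
}

static int record_ok(const struct record *r)
{
    return r->crc == crc32(0L, r->payload, r->len);
}

int main(void)
{
    struct record r = { .len = 5 };
    memcpy(r.payload, "hello", 5);
    record_seal(&r);
    printf("verifies: %d\n", record_ok(&r));
    r.payload[0] ^= 0x01;          /* simulate silent corruption */
    printf("after bit flip: %d\n", record_ok(&r));
    return 0;
}

The application gets to decide which records are worth the overhead, which the filesystem can't do on its behalf.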
Re: dear developers, can we have notdatacow + checksumming, plz?
On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>> Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-)
>> But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, and be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution.
> nodatacow only works for that if the file is pre-allocated, if it isn't, then it still ends up fragmented.

Hmm, is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.

>>> The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk.
>> Sure... but that's just the same for the nodatacow writes of data.
>> (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.)
> Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency.

So then it's no reason why it shouldn't work. The meta-data is CoWed; any incomplete writes of checksum data in that (be it for CoWed data or no-CoWed data, should the latter be implemented) would be protected at that level.
Currently, the no-CoWed data is, AFAIU, completely at risk of being corrupted (no checksums, no journal). Checksums on no-CoWed data would just improve that.

>> What about VMs? At least a quick google search didn't give me any results on whether there would be e.g. checksumming support for qcow2. For raw images there surely is not.
> I don't mean that the VMM does checksumming, I mean that the guest OS should be the one to handle the corruption. No sane OS doesn't run at least some form of consistency checks when mounting a filesystem.

Well, but we're not talking about having a filesystem that "looks clear" here. For this alone we wouldn't need any checksumming at all.
We talk about data integrity protection, i.e. all files and their contents. Nothing which a fsck inside a guest VM would ever notice, if there are just some bit flips or things like that.

>> And even if DBs do some checksumming now, it may be just a consequence of that missing in the filesystems.
>> As I've written somewhere else in the previous mail: it's IMHO much better if one system takes care of this, where the code is well tested, than each application doing its own thing.
> That's really a subjective opinion. The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can.

Hmm, I don't see that. When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid. What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it.

And in practice it seems far more likely, if countless applications handle such a task on their own, that it's more error prone (that's why we have libraries for all kinds of code, trying to reuse code, minimising the possibility of errors in countless home-brew solutions), or not done at all.

>>>> - the data was written out correctly, but before the csum was written the system crashed, so the csum would now tell us that the block is bad, while in reality it isn't.
>>> There is another case to consider: the data got written out, but the crash happened while writing the checksum (so the checksum was partially written, and is corrupt). This means we get a false positive on a disk error that isn't there, even when the data is correct, and that should be avoided if at all possible.
>> I've had that, and I've left it quoted above.
>> But as I've said before: that's one case out of many? How likely is it that the crash happens exactly after a large data block has been written, followed by a relatively tiny amount of checksum data?
>> I'd assume it's far more likely that the crash happens during writing the data.
> Except that the whole metadata block pointing to that data block gets rewritten, not just the checksum.

But that's the case anyway, isn't it? With or without checksums.

>> And regarding "reporting data to be in error, which is actually correct"... isn't that what all journaling systems may do?
> No, most of them don't actually do that. The general design of a journaling filesystem is that
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as excerpted:

> And in particular, the only journaling filesystem that I know of that even allows the option of journaling the file contents instead of just metadata is ext4.

IIRC, ext3 was the first to have it in Linux mainline, with data=writeback for the speed freaks that don't care about data loss, data=ordered as the default normal option (except for that infamous period when Linus lost his head and let people talk him into switching to data=writeback, despite the risks... he later came back to his senses and reverted that), and data=journal for the folks that were willing to trade a bit of speed for better data protection (tho it was famous for surprising everybody, in that in certain use-cases it was extremely fast, faster than data=writeback, something I don't think was ever fully explained). To my knowledge ext3 still has that, tho I haven't used it in probably a decade.

Reiserfs has all three data= options as well, with data=ordered the default, tho it only had data=writeback initially. While I've used reiserfs for years, it has always been with the default data=ordered since that was introduced, and I'd be surprised if data=journal had the same use-case speed advantage that it did on ext3, as it's too different. Meanwhile, that early data=writeback default is where reiserfs got its ill repute for data loss, but it had long since switched to data=ordered by default by the time Linus lost his senses and tried data=writeback by default on ext3.

Because I was on reiserfs from the data=writeback era, I was rather glad most kernel hackers didn't want to touch it by the time Linus let them talk him into data=writeback on ext3, and thus left reiserfs (which again had long been data=ordered by default by then) well enough alone. But I did help a few people running ext3 trace down their new ext3 stability issues to that bad data=writeback experiment, and persuaded them to specify data=ordered, which solved their problems, so indeed they /were/ data=writeback related. And happily, Linus did eventually regain his senses and return ext3 to data=ordered by default once again.

And based on what you said, ext4 still has all three data= options, including data=journal. But I wasn't sure on that myself (tho I would have assumed it inherited it from ext3) and thus am /definitely/ not sure whether it inherits ext3's data=journal speed advantages in certain corner-cases.

I have no idea whether other journaled filesystems allow choosing the journal level or not, tho. I only know of those three.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: dear developers, can we have notdatacow + checksumming, plz?
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as excerpted:

> AFAIUI, checksums are stored per-instance for every block. This is important in a multi-device filesystem in case you lose a device, so that you still have a checksum for the block. There should be no difference between extent layout and compression between devices however.

I don't believe that's quite correct. What is correct, to the best of my knowledge, is that checksums are metadata, and thus have whatever duplication/parity level metadata is assigned.

For single devices, that is of course by default dup: 2X the metadata and thus 2X the checksums, covering the single data (as effectively the only choice on a single device, at least thru 4.3, tho there's a patch adding dup data as an option that I think should be in 4.4) when covering data, and the dup metadata when covering metadata.

For multiple devices, it's default raid1 metadata, default single data, so the picture doesn't differ much by default from the single-device default picture. It's also possible to do single metadata, raidN data, which really doesn't make sense except for raid0 data, and thus I believe there's a warning about that sort of layout in newer mkfs.btrfs, or when lowering the metadata redundancy using balance filters. But of course it's possible to do raid1 data and metadata, which would be two copies of each, regardless of the number of devices (except that it's 2+, of course).

But the copies aren't 1:1 assigned. That is, if they're of equal generation, btrfs can read either checksum and apply it to either data/metadata block. (Of course if they're not of equal generation, btrfs will choose the higher one, thus covering the case of writing at the time of a crash, since either they will both be the same generation if the root block wasn't updated to the new one on either one yet, or one will be a higher/newer generation than the other, if it had already finished writing one but not the other at the time of the crash.)

This is why it's an extremely good idea, if you have a pair of devices in raid1 and you mount one of them degraded/writable with the other unavailable for some reason, that you don't also mount the other one writable and then try to recombine them. Chances are the generations wouldn't match and it'd pick the one with the higher generation, but if they did for some reason match, and both checksums were valid for their data, but the data differed... either one could be chosen, and a scrub might choose either one to fix the other as well, which could in theory result in a file with intermixed blocks from the two different versions!

Just ensure that if one is mounted writable, it's the only one mounted writable if there's a chance of recombining, and you'll be fine, as it'll be the only one with advancing generations. And if by some accident both are mounted writable separately, the best bet is to be sure and wipe the one, then add it as a new device, if you're going to reintroduce it to the same filesystem.

Of course this gets a bit more complicated with 3+ device raid1, since currently there's still only two copies of each block and two copies of the checksum, meaning there's at least one device without a copy of each block, and if the filesystem is mounted degraded writable repeatedly with a random device missing...

Similarly, the permutations can be calculated for the other raid types, and for mixed raid types like raid6 data (specified) and raid1 metadata (unspecified so the default used), but I won't attempt that here.
-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: dear developers, can we have notdatacow + checksumming, plz?
On 2015-12-14 22:15, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
>>> When one starts to get a bit deeper into btrfs (from the admin/end-user side) one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.), the reason, AFAIU, being the inherent fragmentation that comes along with the CoW, which is especially noticeable for those types of files with lots of random internal writes.
>> It is worth pointing out that in the case of DBs at least, this is because at least some of them do COW internally to provide the transactional semantics that are required for many workloads.
> Guess that also applies to some VM images then, IIRC qcow2 does CoW.

Yep, and I think that VMWare's image format does too.

>>> a) for performance reasons (when I consider our research software which often has IO as the limiting factor and where we want as much IO being used by actual programs as possible)...
>> There are other things that can be done to improve this. I would assume of course that you're already doing some of them (stuff like using dedicated storage controller cards instead of the stuff on the motherboard), but some things often get overlooked, like actually taking the time to fine-tune the I/O scheduler for the workload (Linux has particularly brain-dead default settings for CFQ, and the deadline I/O scheduler is only good in hard-real-time usage or on small hard drives that actually use spinning disks).
> Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-) But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, and be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution.

nodatacow only works for that if the file is pre-allocated; if it isn't, then it still ends up fragmented.

>> The big argument for defragmenting a SSD is that it makes it such that you require fewer I/O requests to the device to read a file.
> I've read about that too, but since I haven't had much personal experience or measurements in that respect, I didn't list it :)

I can't give any real numbers, but I've seen noticeable performance improvements on good SSDs (Intel, Samsung, and Crucial) when making sure that things are defragmented.

>> The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk.
> Sure... but that's just the same for the nodatacow writes of data. (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.)

Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency.

>>> but I wouldn't know that relational DBs really do checksumming of the data.
>> All the ones I know of except GDBM and BerkDB do in fact provide the option of checksumming. It's pretty much mandatory if you want to be considered for usage in financial, military, or medical applications.
> Hmm, I see... PostgreSQL seems to have it since 9.3 ... didn't know that... only crc16, but at least something.

>>> Long story short, it does happen every now and then that a scrub shows file errors, while neither was the RAID broken, nor were there any block errors reported by the disks, or anything suspicious in SMART. In other words, silent block corruption.
>> Or a transient error in system RAM that ECC didn't catch, or an undetected error in the physical link layer to the disks, or an error in the disk cache or controller, or any number of other things.
> Well sure,... I was referring to these particular cases, where silent block corruption was the most likely reason. The data was reproducibly read identical, which probably rules out bad RAM or controller, etc.

>> BTRFS could only protect against some cases, not all (for example, if you have a big enough error in RAM that ECC doesn't catch it, you've got serious issues that just about nothing short of a cold reboot can save you from).
> Sure, I haven't claimed that checksumming for no-CoWed data is a solution for everything.

>>> But, AFAIU, not doing CoW, while not having a journal (or does it have one for these cases???), almost certainly means that the data (not necessarily the fs) will be inconsistent in case of a crash during a no-CoWed write anyway, right? Wouldn't it be basically like ext2?
>> Kind of, but not quite. Even with nodatacow, metadata is still COW, which is functionally as safe as a traditional journaling filesystem like XFS or ext4.
> Sure, I was referring to the data part only, should have made that more clear.

>> Absolute worst case scenario for both nodatacow on BTRFS, and a traditional journaling
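For what it's worth, the pre-allocation I keep mentioning is easy to do from the application side. A minimal sketch in C (the path and size here are made up; the same thing can be done from the shell with chattr +C plus fallocate) -- note that the NOCOW flag only sticks if it's set while the file is still empty:

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical path: a VM image that will see lots of in-place writes */
    int fd = open("/srv/vm/disk0.raw", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Set the NOCOW flag before any data or pre-allocation goes in;
     * it has no effect on a file that already has extents. */
    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
        attr |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &attr) != 0)
            perror("FS_IOC_SETFLAGS");
    }

    /* Pre-allocate the full size up front, so later in-place writes land
     * in a few large extents instead of fragmenting all over the place. */
    int err = posix_fallocate(fd, 0, (off_t)20 * 1024 * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return 0;
}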
Re: dear developers, can we have notdatacow + checksumming, plz?
On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote:
> My understanding of BTRFS is that the metadata referencing data blocks has the checksums for those blocks, then the blocks which link to that metadata (EG directory entries referencing file metadata) have checksums of those.

You mean basically, that all metadata is chained, right?

> For each metadata block there is a new version that is eventually linked from a new version of the tree root.
>
> This means that the regular checksum mechanisms can't work with nocow data. A filesystem can have checksums just pointing to data blocks, but you need to cater for the case where a corrupt metadata block points to an old version of a data block and a matching checksum. The way that BTRFS works with an entire checksummed tree means that there's no possibility of pointing to an old version of a data block.

Hmm, I'm not sure whether I understand that (or better said, I'm probably sure I don't :D).

AFAIU, the metadata is always CoWed, right? So when a nodatacow file is written, I'd assume its mtime is updated, which already leads to CoWing of metadata... just that now, the checksums would be written as well. If the metadata block is corrupt, then shouldn't that be noticed via the csums on that?

And you said "The way that BTRFS works with an entire checksummed tree means that there's no possibility of pointing to an old version of a data block"... how would that work for nodatacow'ed blocks? If there is a crash, it cannot know whether it was still the old block or the new one or any garbage in between?!

> The NetApp published research into hard drive errors indicates that they are usually in small numbers and located in small areas of the disk. So if BTRFS had a nocow file with any storage method other than dup, you would have metadata and file data far enough apart that they are not likely to be hit by the same corruption (and the same thing would apply with most Ext4 Inode tables and data blocks).

Well, put aside any such research (whose results aren't guaranteed to always be the case)... but that's just one reason from my motivation why I've said checksums for no-CoWed files would be great (I used the multi-device example though, not DUP).

> I think that a file mode where there were checksums on data blocks with no checksums on the metadata tree would be useful. But it would require a moderate amount of coding

Do you mean in general, or having this as a mode for nodatacow'ed files? Losing the metadata checksumming doesn't seem really much more appealing than not having data checksumming :-(

> and there's lots of other things that the developers are working on.

Sure, I just wanted to bring this to their attention... I already imagined that they wouldn't drop their current work to do that, just because of me whining for it ;-)

Thanks,
Chris.
Re: dear developers, can we have notdatacow + checksumming, plz?
On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
>> When one starts to get a bit deeper into btrfs (from the admin/end-user side) one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.), the reason, AFAIU, being the inherent fragmentation that comes along with the CoW, which is especially noticeable for those types of files with lots of random internal writes.
> It is worth pointing out that in the case of DBs at least, this is because at least some of them do COW internally to provide the transactional semantics that are required for many workloads.

Guess that also applies to some VM images then, IIRC qcow2 does CoW.

>> a) for performance reasons (when I consider our research software which often has IO as the limiting factor and where we want as much IO being used by actual programs as possible)...
> There are other things that can be done to improve this. I would assume of course that you're already doing some of them (stuff like using dedicated storage controller cards instead of the stuff on the motherboard), but some things often get overlooked, like actually taking the time to fine-tune the I/O scheduler for the workload (Linux has particularly brain-dead default settings for CFQ, and the deadline I/O scheduler is only good in hard-real-time usage or on small hard drives that actually use spinning disks).

Well sure, I think we'd done most of this and have dedicated controllers, at least of a quality that funding allows us ;-)
But regardless how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, and be it just 5%, to defragging these types of files, one may actually want to avoid this at all, for which nodatacow seems *the* solution.

> The big argument for defragmenting a SSD is that it makes it such that you require fewer I/O requests to the device to read a file

I've read about that too, but since I haven't had much personal experience or measurements in that respect, I didn't list it :)

> The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk.

Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.)

>> but I wouldn't know that relational DBs really do checksumming of the data.
> All the ones I know of except GDBM and BerkDB do in fact provide the option of checksumming. It's pretty much mandatory if you want to be considered for usage in financial, military, or medical applications.

Hmm, I see... PostgreSQL seems to have it since 9.3 ... didn't know that... only crc16, but at least something.

>> Long story short, it does happen every now and then that a scrub shows file errors, while neither was the RAID broken, nor were there any block errors reported by the disks, or anything suspicious in SMART. In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or an undetected error in the physical link layer to the disks, or an error in the disk cache or controller, or any number of other things.

Well sure,... I was referring to these particular cases, where silent block corruption was the most likely reason.
The data was reproducibly read identical, which probably rules out bad RAM or controller, etc.

> BTRFS could only protect against some cases, not all (for example, if you have a big enough error in RAM that ECC doesn't catch it, you've got serious issues that just about nothing short of a cold reboot can save you from).

Sure, I haven't claimed that checksumming for no-CoWed data is a solution for everything.

>> But, AFAIU, not doing CoW, while not having a journal (or does it have one for these cases???), almost certainly means that the data (not necessarily the fs) will be inconsistent in case of a crash during a no-CoWed write anyway, right? Wouldn't it be basically like ext2?
> Kind of, but not quite. Even with nodatacow, metadata is still COW, which is functionally as safe as a traditional journaling filesystem like XFS or ext4.

Sure, I was referring to the data part only, should have made that more clear.

> Absolute worst case scenario for both nodatacow on BTRFS, and a traditional journaling filesystem, the contents of the file are inconsistent. However, almost all of the things that are recommended use cases for nodatacow (primarily database files and VM images) have some internal method of detecting and dealing with corruption (because of the traditional filesystem semantics ensuring metadata consistency, but not data
Re: dear developers, can we have notdatacow + checksumming, plz?
On 2015-12-13 23:59, Christoph Anton Mitterer wrote:
> (consider that question being asked with that face on: http://goo.gl/LQaOuA)
>
> Hey.
>
> I've had some discussions on the list these days about not having checksumming with nodatacow (mostly with Hugo and Duncan).
> They both basically told me it wouldn't be straightforwardly possible with CoW, and Duncan thinks it may not be so much necessary, but none of them could give me really hard arguments why it cannot work (or perhaps I was just too stupid to understand them ^^)... while at the same time I think that it would be generally utmost important to have checksumming (real world examples below).
> Also, I remember that in 2014, Ted Ts'o told me that there are some plans ongoing to get data checksumming into ext4, with possibly even some guy at RH actually doing it sooner or later.
> Since these threads were rather admin-work-centric, developers may have skipped it; therefore, I decided to write down some thoughts, label them with a more attractive subject, and give it some bigger attention. O:-)
>
> 1) Motivation: why it makes sense to have checksumming (especially also in the nodatacow case)
>
> I think of all major btrfs features I know of (apart from the CoW itself and having things like reflinks), checksumming is perhaps the one that distinguishes it the most from traditional filesystems.
> Sure we have snapshots, multi-device support and compression - but we could have had that as well with LVM and software/hardware RAID... (and ntfs supported compression IIRC ;) ). Of course, btrfs does all that in a much smarter way, I know, but it's nothing generally new.
> The *data* checksumming at filesystem level, to my knowledge, is however. Especially that it's always verified. Awesome. :-)
>
> When one starts to get a bit deeper into btrfs (from the admin/end-user side) one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.), the reason, AFAIU, being the inherent fragmentation that comes along with the CoW, which is especially noticeable for those types of files with lots of random internal writes.

It is worth pointing out that in the case of DBs at least, this is because at least some of them do COW internally to provide the transactional semantics that are required for many workloads.

> Now Duncan implied that this could improve in the future, with the auto-defragmentation getting (even) better, defrag becoming usable again for those that do snapshots or reflinked copies, and btrfs itself generally maturing more and more.
> But I kinda wonder to what extent one will really be able to solve that, what seems to me a CoW-inherent "problem"... Even *if* one can make the auto-defrag much smarter, it would still mean that such files, like big DBs, VMs, or scientific datasets that are internally rewritten, may get more or less constantly defragmented. That may be quite undesired...
> a) for performance reasons (when I consider our research software which often has IO as the limiting factor and where we want as much IO being used by actual programs as possible)...

There are other things that can be done to improve this. I would assume of course that you're already doing some of them (stuff like using dedicated storage controller cards instead of the stuff on the motherboard), but some things often get overlooked, like actually taking the time to fine-tune the I/O scheduler for the workload (Linux has particularly brain-dead default settings for CFQ, and the deadline I/O scheduler is only good in hard-real-time usage or on small hard drives that actually use spinning disks).

> b) SSDs... Not really sure about that; btrfs seems to enable the autodefrag even when an SSD is detected,... what is it doing? Placing the blocks in a smart way on different chips so that accesses can be better parallelised by the controller?

This really isn't possible with an SSD. Except for NVMe and Open Channel SSD's, they use the same interfaces as a regular hard drive, which means you get absolutely no information about the data layout on the device. The big argument for defragmenting a SSD is that it makes it such that you require fewer I/O requests to the device to read a file, and in most cases, the device will outlive its usefulness because of performance long before it dies due to wearing out the flash storage.

> Anyway, (a) alone could already be argument enough not to solve the problem by a smart [auto-]defrag, should that actually be implemented.
> So I think having nodatacow is great, and not just a workaround till everything else gets better at handling these cases. Thus checksumming, which is such a vital feature, should also be possible for that.

The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk. If we could tell the disk 'ensure that this set of writes either all happen, or none of them happen', then we could do
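This lack of device-level atomicity is also why applications that care end up doing something like write-to-a-temporary-file-then-rename() for whole-file updates. A minimal sketch of that pattern in C (file names made up, directory fsync omitted for brevity) -- rename() is atomic at the namespace level, so readers see either the old or the new contents, never a torn mix:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int atomic_replace(const char *path, const char *tmp,
                          const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    /* write the new contents and make sure they are on disk
     * before the rename makes them visible under the real name */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}

int main(void)
{
    const char *data = "new configuration contents\n";
    return atomic_replace("config.txt", "config.txt.tmp",
                          data, strlen(data)) ? 1 : 0;
}

Of course that only works for whole-file replacement; it's exactly the in-place random rewrites of big DB and VM files that can't use this trick, which is what the rest of this discussion is about.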
Re: dear developers, can we have notdatacow + checksumming, plz?
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having checksumming with nodatacow (mostly with Hugo and Duncan).
>
> They both basically told me it wouldn't be straightforwardly possible with CoW, and Duncan thinks it may not be so much necessary, but none of them could give me really hard arguments why it cannot work (or perhaps I was just too stupid to understand them ^^)... while at the same time I think that it would be generally utmost important to have checksumming (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the checksums for those blocks, then the blocks which link to that metadata (EG directory entries referencing file metadata) have checksums of those. For each metadata block there is a new version that is eventually linked from a new version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data. A filesystem can have checksums just pointing to data blocks, but you need to cater for the case where a corrupt metadata block points to an old version of a data block and a matching checksum. The way that BTRFS works with an entire checksummed tree means that there's no possibility of pointing to an old version of a data block.

The NetApp published research into hard drive errors indicates that they are usually in small numbers and located in small areas of the disk. So if BTRFS had a nocow file with any storage method other than dup, you would have metadata and file data far enough apart that they are not likely to be hit by the same corruption (and the same thing would apply with most Ext4 Inode tables and data blocks).

I think that a file mode where there were checksums on data blocks with no checksums on the metadata tree would be useful. But it would require a moderate amount of coding, and there's lots of other things that the developers are working on.

-- 
My Main Blog        http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/
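To make the chaining described above a bit more concrete, here is a toy model in C (purely illustrative -- zlib's crc32() standing in for the real csums, and the structs bear no resemblance to the actual btrfs on-disk format):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

struct leaf { unsigned char data[64]; };
struct node { uint32_t child_crc[2]; };   /* checksums of the two leaves */
struct root { uint32_t node_crc; };       /* checksum of the node        */

static uint32_t csum(const void *p, size_t n)
{
    return crc32(0L, p, n);
}

int main(void)
{
    struct leaf leaves[2] = { { "old data" }, { "more data" } };
    struct node n = { { csum(&leaves[0], sizeof leaves[0]),
                        csum(&leaves[1], sizeof leaves[1]) } };
    struct root r = { csum(&n, sizeof n) };

    /* verify top-down: a node is only trusted once its own checksum
     * matches the parent's, so a stale or corrupt node fails here
     * before its (possibly stale) leaf checksums are believed */
    int node_ok = (csum(&n, sizeof n) == r.node_crc);
    int leaf_ok = node_ok &&
                  csum(&leaves[0], sizeof leaves[0]) == n.child_crc[0];
    printf("node ok: %d, leaf ok: %d\n", node_ok, leaf_ok);
    return 0;
}

The point is that every level is vouched for by the level above it, all the way up from the root; with nocow data updated in place, the leaf contents can change underneath a perfectly valid (but stale) checksum, which is exactly the case being discussed.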