Re: btrfs autodefrag?
Hugo Mills writes:

> It has to be disabled because if you enable it, there's a race
> condition: since you're overwriting existing data (rather than CoWing
> it), you can't update the checksums atomically. So, in the interests
> of consistency, checksums are disabled.

I suppose this has been suggested before, but couldn't it store both the new and the old checksums and be satisfied if either of them matches? The user is probably not happy that a partial write will be difficult to read back from the device due to a checksum error, but traditional filesystems make no promises about the state of recently-overwritten data after a sudden powerdown either, assuming there is no data journaling.

-- 
http://www.modeemi.fi/~flux/
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs autodefrag?
On 18 October 2015 at 16:46, Duncan <1i5t5.dun...@cox.net> wrote:
> Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:
>
>> Hi,
>>
>> On a desktop equipped with an ssd with one 100GB virtual image used
>> frequently, what do you recommend?
>> 1) nothing special, it is all fine as long as you have a recent kernel
>> (which I do)
>> 2) Disabling copy-on-write for just the VM image directory.
>> 3) autodefrag as a mount option.
>> 4) something else.
>>
>> I don't think this usecase is well documented, therefore I asked this
>> question.
[snip]
> So ssd or spinning rust, there's serious conflicts between nocow and
> snapshotting that really must be taken into consideration if you're
> planning to both snapshot and nocow.

This is all spot-on advice, but I just wanted to chime in to mention what I've been experimenting with:

- The active working copies of the VM image files are hosted on non-btrfs filesystems.
- A regularly scheduled rsync --inplace updates a copy of each file on a btrfs subvolume that *is* snapshotted and part of the regular send/receive runs.

rsync --inplace does what it says on the tin: it rewrites only those parts of a file which need to be updated. Thus the btrfs copy only gets written to once prior to each snapshot run, rather than continuously.

So the theory is that I can retain CoW storage efficiency (hold lots of snapshots cheaply) but still keep decent performance (by running the active, in-use working copies outside of my normal snapshotted btrfs filesystems).

The cost is obviously more filesystems than you'd normally have to run, more complex disaster recovery, not to mention storage sizing that has to accommodate a working copy on a separate fs in addition to the archived copies. Plus, this rsync approach has noticeably bigger I/O overhead than btrfs send/receive, although in my environment nobody is noticing.
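For the archives, the scheme described above might look something like the following as a cron-driven script. This is only a sketch: all paths, subvolume names, and the snapshot naming scheme are hypothetical examples, not the poster's actual setup.

```shell
#!/bin/sh
# Sketch of the "live copy elsewhere, rsync --inplace into btrfs, then
# snapshot" scheme. Paths and names below are illustrative only.
set -e

LIVE=/var/lib/libvirt/images/vm.img   # active copy, on a non-btrfs fs
ARCHIVE=/mnt/btrfs/vm-archive         # btrfs subvolume holding the copy
SNAPDIR=/mnt/btrfs/snapshots

# --inplace rewrites only the changed regions of the destination file
# instead of building a temporary copy, so unchanged blocks continue to
# share extents with earlier snapshots.
rsync --inplace "$LIVE" "$ARCHIVE/vm.img"

# Read-only snapshot, usable as a btrfs send/receive source.
btrfs subvolume snapshot -r "$ARCHIVE" "$SNAPDIR/vm-$(date +%Y%m%d-%H%M)"
```

Run from cron at whatever cadence the snapshot schedule uses; the only write traffic the btrfs side sees is the one rsync pass per cycle.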
Re: btrfs autodefrag?
On 2015-10-19 02:19, Erkki Seppala wrote:
> Hugo Mills writes:
>> It has to be disabled because if you enable it, there's a race
>> condition: since you're overwriting existing data (rather than CoWing
>> it), you can't update the checksums atomically. So, in the interests
>> of consistency, checksums are disabled.
>
> I suppose this has been suggested before, but couldn't it store both
> the new and the old checksums and be satisfied if either of them
> matches?

Actually, I don't think that has been suggested before; read on, however, for an explanation of why we don't do that.

> The user is probably not happy that a partial write is going to be
> difficult to read from the device due to a checksum error, but there
> is no promise of recently-overwritten data state with traditional
> filesystems either in case of sudden powerdown, assuming there is no
> data journaling.

And that is exactly the case with how things are now: when something is marked NOCOW, it has essentially zero guarantee of data consistency after a crash. As things are now, though, there is a guarantee that you can still read the file; using checksums like you suggest would instead leave it unreadable most of the time, because it's statistically unlikely that we wrote the _whole_ block (IOW, without COW we can't guarantee that the data was completely written), because:

a. While some disks do atomically write single sectors, most don't, and if the power dies while the disk is writing a single sector, there is no certainty exactly what that sector will read back as.

b. Assuming that item a is not an issue, one block in BTRFS is usually multiple sectors on disk, and a majority of disks have volatile write caches, thus it is not unlikely that the power will die during the process of writing the block.

c. In the event that both items a and b are not an issue (for example, you have a storage controller with a non-volatile write cache, have write caching turned off on the disks, and the storage controller is smart enough to remove writes from its cache only after they complete), then there is still the small but distinct possibility that the crash will cause either corruption in the write cache, or some other hardware-related issue.
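The core of this argument can be demonstrated with plain shell tools, no btrfs required. The sketch below builds a 16 KiB "filesystem block" out of four 4 KiB "device sectors" and simulates a torn write in which only the first two sectors of the new data reach the disk: the resulting block matches neither the old nor the new checksum, so a "store both checksums and accept either" scheme would still reject it. (All file names are illustrative; sha256sum stands in for whatever checksum the filesystem uses.)

```shell
# Torn-write demonstration: a half-old, half-new block matches neither
# the pre-write checksum nor the post-write checksum.
set -e
dir=$(mktemp -d)

# Old block: 16 KiB of 'A'.  New block: 16 KiB of 'B'.
head -c 16384 /dev/zero | tr '\0' 'A' > "$dir/old"
head -c 16384 /dev/zero | tr '\0' 'B' > "$dir/new"

# Torn write: only the first two 4 KiB "sectors" of the new data
# made it to disk before power loss; the rest is still old data.
head -c 8192 "$dir/new"  > "$dir/torn"
tail -c 8192 "$dir/old" >> "$dir/torn"

sum_old=$(sha256sum "$dir/old"  | cut -d' ' -f1)
sum_new=$(sha256sum "$dir/new"  | cut -d' ' -f1)
sum_torn=$(sha256sum "$dir/torn" | cut -d' ' -f1)

# Neither stored checksum matches the block actually on disk.
[ "$sum_torn" != "$sum_old" ] && [ "$sum_torn" != "$sum_new" ] \
    && echo "torn block matches neither checksum"
rm -r "$dir"
```

So dual checksums only help when the write is all-or-nothing at the block level, which is exactly what items a and b above say cannot be assumed.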
Re: btrfs autodefrag?
Austin S Hemmelgarn writes:

> And that is exactly the case with how things are now, when something
> is marked NOCOW, it has essentially zero guarantee of data consistency
> after a crash.

Yes. In addition to the zero guarantee of validity for the data being written into, btrfs also doesn't give any guarantees for the rest of the data, even if it was perfectly quiescent, but was just marked COW at the time it was written :).

> As things are now though, there is a guarantee that you can still
> read the file, but using checksums like you suggest would result in
> it being unreadable most of the time, because it's statistically
> unlikely that we wrote the _whole_ block (IOW, we can't guarantee
> without COW that the data was completely written) because:

Well, the amount of data being written at any given time is very small compared to the whole device, so it's not all the data that is at risk of having a wrong checksum. Given how small blocks are (4k), I really doubt that the likelihood of large amounts of data remaining unreadable would be great.

However, here's a compromise: when detecting an error on a COW file, instead of refusing to read it, produce a warning in the kernel log. In addition, when scrubbing it, as a last resort after trying the other copies, the checksum could simply be repaired, paired with an appropriate log message. Such a log message would not indicate that the data is wrong, but that the system administrator might be interested in checking it, for example against backups, or perhaps by running a scrub within the virtual machine. If that scrub said everything is OK, then certainly everything would be OK.

> a. While some disks do atomically write single sectors, most don't,
> and if the power dies during the disk writing a single sector, there
> is no certainty exactly what that sector will read back as.

So it seems that the majority vote is not to provide a feature to the minority.. :)

> b. Assuming that item a is not an issue, one block in BTRFS is usually
> multiple sectors on disk, and a majority of disks have volatile write
> caches, thus it is not unlikely that the power will die during the
> process of writing the block.

I'm not at all familiar with the on-disk structure of Btrfs, but it seems that the block size is indeed 16 kilobytes by default, so the risk of one of the four device blocks (on modern 4 kB-sector HDDs) being corrupted, or only some of them having been written, is real. But there's only so much data in flight at any given time.

I did read (on Wikipedia, Btrfs#Checksum_tree) that there are two checksums: one per block, and one per contiguous run of allocated blocks. The latter checksum seems more likely to be broken, but I don't see why in that case the per-block checksums (or one of the two checksums I proposed) couldn't be consulted. This is of course said without my understanding much of the Btrfs on-disk format, technical feasibility be damned :). I understand that the metadata is always COW, so that level of corruption cannot occur.

> c. In the event that both items a and b are not an issue (for example,
> you have a storage controller with a non-volatile write cache, have
> write caching turned off on the disks, and it's a smart enough storage
> controller that it only removes writes from the cache after they
> return), then there is still the small but distinct possibility that
> the crash will cause either corruption in the write cache, or some
> other hardware related issue.

However, when this is not the case, for example on a machine that is never brought down abruptly, it could still be valuable to see that the data has not changed behind my back. I understand that is the prime motivation behind btrfs scrubbing in any case; otherwise there could be a faster 'queue a verify after a write' mechanism that would never scrub the same data twice.
Re: btrfs autodefrag?
On 2015-10-19 12:13, Erkki Seppala wrote:
>> And that is exactly the case with how things are now, when something
>> is marked NOCOW, it has essentially zero guarantee of data
>> consistency after a crash.
>
> Yes. In addition to the zero guarantee of the data validity for the
> data being written into, btrfs also doesn't give any guarantees for
> the rest of the data, even if it was perfectly quiescent, but was just
> marked COW at the time it was written :).

Assuming you do actually mean COW and not NOCOW: for COW data there is a guarantee that the data will either:

1. Match the original data prior to the write; or
2. Match the data that was written; or, if you are using only a single copy of the metadata blocks and the system crashes exactly during a write to a metadata block:
3. Everything under that metadata block will become inaccessible and require btrfs-progs to recover.

In the case of NOCOW, however, there is absolutely no such guarantee (just as ext4, for example, cannot provide one), and any of the above could be the case, or any arbitrary portion of the new data could have been written.

>> As things are now though, there is a guarantee that you can still
>> read the file, but using checksums like you suggest would result in
>> it being unreadable most of the time, because it's statistically
>> unlikely that we wrote the _whole_ block (IOW, we can't guarantee
>> without COW that the data was completely written) because:
>
> Well, the amount of data being written at any given time is very small
> compared to the whole device. So it's not all the data that is at risk
> of having the wrong checksum. Given how small blocks are (4k) I really
> doubt that the likelihood of large amounts of data remaining
> unreadable would be great.

That very much depends on how you are using things. For many of the workloads NOCOW should be used for, direct I/O and AIO are also very commonly used, and those can write chunks much bigger than BTRFS's block size in one go.

> However, here's a compromise: when detecting an error on a COW file,
> instead of refusing to read it, produce a warning to the kernel log.
> In addition, when scrubbing it, the last resort after trying other
> copies the checksum could simply be repaired, paired with an
> appropriate log message. Such a log message would not indicate that
> the data is wrong, but that the system administrator might be
> interested in checking it, for example against backups, or by perhaps
> running a scrub within the virtual machine.

In this case I'm assuming you mean NOCOW instead of COW, as corruption can't currently be detected in a NOCOW file by BTRFS. In a significant majority of cases, it is actually better to return no data than to return known-corrupted data (think medical or military applications; in those kinds of cases it's quite often worse to act on incorrect data than not to act at all). Disk images for virtual machines are one of the rare cases where this is not true, simply because they can usually correct the corruption themselves.

> If the scrub would say everything is OK, then certainly everything
> would be OK.

That's a _very_ optimistic point of view, and doesn't take into account software bugs or potential hardware problems.

>> a. While some disks do atomically write single sectors, most don't,
>> and if the power dies during the disk writing a single sector, there
>> is no certainty exactly what that sector will read back as.
>
> So it seems that the majority vote is to not to provide a feature to
> the minority.. :)

For something that provides a false sense of data safety and is potentially easy to shoot yourself in the foot with? Yes, we will almost certainly not provide it. If, however, you wish to write a patch providing such a feature (or pay someone to do so for you), there is nothing stopping you, and if it's something that people actually want, then it will likely end up included.

>> b. Assuming that item a is not an issue, one block in BTRFS is
>> usually multiple sectors on disk, and a majority of disks have
>> volatile write caches, thus it is not unlikely that the power will
>> die during the process of writing the block.
>
> I'm not at all familiar with the on-disk structure of Btrfs, but it
> seems that indeed the block size is 16 kilobytes by default, so the
> risk of one of the four device-blocks (on modern 4kB-sector HDDs)
> being corrupted or only a set of them having been written is real.
> But, there's only so much data in-flight at any given time.

While the default is usually 16k, there are situations where it may be different, for example if the system has a page size greater than 16k (some ARM64, PPC, and MIPS systems use 64k pages), or if it's a small filesystem (in which case the blocks will be 4k). It is also worth noting that while most 'modern' HDDs use 4k sectors:

1. They are still vastly outnumbered by older HDDs that use 512-byte sectors.
2. A
Re: btrfs autodefrag?
On Sun, Oct 18, 2015 at 10:24:39AM -0400, Rich Freeman wrote:
> On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata wrote:
>> 2) Disabling copy-on-write for just the VM image directory.
>
> Unless this has changed, doing this will also disable checksumming. I
> don't see any reason why it has to, but it does. So, I avoid using
> this at all costs.

It has to be disabled because if you enable it, there's a race condition: since you're overwriting existing data (rather than CoWing it), you can't update the checksums atomically. So, in the interests of consistency, checksums are disabled.

Hugo.

-- 
Hugo Mills | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/ | PGP: E2AB1DE4
Re: btrfs autodefrag?
On Sat, Oct 17, 2015 at 12:36 PM, Xavier Gnata wrote:
> 2) Disabling copy-on-write for just the VM image directory.

Unless this has changed, doing this will also disable checksumming. I don't see any reason why it has to, but it does. So, I avoid using this at all costs.

-- 
Rich
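For reference, option 2 from the question is usually done with chattr's `C` (No_COW) attribute. A minimal sketch, with hypothetical paths: note that +C only takes reliable effect on files created (or still empty) after the flag is set on the directory, so it won't help an existing, populated image file.

```shell
#!/bin/sh
# Sketch: mark a directory NOCOW so that files created in it afterwards
# inherit the attribute. Paths and image names are examples only.
set -e

mkdir -p /var/lib/libvirt/images
chattr +C /var/lib/libvirt/images    # new files here will be NOCOW

# Images created from now on are NOCOW -- and therefore, as noted
# above, also unchecksummed by btrfs:
qemu-img create -f raw /var/lib/libvirt/images/vm.img 100G
lsattr /var/lib/libvirt/images/vm.img   # 'C' appears in the flag list
```

Whether giving up checksumming is an acceptable trade is exactly the debate in this thread.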
Re: btrfs autodefrag?
Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:

> Hi,
>
> On a desktop equipped with an ssd with one 100GB virtual image used
> frequently, what do you recommend?
> 1) nothing special, it is all fine as long as you have a recent kernel
> (which I do)
> 2) Disabling copy-on-write for just the VM image directory.
> 3) autodefrag as a mount option.
> 4) something else.
>
> I don't think this usecase is well documented therefore I asked this
> question.

You are correct. The VM-images-on-ssd use-case /isn't/ particularly well documented, I'd guess because people have differing opinions, and, indeed, actual observed behavior, and thus recommendations even in the ideal case may well differ depending on the specs and firmware of the ssd. The documentation tends to be aimed at the spinning-rust case.

There's one detail of the use-case (besides ssd specs), however, that you didn't mention, that could have a big impact on the recommendation. What sort of btrfs snapshotting are you planning to do, and if you're doing snapshots, does your use-case really need them to include the VM image file?

Snapshots are a big issue for anything that you might set nocow, because snapshot functionality assumes and requires cow, and thus conflicts, to some extent, with nocow. A snapshot locks the existing extents in place, so they can no longer be modified. On a normal btrfs cow-based file, that's not an issue, since any modifications would be cowed elsewhere anyway -- that's how btrfs normally works. On a nocow file, however, there's a problem, because once the snapshot locks the existing version in place, the first change to a specific block (normally 4 KiB) *MUST* be cowed, despite the nocow attribute, because rewriting in place would alter the snapshot. The nocow attribute remains in place, however, and further writes to the same block will again be nocow... to the new block location established by that first post-snapshot write...
until the next snapshot comes along and locks that too in place, of course. This sort of cow-only-once behavior is sometimes called cow1.

If you only do very occasional snapshots, probably manually, this cow1 behavior isn't /so/ bad, tho the file will still fragment over time as more and more bits of it are written and rewritten after the few snapshots that are taken. However, for people doing frequent, generally schedule-automated snapshots, the nocow attribute is effectively nullified as all those snapshots force cow1s over and over again.

So ssd or spinning rust, there's serious conflicts between nocow and snapshotting that really must be taken into consideration if you're planning to both snapshot and nocow.

For use-cases that don't require snapshotting of the nocow files, the simplest workaround is to put any nocow files on dedicated subvolumes. Since snapshots stop at subvolume boundaries, having nocow files on dedicated subvolume(s) stops snapshots of the parent from including them, thus avoiding the cow1 situation entirely.

If the use-case requires snapshotting of nocow files, the workaround that has been reported to work (mostly on spinning rust, where fragmentation is a far worse problem due to non-zero seek times) is first to reduce snapshotting to a minimum -- if it was going to be hourly, consider daily or every 12 hours if you can get away with it; if it was going to be daily, consider every other day or weekly. Less snapshotting means fewer cow1s and thus directly affects how quickly fragmentation becomes a problem. Again, dedicated subvolumes can help here, allowing you to snapshot the nocow files on a different schedule than you do the up-hierarchy parent subvolume.

Second, schedule periodic manual defrags of the nocow files, so the fragmentation that does occur is at least kept manageable. If the snapshotting is daily, consider weekly or monthly defrags. If it's weekly, consider monthly or quarterly defrags.
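The two workarounds above are straightforward to script. The following is a sketch under assumed (hypothetical) subvolume names and paths; the actual layout and schedules are up to the admin.

```shell
#!/bin/sh
# Sketch of the workarounds described above. Names/paths are examples.
set -e

# Workaround 1: a dedicated subvolume for nocow files. Snapshots of the
# parent subvolume (here /mnt/btrfs/@home) stop at this boundary, so
# the VM images inside are not cow1'd on every parent snapshot.
btrfs subvolume create /mnt/btrfs/@vm-images
chattr +C /mnt/btrfs/@vm-images      # new files inherit the nocow flag

# Workaround 2: if the nocow files do get snapshotted on their own
# (less frequent) schedule, a periodic defrag -- say monthly, from
# cron -- keeps the cow1-induced fragmentation manageable.
btrfs filesystem defragment -r /mnt/btrfs/@vm-images
```

The defrag frequency should track the snapshot frequency, per the daily/weekly vs. weekly/monthly guidance above.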
Again, various people who do need to snapshot their nocow files have reported that this really does help, keeping fragmentation to at least some sanely managed level.

That's the snapshot vs. nocow problem in general. With luck, however, you can avoid snapshotting the files in question entirely, thus factoring this issue out of the equation entirely.

Now to the ssd issue. On ssds in general, there are two very major differences we need to consider vs. spinning rust. One, fragmentation isn't as much of a problem as it is on spinning rust. It's still worth keeping to a minimum, because as the number of fragments increases, so does both btrfs and device overhead, but it's not the nearly everything-overriding consideration that it is on spinning rust. Two, ssds have a limited write-cycle factor to consider, where with spinning rust the write-cycle limit is effectively infinite... at least compared to the much lower limit of ssds. The weighing of these two overriding ssd factors, one against the other, along