Re: Are nocow files snapshot-aware
Duncan 1i5t5.dun...@cox.net schrieb: [...] Difficult to twist your mind around that but well explained. ;-)

A snapshot thus looks much like a crash in terms of NOCOW file integrity, since the blocks of a NOCOW file are simply snapshotted in-place, and there's already no checksumming or file integrity verification on such files -- they're simply written directly in-place (with the exception of a single COW write when a writable snapshotted NOCOW file diverges from the shared snapshot version). But as I said, the applications themselves are normally designed to handle and recover from crashes, and in fact, having btrfs try to manage it too only complicates things and can actually make it impossible for the app to recover what it would otherwise have recovered just fine. So it should be with these NOCOW in-place snapshotted files, too. If a NOCOW file is put back into operation from a snapshot, and the file was being written to at snapshot time, it will very likely trigger exactly the same response from the application as a crash while writing would have triggered. But, and that's the point, such applications are normally designed to deal with just that, and thus they should recover just as they would from a crash. If they could recover from a crash, it shouldn't be an issue. If they couldn't, well...

So the common-sense conclusion is that taking a snapshot looks like a crash from the application's perspective. 
That means if there are facilities to instruct the application to suspend its operations first, you should use them - like in the InnoDB case (http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html):

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;

Only that way do you get consistent snapshots that won't trigger crash-recovery (which might otherwise throw away unrecoverable transactions or otherwise harm your data for the sake of consistency). InnoDB is more or less like a VM filesystem image on btrfs in this case, so the same approach should be taken for VM images if possible. I think VMware has facilities to prepare the guest for a snapshot being taken (it is triggered when you take snapshots with VMware itself, and by the way it usually takes much longer than btrfs snapshots do).

Take xfs for example: Although it is crash-safe, during log-replay it prefers to zero out your files for security reasons - because it is crash-safe only for meta-data: if meta-data has already allocated blocks but file-data has not yet been written, a recovered file could otherwise end up with wrong (stale) content, so it's cleared out. This _IS_NOT_ the situation you want for VM images with xfs inside, hosted on btrfs, when taking a snapshot. You should trigger xfs_freeze in the guest before taking the btrfs snapshot in the host. I think the same holds true for most other meta-data-only-journalling file systems, which probably do not even zero out files during recovery and just silently corrupt your files during crash-recovery. So in case of a crash or snapshot (which look the same from the application perspective), btrfs' capabilities won't help you here (at least in the nocow case, probably in the cow case too, because the VM guest may write blocks out-of-order without having the possibility to pass write-barriers down to the btrfs cow mechanism). 
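The same freeze-then-snapshot sequence can be sketched for a VM guest hosted on btrfs. This is only an illustration: the guest name, mount points and subvolume paths (guest1, /data, /srv/vm, /srv/snap) are made up, and the commands need root on both sides.

```shell
#!/bin/sh
# Sketch: quiesce an xfs filesystem inside a VM guest, snapshot the
# host-side btrfs subvolume holding the image, then thaw the guest.
# guest1, /data, /srv/vm and /srv/snap are hypothetical names.
set -e
ssh root@guest1 xfs_freeze -f /data    # flush and block writes inside the guest
btrfs subvolume snapshot -r /srv/vm "/srv/snap/vm-$(date +%Y%m%d-%H%M%S)"
ssh root@guest1 xfs_freeze -u /data    # resume writes in the guest
```

The freeze window only needs to cover the snapshot command itself, which on btrfs is near-instant, so the guest stalls for well under a second in the common case.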
Taking snapshots of database files or VM images without proper preparation only guarantees you crash-like rollback situations. Taking snapshots even at short intervals only makes this worse, with all the extra downsides this has within btrfs. I think this is important to understand for people planning to do automated snapshots of such file data. Making a file nocow only helps the situation during normal operation - but after a snapshot, a nocow file is essentially cow while blocks from the old generation are carried over to the new subvolume generation during writes. -- Replies to list only preferred. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Chris Murphy li...@colorremedies.com schrieb: If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful. How fast is this state fixed on disk from the time of the snapshot command? Loosely speaking. I'm curious if this is 1 second; a few seconds; or possibly up to the 30 second default commit interval? And also if it's even related to the commit interval time at all?

Such constructs can only be crash-safe if write-barriers are passed down through the cow logic of btrfs to the storage layer. That will probably never happen. Atomic and transactional updates cannot happen without write-barriers or synchronous writes. To make it work, you need to design the storage layers from the ground up to work without write-barriers, like having battery-backed write-caches, synchronous logical file-system layers etc. Otherwise, database/vm/whatever transactional/atomic writes just have an undefined status down at the lowest storage layer.

I'm also curious what happens to files that are presently writing. e.g. I'm writing a 1GB file to subvol A and before it completes I snapshot subvol A into A.1. If I go find the file I was writing to, in A.1, what's its state? Truncated? Or are in-progress writes permitted to complete if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?

I wondered that many times, too. What happens to files being written to? I suppose, at the time of snapshotting it takes the current state of the blocks as they are, ignoring pending writes. This means the file being written to is probably in a limbo state. For example, xfs has an option to freeze the file system to take atomic snapshots. You can use that feature to take consistent snapshots of MySQL InnoDB files to create a hot-copy backup. But: You need to instruct MySQL first to complete its transactions and pause before running xfs_freeze; then, after that's done, you can resume MySQL operations. 
That clearly tells me that it is probably not safe to take snapshots of online databases, even if they are crash-safe (and as far as I know, InnoDB is designed to be crash-safe). A solution, probably far-future, could be that a btrfs snapshot would inform all current file-writers to complete transactions and atomic operations and wait until each one signals a ready state, then take the snapshot, then signal the processes to resume operations. For this, the btrfs driver could offer some sort of subscription, similar to what inotify offers. Processes subscribe to some sort of notification broadcasts, and btrfs can wait for every process to report an integral file state. If I remember right, reiser4 offered a similar feature (approaching the problem from the opposite side): processes were offered an interface to start and commit transactions within reiser4. If btrfs had such information from file-writers, it could take consistent snapshots of online databases/vms/whatever (given that, in the VM case, the guest could pass this information to the host). Whatever approach is taken, however, it will make the time needed to create snapshots nondeterministic; processes may not finish their transactions within a reasonable time...
Re: Are nocow files snapshot-aware
On Feb 7, 2014, at 2:07 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote: Chris Murphy li...@colorremedies.com schrieb: If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful. How fast is this state fixed on disk from the time of the snapshot command? Loosely speaking. I'm curious if this is 1 second; a few seconds; or possibly up to the 30 second default commit interval? And also if it's even related to the commit interval time at all? Such constructs can only be crash-safe if write-barriers are passed down through the cow logic of btrfs to the storage layer. That will probably never happen. Atomic and transactional updates cannot happen without write-barriers or synchronous writes. To make it work, you need to design the storage layers from the ground up to work without write-barriers, like having battery-backed write-caches, synchronous logical file-system layers etc. Otherwise, database/vm/whatever transactional/atomic writes just have an undefined status down at the lowest storage layer.

This explanation makes sense. But I failed to qualify the state fixed on disk. I'm not concerned about when bits actually arrive on disk. I'm wondering what state they describe. So assume no crash or power failure, and assume writes eventually make it onto the media without a problem. What I'm wondering is, what state of the subvolume I'm snapshotting do I end up with? Is there a delay and how long is it, or is it pretty much instant? The command completes really quickly even when the file system is actively being used, so the feedback is that the snapshot state is established very fast, but I'm not sure what bearing that has in reality. Chris Murphy
Re: Are nocow files snapshot-aware
Duncan 1i5t5.dun...@cox.net schrieb: The question here is: Does it really make sense to create such snapshots of disk images currently online and running a system. They will probably be broken anyway after rollback - or at least I'd not fully trust the contents. VM images should not be part of a subvolume of which snapshots are taken at a regular and short interval. The problem will go away if you follow this rule. The same applies to probably any kind of file which you make nocow - e.g. database files. The only use case is taking _controlled_ snapshots - and doing it every 30 seconds is by all means NOT controlled, it's completely nondeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it, as at the time I didn't understand the interaction between NOCOW and snapshots and couldn't quite understand how a NOCOW file was still triggering the snapshot-aware-defrag pathology, which in fact we were just beginning to realize based on such reports.

Sorry, didn't mean to push it to you. ;-) I just wanted to give some pointers to rethink such practices for people stumbling upon this.

But some of the snapshotting scripts out there, and the admins running them, seem to have the idea that just because it's possible it must be done, and they have snapshots taken every minute or more frequently, with no automated snapshot thinning at all. IMO that's pathology run amok even if btrfs /was/ stable and mature and /could/ handle it properly.

Yeah, people should stop such bullshit practice (sorry), regardless of whether there's a technical problem with it. It does not give the protection they intended. It's just a false sense of security/safety... There _may_ be actual use cases for doing it - but generally I'd suggest it's plain wrong. That's regardless of the content, so it's from a different angle than the one you were attacking the problem from... 
But if admins aren't able to recognize the problem with per-minute snapshots without any thinning at all for days, weeks, months on end, I doubt they'll be any better at recognizing that VMs, databases, etc, should have a dedicated subvolume. True. But be that as it may, since such extreme snapshotting /is/ possible, and with automation and downloadable snapper scripts somebody WILL be doing it, btrfs should scale to it if it is to be considered mature and stable. People don't want a filesystem that's going to fall over on them and lose data or simply become unworkably live-locked just because they didn't know what they were doing when they set up the snapper script and set it to 1-minute snaps without any corresponding thinning after an hour or a day or whatever.

Such, uhm, sorry, bullshit practice should not be a high priority on the fix-list for btrfs. There are other areas. It's a technical problem, yes, but I think there are more important ones than brute-forcing problems out of btrfs that are never hit by normal usage patterns. It is good that such tests are done, but I do not understand how people can expect to need such a feature - now and at once. Such tests are not ready to leave the development sandbox yet. From a normal-use perspective, doing such heavy snapshotting is probably almost always nonsense.

I'd be more interested in how btrfs behaves under highly IO-loaded server workloads. One interesting use case for me would be to use btrfs as the building block of a system with container virtualization (docker, lxc), allowing a high VM density on the machine (with the IO load and unpredictable IO behavior that internet-facing servers apply to their storage layer), using btrfs snapshots to instantly create new VMs from VM templates living in subvolumes (thin provisioning), spreading btrfs across a higher number of disks than the average desktop user / standard server has. 
I think this is one of many very interesting use cases for btrfs and its capabilities. And this is how we get back to my initial question: In such a scenario I'd like to take ro snapshots of all machines (which probably host nocow files for databases), send these to a backup server at low IO priority, then remove the snapshots. Apparently, btrfs send/receive is still far from being stable and bullet-proof from what I read here, so the destination would probably be another btrfs or zfs, using in-place rsync backups and snapshotting for backlog.
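That fallback (rsync from a read-only snapshot into a btrfs target, then snapshot the target for history) could look roughly like this. All names here are invented for illustration: the remote (vmhost), the snapshot path, and the backup subvolume layout.

```shell
#!/bin/sh
# Sketch: pull the contents of a read-only snapshot to a backup server
# with rsync --inplace, then snapshot the backup subvolume for backlog.
# root@vmhost, /srv/snap/vm-latest and /backup/vmhost are hypothetical.
set -e
src="root@vmhost:/srv/snap/vm-latest/"
dst="/backup/vmhost"                        # a btrfs subvolume on the backup box
ionice -c3 rsync -aHAX --inplace --delete "$src" "$dst/"
btrfs subvolume snapshot -r "$dst" "$dst-$(date +%F)"
```

The point of --inplace is that rsync rewrites changed blocks inside the existing files instead of recreating them, so unchanged blocks stay shared with the older read-only snapshots of the backup subvolume and the backlog stays cheap.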
Re: Are nocow files snapshot-aware
Chris Murphy li...@colorremedies.com schrieb: On Feb 7, 2014, at 2:07 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote: Chris Murphy li...@colorremedies.com schrieb: If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful. How fast is this state fixed on disk from the time of the snapshot command? Loosely speaking. I'm curious if this is 1 second; a few seconds; or possibly up to the 30 second default commit interval? And also if it's even related to the commit interval time at all? Such constructs can only be crash-safe if write-barriers are passed down through the cow logic of btrfs to the storage layer. That will probably never happen. Atomic and transactional updates cannot happen without write-barriers or synchronous writes. To make it work, you need to design the storage layers from the ground up to work without write-barriers, like having battery-backed write-caches, synchronous logical file-system layers etc. Otherwise, database/vm/whatever transactional/atomic writes just have an undefined status down at the lowest storage layer. This explanation makes sense. But I failed to qualify the state fixed on disk. I'm not concerned about when bits actually arrive on disk. I'm wondering what state they describe. So assume no crash or power failure, and assume writes eventually make it onto the media without a problem. What I'm wondering is, what state of the subvolume I'm snapshotting do I end up with? Is there a delay and how long is it, or is it pretty much instant? The command completes really quickly even when the file system is actively being used, so the feedback is that the snapshot state is established very fast but I'm not sure what bearing that has in reality.

I think from that perspective it is more or less the same whether you take a snapshot or cycle the power. For file consistency it means the same thing, I suppose. 
I got your argument about state fixed on disk, but I implied that, from the perspective of the writing process, it is just the same situation: in the moment of the snapshot the data file is in a crashed state. That is like cycling the power without having a mechanism to support transactional guarantees.

So the question is: Do btrfs snapshots give the same guarantees on the filesystem level that write-barriers give on the storage level, which is exactly what those processes rely upon? The cleanest solution would be if processes could give btrfs hints about what belongs to their transactions, so that in the moment of a snapshot the data file would be in a clean state. I guess snapshots are atomic in the sense that pending writes will never reach the snapshot just taken, which is good. But what about the ordering of writes? Maybe some younger write requests already made it to the disk while older ones didn't. The file system usually only has to care about its own transactional integrity, not that of its writing processes, and that is completely unrelated to what the writing process expects. Or in other words: after a crash, only the active subvolume being written to is guaranteed clean from the transactional perspective of the process; the snapshot may be broken.

As far as I know, user processes cannot tell the filesystem when to issue write-barriers; they can only issue fsyncs (which hurt performance). Otherwise this discussion would be a whole different story. Did you test how btrfs snapshots perform while running fsync with a lot of data to be committed? Could give a clue...
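The write-then-fsync pattern mentioned above is the only ordering tool a normal process has: write the data, then force it to stable storage before acknowledging the transaction. A minimal sketch in shell, assuming GNU coreutils (where `sync FILE` acts as an fsync(2) on that file, coreutils >= 8.24); the journal path and record are made up:

```shell
#!/bin/sh
# Sketch of the write-then-fsync pattern applications use for durability.
# Without the sync step, the kernel may delay or reorder the write freely.
set -e
db="${1:-/tmp/demo-journal.log}"
printf 'txn-42:committed\n' >> "$db"  # buffered write, not yet durable
sync "$db"                            # flush this file's data, like fsync(2)
echo "txn-42 durable"
```

Every such fsync forces work out of the page cache, which is why transaction-heavy workloads are sensitive to how the filesystem handles it.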
Re: Are nocow files snapshot-aware
Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted: So the question is: Do btrfs snapshots give the same guarantees on the filesystem level that write-barriers give on the storage level which exactly those processes rely upon? The cleanest solution would be if processes could give btrfs hints about what belongs to their transactions so in the moment of a snapshot the data file would be in clean state. I guess snapshots are atomic in that way, that pending writes will never reach the snapshots just taken, which is good.

Keep in mind that btrfs' metadata is COW-based also. Like reiser4 in this way, in theory at least, commits are atomic -- they've either made it to disk or they haven't; there's no halfway state. Commits at the leaf level propagate up the tree, and are not finalized until the top-level root node is written. AFAIK if there's dirty data to write, btrfs triggers a root node commit every 30 seconds. Until that root is rewritten, it points to the last consistent-state written root node. Once it's rewritten, it points to the new one and a new set of writes is started, only to be finalized at the next root node write. And I believe that final write simply updates a pointer to point at the latest root node.

There's also a history of root nodes, which is what the btrfs-find-root tool uses in combination with btrfs restore, if necessary, to find a valid root from the root node pointer log if the system crashed in the middle of that final update so the pointer ends up pointing at garbage. Meanwhile, I'm a bit blurry on this but if I understand things correctly, between root node writes/full-filesystem-commits there's a log of transaction completions at the atomic individual transaction level, such that even transactions completed between root node writes can normally be replayed. 
Of course this is only ~30 seconds worth of activity max, since the root node writes should occur every 30 seconds, but this is what btrfs-zero-log zeroes out, if/when needed. You'll lose those few seconds of log replay since the last root node write, but if it was garbage data due to it being written when the system actually went down, dropping those few extra seconds of log can allow the filesystem to mount properly from the last full root node commit, where it couldn't, otherwise.

It's actually those metadata trees and the atomic root-node commit feature that btrfs snapshots depend on, and why they're normally so fast to create. When a snapshot is taken, btrfs simply keeps a record of the current root node instead of letting it recede into history and fall off the end of the root node log, labeling that record with the name of the snapshot for humans as well as the object-ID that btrfs uses. That root node is by definition a record of the filesystem in a consistent state, so any snapshot that's a reference to it is similarly by definition in a consistent state.

So normally, files in the process of being written out (created) simply wouldn't appear in the snapshot. Of course preexisting files will appear (and fallocated files are simply the blanked-out special case of preexisting), but again, with normal COW-based files at least, will exist in a state either before the latest transaction started, or after it finished, which of course is where fsync comes in, since that's how userspace apps communicate file transactions to the filesystem. And of course in addition to COW, btrfs normally does checksumming as well, and again, the filesystem including that checksumming will be self-consistent when a root-node is written, or it won't be written until the filesystem /is/ self-consistent. 
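The root-node history described above can be exercised with the btrfs-find-root and btrfs restore tools. This is only a sketch: the device and target path are examples, the bytenr shown stands in for whatever btrfs-find-root actually reports, and the filesystem must be unmounted.

```shell
#!/bin/sh
# Sketch: walk the root node history of a damaged, UNMOUNTED btrfs and
# copy files out from an older consistent root.
# /dev/sdb1 and /mnt/recovered are example names; 123456789 is a
# placeholder for a root bytenr reported by btrfs-find-root.
btrfs-find-root /dev/sdb1                               # list candidate roots
btrfs restore -D -t 123456789 /dev/sdb1 /mnt/recovered  # dry run, list files
btrfs restore -t 123456789 /dev/sdb1 /mnt/recovered     # copy files out
```

Each candidate root corresponds to one of those ~30-second commit points, which is why recovery this way loses at most the activity since the last good commit.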
If for whatever reason there's garbage when btrfs attempts to read the data back -- and data that doesn't pass its checksum is exactly what btrfs defines as garbage -- btrfs will refuse to use that data. If there's a second copy somewhere (as with raid1 mode), it'll try to restore from that second copy. If it can't, btrfs will return an error and simply won't let you access that file. So one way or another, a snapshot is deterministic and atomic. No partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files, where for btrfs NOCOW also turns off checksumming. Btrfs will write these files in-place, and as a result there's not the transaction integrity guarantee on these files that there is on ordinary files. *HOWEVER*, the situation isn't as bad as it might seem, because most files where NOCOW is recommended -- database files, VM images, pre-allocated torrent files, etc -- are created and managed by applications that already have their own data integrity management/verification/repair methods, since they're designed to work on filesystems without the data integrity guarantees btrfs normally provides. In fact, it's possible, even likely in case of a crash, that the application's own data integrity mechanisms can fight with those
Re: Are nocow files snapshot-aware
Duncan 1i5t5.dun...@cox.net schrieb: Ah okay, that makes it clear. So, actually, in the snapshot the file is still nocow - just with the exception that blocks being written to become unshared and relocated. This may introduce a lot of fragmentation, but it won't become worse when rewriting the same blocks over and over again. That also explains the report of a NOCOW VM-image still triggering the snapshot-aware-defrag-related pathology. It was a _heavily_ auto-snapshotted btrfs (thousands of snapshots, something like every 30 seconds or more frequent, without thinning them down right away), and the continuing VM writes would nearly guarantee that many of those snapshots had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW at all!

The question here is: Does it really make sense to create such snapshots of disk images currently online and running a system. They will probably be broken anyway after rollback - or at least I'd not fully trust the contents. VM images should not be part of a subvolume of which snapshots are taken at a regular and short interval. The problem will go away if you follow this rule. The same applies to probably any kind of file which you make nocow - e.g. database files. Most of those files implement their own transaction protection or COW scheme, e.g. look at InnoDB files. They neither gain anything from IO schedulers (because InnoDB internally does block sorting and prioritizing and knows better; doing otherwise even hurts performance), nor from file system semantics like COW (because it does its own transactions and atomic updates and can probably do better for its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or btrfs images on btrfs). Snapshots can only do harm here (the only protection use case would be to have a backup, but snapshots are no backups), and COW will probably hurt performance a lot. 
The only use case is taking _controlled_ snapshots - and doing it every 30 seconds is by all means NOT controlled, it's completely nondeterministic.
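For completeness, the nocow attribute mentioned here is set with chattr. It only takes effect for empty files, so the usual approach is to set it on a directory before the database/VM files are created, so they inherit it. The path below is just an example.

```shell
#!/bin/sh
# Sketch: mark a directory NOCOW on btrfs so that files created inside
# it afterwards are written in-place (no COW, and hence no checksums).
# /var/lib/mysql-nocow is an example path; requires root on btrfs.
mkdir -p /var/lib/mysql-nocow
chattr +C /var/lib/mysql-nocow    # 'C' = No_COW attribute
lsattr -d /var/lib/mysql-nocow    # verify: 'C' appears in the attribute list
```

Setting +C on an already-written file does not reliably convert it; the file should be copied into the NOCOW directory so the new copy is created with the attribute.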
Re: Are nocow files snapshot-aware
On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote: Duncan 1i5t5.dun...@cox.net schrieb: Ah okay, that makes it clear. So, actually, in the snapshot the file is still nocow - just for the exception that blocks being written to become unshared and relocated. This may introduce a lot of fragmentation but it won't become worse when rewriting the same blocks over and over again. That also explains the report of a NOCOW VM-image still triggering the snapshot-aware-defrag-related pathology. It was a _heavily_ auto- snapshotted btrfs (thousands of snapshots, something like every 30 seconds or more frequent, without thinning them down right away), and the continuing VM writes would nearly guarantee that many of those snapshots had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW at all! The question here is: Does it really make sense to create such snapshots of disk images currently online and running a system. They will probably be broken anyway after rollback - or at least I'd not fully trust the contents. VM images should not be part of a subvolume of which snapshots are taken at a regular and short interval. The problem will go away if you follow this rule. The same applies to probably any kind of file which you make nocow - e.g. database files. Most of those file implement their own way of transaction protection or COW system, e.g. look at InnoDB files. Neither they gain anything from using IO schedulers (because InnoDB internally does block sorting and prioritizing and knows better, doing otherwise even hurts performance), nor they gain from file system semantics like COW (because it does its own transactions and atomic updates and probably can do better for its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or btrfs images on btrfs). Snapshots can only do harm here (the only protection use case would be to have a backup, but snapshots are no backups), and COW will probably hurt performance a lot. 
The only use case is taking _controlled_ snapshots - and doing it every 30 seconds is by all means NOT controlled, it's completely nondeterministic.

If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful.
Re: Are nocow files snapshot-aware
On Feb 6, 2014, at 6:01 PM, cwillu cwi...@cwillu.com wrote: On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote: Duncan 1i5t5.dun...@cox.net schrieb: Ah okay, that makes it clear. So, actually, in the snapshot the file is still nocow - just for the exception that blocks being written to become unshared and relocated. This may introduce a lot of fragmentation but it won't become worse when rewriting the same blocks over and over again. That also explains the report of a NOCOW VM-image still triggering the snapshot-aware-defrag-related pathology. It was a _heavily_ auto- snapshotted btrfs (thousands of snapshots, something like every 30 seconds or more frequent, without thinning them down right away), and the continuing VM writes would nearly guarantee that many of those snapshots had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW at all! The question here is: Does it really make sense to create such snapshots of disk images currently online and running a system. They will probably be broken anyway after rollback - or at least I'd not fully trust the contents. VM images should not be part of a subvolume of which snapshots are taken at a regular and short interval. The problem will go away if you follow this rule. The same applies to probably any kind of file which you make nocow - e.g. database files. Most of those file implement their own way of transaction protection or COW system, e.g. look at InnoDB files. Neither they gain anything from using IO schedulers (because InnoDB internally does block sorting and prioritizing and knows better, doing otherwise even hurts performance), nor they gain from file system semantics like COW (because it does its own transactions and atomic updates and probably can do better for its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or btrfs images on btrfs). 
Snapshots can only do harm here (the only protection use case would be to have a backup, but snapshots are no backups), and COW will probably hurt performance a lot. The only use case is taking _controlled_ snapshots - and doing it every 30 seconds is by all means NOT controlled, it's completely nondeterministic. If the database/virtual machine/whatever is crash safe, then the atomic state that a snapshot grabs will be useful.

How fast is this state fixed on disk from the time of the snapshot command? Loosely speaking. I'm curious if this is 1 second; a few seconds; or possibly up to the 30 second default commit interval? And also if it's even related to the commit interval time at all? I'm also curious what happens to files that are presently writing. e.g. I'm writing a 1GB file to subvol A and before it completes I snapshot subvol A into A.1. If I go find the file I was writing to, in A.1, what's its state? Truncated? Or are in-progress writes permitted to complete if it's a rw snapshot? Any difference in behavior if it's an ro snapshot? Chris Murphy
Re: Are nocow files snapshot-aware
Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted: Duncan 1i5t5.dun...@cox.net schrieb: That also explains the report of a NOCOW VM-image still triggering the snapshot-aware-defrag-related pathology. It was a _heavily_ auto- snapshotted btrfs (thousands of snapshots, something like every 30 seconds or more frequent, without thinning them down right away), and the continuing VM writes would nearly guarantee that many of those snapshots had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW at all! The question here is: Does it really make sense to create such snapshots of disk images currently online and running a system. They will probably be broken anyway after rollback - or at least I'd not fully trust the contents. VM images should not be part of a subvolume of which snapshots are taken at a regular and short interval. The problem will go away if you follow this rule. The same applies to probably any kind of file which you make nocow - e.g. database files. The only use case is taking _controlled_ snapshots - and doing it all 30 seconds is by all means NOT controlled, it's completely undeterministic. I'd absolutely agree -- and that wasn't my report, I'm just recalling it, as at the time I didn't understand the interaction between NOCOW and snapshots and couldn't quite understand how a NOCOW file was still triggering the snapshot-aware-defrag pathology, which in fact we were just beginning to realize based on such reports. In fact at the time I assumed it was because the NOCOW had been added after the file was originally written, such that btrfs couldn't NOCOW it properly. That still might have been the case, but now that I understand the interaction between snapshots and NOCOW, I see that such heavy snapshotting on an actively written VM could trigger the same issue, even if the NOCOW file was created properly and was indeed NOCOW when content was actually first written into it. But definitely agreed. 
30-second snapshotting, with a 30-second commit deadline, is pretty much off the deep end regardless of the content. I'd even argue that 1-minute snapshotting, without snapshots thinned down to say 5- or 10-minute snapshots after say an hour, is too extreme to be practical. Even a couple days of that, and how are you going to manage the thousands of snapshots, or know which precise snapshot to roll back to if you had to?

That's why, in the example I posted here some days ago -- which I considered toward the extreme end of practical -- IIRC I had it do 1-minute snapshots but thin them down to 5 or 10 minutes after a couple hours, and to half an hour after a couple days, with something like 90-day snapshots out to a decade. Even that I considered extreme, altho at least reasonably so. But the point was, even with something as extreme as 1-minute snapshots at first and a decade of snapshots kept, with reasonable thinning it was still very manageable -- something like 250 snapshots total, well below the thousands or tens of thousands we're sometimes seeing in reports.

Those numbers are hardly practical no matter how you slice it. How likely are you to know the exact minute to roll back to, even a month out? And even if you do, if you can survive a month before detecting the problem, how important is rolling back to precisely the last minute before it actually going to be? At a month out, perhaps the hour, but the minute?

But some of the snapshotting scripts out there, and the admins running them, seem to have the idea that just because it's possible it must be done, and they have snapshots taken every minute or more frequently, with no automated snapshot thinning at all. IMO that's pathology run amok, even if btrfs /was/ stable and mature and /could/ handle it properly. That's regardless of the content, so it's from a different angle than you were attacking the problem from...
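To make the arithmetic concrete, here is a small sketch of one possible thinning schedule along the lines described above. The exact tier boundaries below are illustrative assumptions, not Duncan's precise numbers; the point is only that aggressive first-hour snapshots plus thinning still land in the low hundreds of snapshots, not thousands:

```python
# Illustrative snapshot-thinning schedule: (interval_minutes, keep_until_minutes).
# Tier boundaries are assumptions chosen to mirror the example in the post.
MIN_PER_HOUR = 60
MIN_PER_DAY = 24 * MIN_PER_HOUR

tiers = [
    (1, 1 * MIN_PER_HOUR),                    # 1-minute snapshots for the first hour
    (30, 1 * MIN_PER_DAY),                    # thinned to half-hourly out to a day
    (MIN_PER_DAY, 90 * MIN_PER_DAY),          # daily out to 90 days
    (90 * MIN_PER_DAY, 3650 * MIN_PER_DAY),   # 90-day snapshots out to ~a decade
]

total = 0
prev_end = 0
for interval, end in tiers:
    total += (end - prev_end) // interval     # snapshots retained in this tier
    prev_end = end

print(total)  # roughly the "something like 250 total" ballpark
```

With these particular tiers the total comes out in the low 200s -- manageable, versus the tens of thousands an unthinned per-minute schedule accumulates in under a month.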
But if admins aren't able to recognize the problem with per-minute snapshots without any thinning at all for days, weeks, months on end, I doubt they'll be any better at recognizing that VMs, databases, etc, should have a dedicated subvolume.

Taking the long view, with a bit of luck we'll get to the point where database and VM setup scripts and/or documentation recommend setting NOCOW on the directory the VMs/DBs/etc will be in. But in practice even that's pushing it, and it will take some time (2-5 years) as btrfs stabilizes and mainstreams, taking over from ext4 as the assumed Linux default. Other than that, I guess it'll be a case-by-case basis as people report problems here.

But with a snapshot-aware-defrag that actually scales, hopefully there won't be so many people reporting problems. True, they might not have the best optimized system and may have some minor pathologies in their admin practices, but as long as they remain /minor/ pathologies because btrfs can deal with them better than it does now thus keeping them from
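A minimal sketch of the kind of setup such scripts or documentation could recommend. The paths here are examples only, and note that the C attribute must be set on the (still empty) directory before any files are created in it, since setting +C on an existing non-empty file is not guaranteed to behave as expected:

```shell
# Create a dedicated subvolume for VM images, so routine snapshots of the
# parent subvolume never include them (example path, adjust to taste).
btrfs subvolume create /var/lib/libvirt/images

# Mark the still-empty directory NOCOW; new files created inside it
# inherit the attribute.
chattr +C /var/lib/libvirt/images

# Verify: lsattr -d shows the 'C' flag on the directory itself.
lsattr -d /var/lib/libvirt/images
```

The same pattern applies to database data directories -- a dedicated subvolume keeps them out of the regular snapshot rotation, and the inherited NOCOW attribute avoids the rewrite-heavy fragmentation pathology.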
Re: Are nocow files snapshot-aware
David Sterba <dste...@suse.cz> schrieb:

> On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
>> On 02/04/2014 03:52 PM, Kai Krakow wrote:
>>> Hi! I'm curious... The whole snapshot thing on btrfs is based on its
>>> COW design. But you can make individual files and directory contents
>>> nocow by applying the C attribute on it using chattr. This is usually
>>> recommended for database files and VM images. So far, so good...
>>>
>>> But what happens to such files when they are part of a snapshot? Do
>>> they become duplicated during the snapshot? Do they become unshared
>>> (as a whole) when written to? Or when the parent snapshot becomes
>>> deleted? Or maybe the nocow attribute is just ignored after a snapshot
>>> was taken? After all they are nocow and thus would be handled in
>>> another way when snapshotted.
>>
>> When snapshotted nocow files fallback to normal cow behaviour.
>
> This may seem unclear to people not familiar with the actual
> implementation, and I had to think for a second about that sentence. The
> file will keep the NOCOW status, but any modified blocks will be newly
> allocated on the first write (in a COW manner), then the block location
> will not change anymore (unlike ordinary COW).

Ah okay, that makes it clear. So, actually, in the snapshot the file is still nocow - just with the exception that blocks being written to become unshared and relocated. This may introduce a lot of fragmentation, but it won't become worse when rewriting the same blocks over and over again.

> HTH

Yes, it does. ;-)

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are nocow files snapshot-aware
Kai Krakow posted on Wed, 05 Feb 2014 19:17:10 +0100 as excerpted:

> David Sterba <dste...@suse.cz> schrieb:
>
>> On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
>>> On 02/04/2014 03:52 PM, Kai Krakow wrote:
>>>> Hi! I'm curious... The whole snapshot thing on btrfs is based on its
>>>> COW design. But you can make individual files and directory contents
>>>> nocow by applying the C attribute on it using chattr. This is usually
>>>> recommended for database files and VM images. So far, so good...
>>>>
>>>> But what happens to such files when they are part of a snapshot? Do
>>>> they become duplicated during the snapshot? Do they become unshared
>>>> (as a whole) when written to? Or when the parent snapshot becomes
>>>> deleted? Or maybe the nocow attribute is just ignored after a
>>>> snapshot was taken?
>>>
>>> When snapshotted nocow files fallback to normal cow behaviour.
>>
>> This may seem unclear to people not familiar with the actual
>> implementation, and I had to think for a second about that sentence.
>> The file will keep the NOCOW status, but any modified blocks will be
>> newly allocated on the first write (in a COW manner), then the block
>> location will not change anymore (unlike ordinary COW).
>
> Ah okay, that makes it clear. So, actually, in the snapshot the file is
> still nocow - just with the exception that blocks being written to
> become unshared and relocated. This may introduce a lot of
> fragmentation, but it won't become worse when rewriting the same blocks
> over and over again.

That also explains the report of a NOCOW VM-image still triggering the snapshot-aware-defrag-related pathology. It was a _heavily_ auto-snapshotted btrfs (thousands of snapshots, something like every 30 seconds or more frequent, without thinning them down right away), and the continuing VM writes would nearly guarantee that many of those snapshots had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW at all!

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Are nocow files snapshot-aware
Hi! I'm curious... The whole snapshot thing on btrfs is based on its COW design. But you can make individual files and directory contents nocow by applying the C attribute on it using chattr. This is usually recommended for database files and VM images. So far, so good...

But what happens to such files when they are part of a snapshot? Do they become duplicated during the snapshot? Do they become unshared (as a whole) when written to? Or when the parent snapshot becomes deleted? Or maybe the nocow attribute is just ignored after a snapshot was taken? After all they are nocow and thus would be handled in another way when snapshotted.

-- 
Replies to list only preferred.
Re: Are nocow files snapshot-aware
On 02/04/2014 03:52 PM, Kai Krakow wrote:

> Hi! I'm curious... The whole snapshot thing on btrfs is based on its COW
> design. But you can make individual files and directory contents nocow
> by applying the C attribute on it using chattr. This is usually
> recommended for database files and VM images. So far, so good...
>
> But what happens to such files when they are part of a snapshot? Do they
> become duplicated during the snapshot? Do they become unshared (as a
> whole) when written to? Or when the parent snapshot becomes deleted? Or
> maybe the nocow attribute is just ignored after a snapshot was taken?
> After all they are nocow and thus would be handled in another way when
> snapshotted.

When snapshotted nocow files fallback to normal cow behaviour.

Thanks,
Josef
Re: Are nocow files snapshot-aware
On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:

> On 02/04/2014 03:52 PM, Kai Krakow wrote:
>> Hi! I'm curious... The whole snapshot thing on btrfs is based on its
>> COW design. But you can make individual files and directory contents
>> nocow by applying the C attribute on it using chattr. This is usually
>> recommended for database files and VM images. So far, so good...
>>
>> But what happens to such files when they are part of a snapshot? Do
>> they become duplicated during the snapshot? Do they become unshared (as
>> a whole) when written to? Or when the parent snapshot becomes deleted?
>> Or maybe the nocow attribute is just ignored after a snapshot was
>> taken? After all they are nocow and thus would be handled in another
>> way when snapshotted.
>
> When snapshotted nocow files fallback to normal cow behaviour.

This may seem unclear to people not familiar with the actual implementation, and I had to think for a second about that sentence. The file will keep the NOCOW status, but any modified blocks will be newly allocated on the first write (in a COW manner); then the block location will not change anymore (unlike ordinary COW).

HTH
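The COW-once-then-stable behaviour described above can be sketched with a toy model. This is not btrfs code -- the class, fields, and allocation scheme below are illustrative assumptions -- but it captures the semantics: a block shared with a snapshot is relocated on the first write only, and every later write to that block lands in the same place:

```python
# Toy model of a NOCOW file whose blocks are shared with a snapshot.
# All names here are illustrative, not actual btrfs structures.
class NocowFile:
    def __init__(self, nblocks):
        self.location = list(range(nblocks))  # physical slot per logical block
        self.shared = [True] * nblocks        # initially all shared with the snapshot
        self.next_free = nblocks              # naive bump allocator for new slots

    def write(self, block):
        if self.shared[block]:
            # First write after the snapshot: COW once to a new location,
            # unsharing the block from the snapshot.
            self.location[block] = self.next_free
            self.next_free += 1
            self.shared[block] = False
        # Subsequent writes rewrite the same location in place (NOCOW).
        return self.location[block]

f = NocowFile(4)
first = f.write(2)   # relocates block 2 away from its shared slot
second = f.write(2)  # rewrites in place: same physical location
print(first != 2, first == second)
```

This also shows why the fragmentation is bounded: each snapshotted block is relocated at most once, so rewriting the same blocks over and over does not keep scattering them further, unlike ordinary COW.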