Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:

> Hi,
>
> On a desktop equipped with an ssd with one 100GB virtual image used
> frequently, what do you recommend?
> 1) nothing special, it is all fine as long as you have a recent kernel
> (which I do)
> 2) Disabling copy-on-write for just the VM image directory.
> 3) autodefrag as a mount option.
> 4) something else.
>
> I don't think this usecase is well documented therefore I asked this
> question.
You are correct. The VM images on ssd use-case /isn't/ particularly well documented, I'd guess because people have differing opinions, and indeed actual observed behavior, and thus recommendations, even in the ideal case, may well differ depending on the specs and firmware of the ssd. The documentation tends to be aimed at the spinning rust case.

There's one detail of the use-case (besides ssd specs), however, that you didn't mention, that could have a big impact on the recommendation: what sort of btrfs snapshotting are you planning to do, and if you're doing snapshots, does your use-case really need them to include the VM image file?

Snapshots are a big issue for anything that you might set nocow, because snapshot functionality assumes and requires cow, and thus conflicts, to some extent, with nocow.

A snapshot locks the existing extents in place, so they can no longer be modified. On a normal btrfs cow-based file that's not an issue, since any modifications would be cowed elsewhere anyway -- that's how btrfs normally works. On a nocow file, however, there's a problem, because once the snapshot locks the existing version in place, the first change to a specific block (normally 4 KiB) *MUST* be cowed, despite the nocow attribute, because rewriting in-place would alter the snapshot. The nocow attribute remains in place, however, and further writes to the same block will again be nocow... to the new block location established by that first post-snapshot write... until the next snapshot comes along and locks that in place too, of course. This sort of cow-only-once behavior is sometimes called cow1.

If you only do very occasional snapshots, probably manually, this cow1 behavior isn't /so/ bad, tho the file will still fragment over time as more and more bits of it are written and rewritten after the few snapshots that are taken.
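For reference, the nocow attribute is set with chattr's C flag, and it only takes reliable effect on files created *after* the attribute is set on their directory, so the usual approach is something like the following (the path is purely illustrative):

```shell
# Set nocow on a new/empty directory, so files created in it inherit it.
# /var/lib/images is an illustrative path, not a recommendation.
mkdir -p /var/lib/images
chattr +C /var/lib/images

# Files created in the directory from now on inherit nocow:
touch /var/lib/images/vm.img
lsattr /var/lib/images/vm.img    # the C flag should show in the output
```

Note that setting +C on an already-written file doesn't reliably work on btrfs; the attribute needs to be in place before the file's first data is written, which is why it's normally set on the directory first.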
However, for people doing frequent, generally schedule-automated snapshots, the nocow attribute is effectively nullified, as all those snapshots force cow1s over and over again. So ssd or spinning rust, there are serious conflicts between nocow and snapshotting that really must be taken into consideration if you're planning to both snapshot and nocow.

For use-cases that don't require snapshotting of the nocow files, the simplest workaround is to put any nocow files on dedicated subvolumes. Since snapshots stop at subvolume boundaries, having nocow files on dedicated subvolume(s) stops snapshots of the parent from including them, thus avoiding the cow1 situation entirely.

If the use-case requires snapshotting of nocow files, the workaround that has been reported to work (mostly on spinning rust, where fragmentation is a far worse problem due to non-zero seek times) is first to reduce snapshotting to a minimum -- if it was going to be hourly, consider daily or every 12 hours, if you can get away with it; if it was going to be daily, consider every other day or weekly. Less snapshotting means fewer cow1s and thus directly affects how quickly fragmentation becomes a problem. Again, dedicated subvolumes can help here, allowing you to snapshot the nocow files on a different schedule than you do the up-hierarchy parent subvolume.

Second, schedule periodic manual defrags of the nocow files, so the fragmentation that does occur is at least kept manageable. If the snapshotting is daily, consider weekly or monthly defrags. If it's weekly, consider monthly or quarterly defrags. Again, various people who do need to snapshot their nocow files have reported that this really does help, keeping fragmentation to at least some sanely managed level.

That's the snapshot vs. nocow problem in general. With luck, however, you can avoid snapshotting the files in question entirely, thus factoring this issue out of the equation. Now to the ssd issue.
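The dedicated-subvolume workaround can be sketched like this (mountpoint and subvolume names here are hypothetical, just to show the shape of it):

```shell
# Create a dedicated subvolume for the nocow images, outside the
# subvolume you snapshot; snapshots stop at subvolume boundaries,
# so snapshots of the parent never include it. Paths are illustrative.
btrfs subvolume create /mnt/btrfs/images
chattr +C /mnt/btrfs/images      # new files created in it inherit nocow

# Snapshotting the parent subvolume now excludes /mnt/btrfs/images:
btrfs subvolume snapshot -r /mnt/btrfs/root \
    /mnt/btrfs/snaps/root-2015-10-17
```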
On ssds in general, there are two major differences we need to consider vs. spinning rust.

One, fragmentation isn't as much of a problem as it is on spinning rust. It's still worth keeping to a minimum, because as the number of fragments increases, so do both btrfs and device overhead, but it's not the nearly everything-overriding consideration that it is on spinning rust.

Two, ssds have a limited write-cycle factor to consider, where with spinning rust the write-cycle limit is effectively infinite... at least compared to the much lower limit of ssds.

The weighing of these two overriding ssd factors against each other, along with the simple fact that ssds are new enough technology, and behavior differs enough between them, that people simply haven't had time to come to agreement yet on best practices, is why recommendations here differ far more than on spinning rust, where fragmentation really is the single most important factor compared to very nearly everything else.

The fact of the matter is, on ssds, people strongly emphasizing the limited write-cycle count will tend not to worry, perhaps at all, about fragmentation, since its negative effects are so much lower on ssds. Those (including me) who emphasize the negative effects fragmentation does retain -- including scaling issues should it get too bad, as well as the harder-to-generalize (because devices and firmwares do differ in major ways here) interaction between the larger erase-block size, sub-erase-block-size fragmentation, and write amplification, which may well trigger more write cycles than the defrag itself would -- still tend to recommend at least taking fragmentation into account, and may even consider autodefrag worth enabling, at least for use-cases with small enough internal-rewrite-pattern files.

So let's address autodefrag...
It's worth noting that I have autodefrag enabled here, on my ssds, and have from the first mount where I put content on them, so it has been enabled for every write on every file. However, it's not ideal in all cases; my use-case simply is one where autodefrag works well, so...

Here's the deal with autodefrag. First of all, if a file isn't constantly being rewritten, or if its rewrite pattern is append-only (like most log files, but *not* systemd journal files!), it doesn't tend to get particularly fragmented in the first place, especially on a filesystem that itself isn't highly fragmented, where free-space blocks tend to be large enough that a file doesn't tend to be fragmented as initially written.

So fragmentation tends to be worst on internal-rewrite-pattern files, where a block here and a block there are rewritten, normally triggering cow on a cow-based filesystem such as btrfs.

But consider that rewriting the entire file to avoid fragmentation, which is what autodefrag does, takes time -- the larger the file, the more time. And at some point, as filesizes increase, rewrites can be coming in faster than the file can be rewritten. So autodefrag works best on internal-rewrite-pattern files (as we've already established), but also on smaller files.

On spinning rust, autodefrag tends to work best at file sizes under 256 MiB, a quarter GiB, where they rewrite fast enough that there are generally no problems at all. But on most spinning rust, people will begin to see performance issues with autodefrag somewhere between half a GiB and 3/4 GiB (512-768 MiB), and nearly everyone on spinning rust reports performance issues at 1 GiB file sizes and larger.

As it happens, this quarter-GiB or so spinning-rust autodefrag limit is close to that of common desktop-only database uses such as the sqlite files firefox and thunderbird use, so this is the use-case for which autodefrag is really recommended and tuned ATM.
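For that desktop use-case, enabling it is just the mount option; something along these lines (device UUID and mountpoint are placeholders):

```shell
# /etc/fstab entry enabling autodefrag on a btrfs mount
# (the UUID and mountpoint below are placeholders):
#
#   UUID=xxxxxxxx-xxxx  /home  btrfs  defaults,ssd,autodefrag  0 0

# Or turn it on for an already-mounted filesystem via remount:
mount -o remount,autodefrag /home
```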
That's really useful, since it means most desktop-only users can simply enable autodefrag and forget about it, as it'll "just work". People optimizing larger databases and GiB+ VM image files, however, are going to need to do rather more detailed optimization, which sucks, but in contrast with normal desktop users, they're generally used to doing various optimization things already, at least to some extent, so at least the problem hits those generally more technically prepared to deal with it.

But that's for spinning rust. On ssds, particularly fast ssds, write speeds tend to be high enough that autodefrag can work effectively with much larger files. The rub, however, is that ssd speeds vary enough, and there are few enough reports from people actually testing autodefrag with larger internal-rewrite-pattern files on ssds, that we don't have nicely wrapped-up numbers for our ssd autodefrag filesize limitation recommendations, as we do for spinning rust.

I'd suggest, based on my own experience and the reports we /do/ have, that on most ssds, autodefrag, provided people are inclined to enable it in the first place (see the above discussion of the two major ssd factors, and how emphasis on one or the other tends to put people in one of two camps regarding even worrying about fragmentation at all on ssds), should work well enough on files up to a gig in size, at least. I wouldn't be surprised to see 2 GiB work fine, particularly on fast ssds, tho I'd guess people will begin to see performance issues at the 4 GiB to 8 GiB size.

You say your image file, while on ssd, is 100 GiB. Please do your own tests and report, as it's possible my EWAG (educated but wild-ass guess) is wrong, but I'm predicting that's well above the good-performance limit for autodefrag, even on ssd.
That said, performance may still be good /enough/ that you can deal with it, if it sufficiently simplifies the situation for you regarding /other/ files, and your balance of use tilts sufficiently toward those other files as opposed to this single very large image file. Tho at 100 GiB, the repeated rewriting of autodefrag is definitely likely to cut into your write-cycle allowance, arguably rather heavily.

So I really can't recommend autodefrag, despite how very much I wish it would work for your case, since it does dramatically simplify things where it works, and you can then simply forget about other alternatives and all their relative complications. Maybe someday they'll optimize it to handle such large files better, but until then, I really don't think it's a good match to your requirements.

So with autodefrag out for that file, and with the previous issues discussed, here are some reasonable options to try.

1) The nothing-special option. With a bit of luck, the 0-seek-time of ssd will mean that the fragmentation you're likely to see won't dramatically affect you, and the "do nothing" option will work acceptably.

The biggest thing I'm worried about here is that fragmentation may well get bad enough that it affects btrfs maintenance times, etc., due to scaling issues. Btrfs balance, scrub, and check could end up taking far longer than you might expect on ssd, and than they'd take were it not for the fragmentation on this single file.

And if you're keeping snapshots around, be aware that simply defragging the file isn't likely to solve the btrfs maintenance-times issue, because while btrfs did have snapshot-aware defrag for a few kernels, it did not scale well *AT* *ALL*, and the snapshot awareness was disabled again until the scaling issues could be worked thru (which they're gradually doing, but it's an exceedingly complex problem, with many sub-issues that must be solved before scaling itself can be considered solved).
So defragging a file that's already highly fragmented in various snapshots of differing ages will defrag it in the subvolume/snapshot you run the defrag in, but won't affect it in the other snapshots, so it isn't likely to do much at all for the overall btrfs maintenance scaling issue. You'd have to delete all those snapshots (or not take them in the first place, if your use-case doesn't require them) to eliminate the scaling issue, if it's due to fragmentation of this file in all those snapshots as well as the working copy.

So watch out for the maintenance scaling (maybe run a scrub and/or read-only check periodically, just to ensure the execution times aren't running away on you), but if it works well enough for you, this is by far the most uncomplicated option.

2) If your use-case doesn't involve snapshotting the image file, setting nocow on the dir before creation of the file, such that the file inherits the nocow, should be a reasonably uncomplicated option. If you do plan on snapshotting the parent but don't actually need to snapshot the nocow subdir and its nocow-inheriting images, then use the dedicated-subvolume trick to keep the image file out of your snapshots and avoid the cow1 complications.

3) Taking the dedicated-subvolume idea even further, consider an entirely separate dedicated filesystem for this image file. That gives you much more flexibility, because then you can, for instance, still set autodefrag on the main filesystem, if it'd be useful there, without worrying about how that huge image file and autodefrag interact. Additionally, that lets you use something other than btrfs for the image file's filesystem, if you want, while still using btrfs for the rest of the system.
If you're nocowing the file, you're already killing many of the features that btrfs generally brings, and provided the additional overhead of managing the separate partition and filesystem isn't too much, you might /as/ /well/ simply use something other than btrfs for that particular file, thus avoiding the whole image-file cowing-complications scenario in the first place.

I'd strongly consider the separate-filesystem option here, as I already use multiple separate filesystems in order to avoid having my data eggs all in the same single filesystem basket (subvolumes don't cut it in terms of separation safety, for me). But some people are far more averse to partitioning and similar solutions, for reasons that aren't entirely clear to me. If you'd prefer to avoid the complexity of managing an entirely separate filesystem just for your image file, fine; just cross this option off your list and don't consider it further.

4) If the "do nothing" option doesn't cut it and your use-case involves snapshotting the image file, then things get much more complex. As mentioned above, the recommendation for this sort of use-case isn't going to give you a simple ideal, but others have reported it to work acceptably, even surprisingly, well, once it's all set up. And if that's the situation on spinning rust, it should be even better on ssd, since the "controlled amount of fragmentation" should be even further within acceptable levels on ssd, with its zero seek times, than it is on spinning rust.

Again, the recommendation for this use-case is to set nocow on the image file's dir so it inherits, and aim for the low end of your acceptable snapshotting frequency range for the image file: weekly instead of daily, or daily instead of hourly.
If necessary, use the separate-subvolume trick to separate the image file from the rest of the content you're snapshotting, so you can use a higher-frequency snapshot schedule on the other stuff, while keeping it as low-frequency as you can manage on the image file.

Then do scheduled periodic targeted defrag of the image file, at a frequency some fraction of the snapshot frequency, perhaps monthly or quarterly for weekly snapshots, etc. Keep in mind that defrag will only affect the working copy, not existing snapshots, but provided you do it at some reasonable fraction of the snapshotting interval, you should reset the fragmentation for further snapshots often enough that it doesn't get out of hand for them, either.

Finally, orthogonal to the original fragmentation question, but particularly important if you /are/ doing scheduled snapshots...

For scheduled snapshots in particular, it's very important that you set up a reasonable snapshot-thinning schedule as well, the object of which should be to keep the number of snapshots as low as possible, again for scaling reasons. At this point, anyway, btrfs maintenance operations simply do /not/ scale well with snapshot numbers in the tens or hundreds of thousands range, as people often find themselves with if they aren't doing scheduled snapshot thinning as well.

With reasonable thinning, it's quite possible to keep per-subvolume snapshots to 250 or so, reasonably under 300, even if starting with an incredibly high snapshot frequency such as every half-hour or even every minute (tho the latter tends to be impractical, because while snapshots are fast, very nearly instantaneous, removing them is rather more complex and definitely not instantaneous!). With 250 snapshots per subvolume, you keep it to 1000 snapshots per filesystem if you're snapshotting four subvolumes, 2000 per filesystem if you're doing eight, etc.
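As a concrete sketch of the periodic-defrag-plus-thinning idea, assuming snapshots are subvolumes named by ISO date under a snapshot directory (all names and paths here are hypothetical), something like the following could run from cron. The demo below is a dry run against plain directories, printing what it would delete; on a real system you'd swap the echo lines for the actual btrfs commands noted in the comments.

```shell
#!/bin/sh
# Sketch only: periodic image defrag plus snapshot thinning.
# On a real system, the commented btrfs commands replace the echoes:
#   btrfs filesystem defragment /mnt/btrfs/images/vm.img
#   btrfs subvolume delete "$dir/$snap"

# Thin date-named snapshots, keeping only the newest $2 of them.
thin_snapshots() {
    dir=$1; keep=$2
    # Lexicographic sort works because the names are ISO dates;
    # "head -n -N" (GNU) drops the last N lines, i.e. the newest.
    ls -1 "$dir" | sort | head -n -"$keep" | while read -r snap; do
        echo "would delete: $dir/$snap"
    done
}

# Demo: plain directories standing in for weekly snapshots.
demo=$(mktemp -d)
for d in 2015-09-26 2015-10-03 2015-10-10 2015-10-17; do
    mkdir "$demo/$d"
done
thin_snapshots "$demo" 2    # lists the two oldest, keeps the two newest
```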
Ideally, you'll target 1000 or fewer, possibly by thinning more drastically on some subvolume snapshots than others, but 2000 or even 3000 isn't out of hand, tho by 2500 to 3000 you'll probably notice increased maintenance times. By 10k snapshots, however, things are starting to go south, and above that, things go unreasonable pretty fast. So do try to keep to "a few thousand, at most" snapshots, or expect btrfs balance and other maintenance tasks to take "unreasonable" amounts of time, should you need to run them. And if you can keep to under 1000, so much the better; your improved maintenance times will reward you for it. =:^)

Also, as you may have already seen, my recommendation for quotas is simply to leave them off on btrfs. They're broken and dramatically increase the scaling issues. You either rely on quotas working or you don't. If you don't, leave them off and avoid the issues. If you do, use a more stable and mature filesystem where they're known to work reliably.

Unless of course you're specifically working with the devs to test, report and trace down quota problems and test possible fixes. In that case, please continue, as it's your tolerance for the present pain that's helping to make the feature actually usable for the rest of us, someday hopefully soon. =:^)

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html