Rich Freeman posted on Thu, 20 Mar 2014 22:13:51 -0400 as excerpted:

> However, I am [deleting] my snapshots one at a time at a rate of one
> every 5-30 minutes, and while that is creating surprisingly high disk
> loads on my ssd and hard drives, I don't get any panics. I figured
> that having only one deletion pending per checkpoint would eliminate
> locking risk.
>
> I did get some blocked task messages in dmesg, like:
> [105538.121239] INFO: task mysqld:3006 blocked for more than 120
> seconds.
These... are a continuing issue. The devs are working on it, but...

The people that seem to have it the worst are doing both scripted
snapshotting and large (gig+) constantly internally-rewritten files such
as VM images (the most commonly reported case) or databases. Properly
setting NOCOW on the files[1] helps, but...

* The key thing to realize about snapshotting continually rewritten
NOCOW files is that the first change to a block after a snapshot by
definition MUST be COWed anyway, since the file content has changed from
that of the snapshot. Further writes to the same block (until the next
snapshot) will be rewritten in-place (the existing NOCOW attribute is
maintained thru that mandatory COW), but next snapshot and following
write, BAM! gotta COW again!

So while NOCOW helps, in scenarios such as hourly snapshotting of active
VM-image data loads its ability to control actual fragmentation is
unfortunately rather limited. And it's precisely this fragmentation that
appears to be the problem! =:^(

It's almost certainly that fragmentation that's triggering your
blocked-for-X-seconds issues. But the interesting thing here is the
reports even from people with fast SSDs, where seek time and even IOPS
shouldn't be a huge issue. In at least some cases, the problem has been
CPU time, not physical media access.

Which is one reason the snapshot-aware defrag was disabled again
recently: it simply wasn't scaling. (To answer the question, yes, defrag
still works; it's only the snapshot-awareness that was disabled. Defrag
is back to dumbly ignoring other snapshots and simply defragging the
working file-extent-mapping the defrag is being run on, with other
snapshots staying untouched.) They're reworking the whole feature now in
order to scale better.
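As footnote [1] below explains, the C attribute only takes effect if it's
set before the file has any content, so the usual trick is to set it on a
dedicated directory and let files inherit it. A minimal sketch (the paths
are just examples, and this of course assumes the directory lives on a
btrfs filesystem):

```shell
# Create a dedicated directory for the big internally-rewritten files
# and mark it NOCOW; files created inside inherit the attribute at
# creation time, while they're still zero-size.
mkdir -p /var/lib/vm-images
chattr +C /var/lib/vm-images

# New files created here pick up the C (No_COW) flag automatically;
# lsattr will show it. Copying an existing image in with cp also works,
# since cp creates the destination file empty before writing data.
touch /var/lib/vm-images/disk0.img
lsattr /var/lib/vm-images/disk0.img
```

Note that chattr +C on a file that already has data is NOT reliable; the
existing extents stay COW, which is why the directory-inheritance
approach is the recommended one.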
But while that considerably reduces the pain point (people were seeing
little or no defrag/balance/restripe progress in /hours/ if they had
enough snapshots, and that problem has been bypassed for the moment),
we're still left with these nasty N-second stalls at times, especially
when doing anything else involving those snapshots and the corresponding
fragmentation they cover, including deleting them. Hopefully tweaking
and eventually optimizing the algorithms can do away with much of this
problem, but I've a feeling it'll be around to some degree for some
years.

Meanwhile, for data that fits that known problematic profile, the
current recommendation is, preferably, to isolate it to a subvolume that
has only very limited or no snapshotting done.

The other alternative, of course, since NOCOW already turns off many of
the features a lot of people are using btrfs for in the first place
(checksumming and compression are disabled with NOCOW as well, tho it
turns out they're not so well suited to VM images in the first place),
is, given the subvolume isolation already, to just stick it on an
entirely different filesystem: either btrfs with the nodatacow mount
option, or arguably something a bit more traditional and mature such as
ext4 or xfs, where xfs of course is actually targeted at large-to-huge-
file use-cases, so multi-gig VMs should be an ideal fit.

Of course you lose the benefits of btrfs doing that, but given its COW
nature, btrfs arguably isn't the ideal solution for such huge
internal-write files in the first place, and even when fully mature will
likely only have /acceptable/ performance with them, as suitable for use
as a general-purpose filesystem, with xfs or similar still likely being
a better dedicated filesystem for such use-cases.
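One way to set up that isolation, sketched with example paths and names
(this assumes a btrfs filesystem mounted at /mnt with the usual data
under /mnt/data):

```shell
# Put the VM images on their own subvolume, so snapshots taken of the
# rest of the filesystem never include them. Subvolume boundaries stop
# snapshots: snapshotting /mnt/data does not descend into /mnt/vm-images.
btrfs subvolume create /mnt/vm-images
chattr +C /mnt/vm-images   # new files created inside inherit NOCOW

# Snapshot the main data as often as you like; the VM-image subvolume
# stays out of it, so its extents never get pinned by these snapshots.
btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-$(date +%F)
```

The nodatacow mount option mentioned above is the filesystem-wide
version of the same thing, so it only makes sense on a separate btrfs
filesystem dedicated to this kind of data.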
Meanwhile, I think everyone agrees that getting that locking down to
avoid the deadlocks, etc, really must be priority one, at least now that
the huge scaling blocker of snapshot-aware defrag is (hopefully
temporarily) disabled. Blocking for a couple minutes at a time certainly
isn't ideal, but since the triggering jobs such as snapshot deletion,
etc, can be rescheduled to otherwise idle time, that's certainly less
critical than crashes if people accidentally or in ignorance queue up
too many snapshot deletions at a time!

---
[1] NOCOW: chattr +C . With btrfs, this should be done while the file is
zero-size, before it has content. The easiest way to do that is to
create a dedicated directory for these files and set the attribute on
the directory, such that the files inherit it at file creation.

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html