John Navitsky posted on Mon, 10 Feb 2014 07:35:32 -0800 as excerpted: [I rearranged your upside-down posting so the reply comes in context after the quote.]
> On 2/8/2014 10:36 AM, John Navitsky wrote:
>> I have a large file system that has been growing.  We've resized it a
>> couple of times with the following approach:
>>
>>   lvextend -L +800G /dev/raid/virtual_machines
>>   btrfs filesystem resize +800G /vms
>>
>> I think the FS started out at 200G, we increased it by 200GB a time or
>> two, then by 800GB, and everything worked fine.
>>
>> The filesystem hosts a number of virtual machines, so the file system
>> is in use, although the VMs individually tend not to be overly active.
>>
>> VMs tend to be in subvolumes, and some of those subvolumes have
>> snapshots.
>>
>> This time, I increased it by another 800GB, and it has hung for many
>> hours (overnight) with flush-btrfs-4 near 100% CPU all that time.
>>
>> I'm not clear at this point whether it will finish or where to go
>> from here.
>>
>> Any pointers would be much appreciated.
>
> As a follow-up, at some point over the weekend things did finish on
> their own:
>
>   romulus:/vms/johnn-sles11sp3 # df -h /vms
>   Filesystem      Size  Used Avail Use% Mounted on
>   /dev/dm-4       2.6T  1.6T  1.1T  60% /vms
>   romulus:/vms/johnn-sles11sp3 #
>
> I'd still be interested in any comments about what was going on or
> suggestions.

I'm guessing you don't have the VM images set NOCOW (no-copy-on-write), which means over time they'll **HEAVILY** fragment, since every time something changes in the image and is written back to the file, that block is written somewhere else due to COW.  We've had reports of hundreds of thousands of extents in VM files of only a few gigs!

It's also worth noting that while NOCOW does normally mean in-place writes, a change after a snapshot means unsharing the data, since the snapshotted data has now diverged.  That forces a mandatory one-shot COW, in order to keep the new change from overwriting the old snapshot version.
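If you want to see how bad it's gotten, filefrag (from e2fsprogs) reports a file's extent count.  A quick sketch, where the image path is just an example:

```shell
# Report extent counts; a badly COW-fragmented VM image can show
# hundreds of thousands of extents (the /vms path is an example).
filefrag /vms/guest1.img

# For comparison, a freshly written, sequential file stays in very few
# extents (works on any FIEMAP-capable filesystem, e.g. ext4 or btrfs):
dd if=/dev/zero of=/tmp/demo.img bs=1M count=16 status=none
filefrag /tmp/demo.img
rm -f /tmp/demo.img
```

Note that filefrag counts on btrfs can slightly over-report for compressed files, but as a rough fragmentation gauge it's fine.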
That of course triggers fragmentation too, since everything that changes in the image between snapshots must be written elsewhere, although the fragmentation won't be nearly as fast as in the default COW mode.

So what was very likely taking the time was tracking down all those potentially hundreds of thousands of fragments/extents in order to rewrite the files, as triggered by the size increase and, presumably, the change in physical location on-device.

I'd strongly suggest that you set all VMs NOCOW (chattr +C).  However, there's a wrinkle: to be effective on btrfs, NOCOW must be set on a file while it is still zero-size, before any data has been written to it.  The easiest way to do that is to set NOCOW on the directory.  That doesn't really affect the directory itself, but it DOES cause all new files (and subdirs, so it nests) created in that directory to inherit the NOCOW attribute.

Then copy the file in, preferably either by catting it in with redirection to create/write the file, or by copying it from another filesystem, such that you know it's actually copying the data and not simply reflinking it.  That ensures the new copy really is a new copy, so the NOCOW will actually take effect.

By organizing your VM images into dirs, all with NOCOW set so the images inherit it at creation, you'll save yourself the fragmentation of the repeated COW writes.  However, as I mentioned, the first time a block is written after a snapshot it's still a COW write, unavoidably so.  Thus, I'd suggest keeping btrfs snapshots of your VMs to a minimum (preferably zero), using ordinary full-copy backups to other media instead, thus avoiding that first-COW-after-snapshot effect too.

Meanwhile, it's worth noting that if a file is written sequentially (append-only) and not written "into", as will typically be the case with the VM backups, there's nothing to trigger fragmentation.  So the backups don't have to be NOCOW, since they'll be written once and left alone.
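The directory recipe above looks something like this.  Paths are examples, and it has to run on a btrfs mount (the C attribute is btrfs-specific), so treat it as a sketch rather than copy-paste:

```shell
# Create a directory for VM images and mark it NOCOW.  New files
# created inside it will inherit the attribute (paths are examples).
mkdir -p /vms/images
chattr +C /vms/images
lsattr -d /vms/images            # the flags should now include 'C'

# Bring an existing image in with cat + redirection, so the file is
# created fresh (zero-size) inside the NOCOW dir and the data is
# really copied, not reflinked:
cat /backup/guest1.img > /vms/images/guest1.img
lsattr /vms/images/guest1.img    # the new copy should show 'C' too
```

A plain `cp` on the same btrfs filesystem may reflink the extents instead of copying them, which is exactly what you're trying to avoid here, hence the cat-with-redirection.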
But the actively in-use, and thus often written-to, operational VM images should be NOCOW, and preferably not snapshotted, to keep fragmentation to a minimum.

Finally, of course you can use btrfs defrag to deal with the problem manually.  However, do note that the snapshot-aware defrag introduced with kernel 3.9 simply does NOT scale well once the number of snapshots nears 1000, and the snapshot-awareness has just been disabled again (in kernel 3.14-rc) until the code can be reworked to scale better.  So if you /are/ using snapshots and trying to work with defrag, you'll want a very new 3.14-rc kernel in order to avoid that problem.  Avoiding it does come at a cost, though: the non-snapshot-aware version will tend to create separate copies of the data on each snapshot it is run on, decreasing shared data blocks and increasing space usage, perhaps dramatically.

So again, at least for now, and at least for large (half-gig or larger) VM images and other "internal-write" files such as databases, etc., I'd suggest NOCOW, don't snapshot, and back up to a separate filesystem instead.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html