I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi <p...@btrfs.for.sabi.co.uk> wrote:
>
> This is going to be long because I am writing something detailed hoping pointlessly that someone in the future will find it by searching the list archives while doing research before setting up a new storage system, and they will be the kind of person that tolerates reading messages longer than Twitter. :-)
>
>> I’m currently shrinking a device and it seems that the performance of shrink is abysmal.
>
> When I read this kind of statement I am reminded of all the cases where someone left me to decatastrophize a storage system built on "optimistic" assumptions. The usual "optimism" is what I call the "syntactic approach", that is the axiomatic belief that any syntactically valid combination of features not only will "work", but will do so very fast and reliably despite slow cheap hardware and "unattentive" configuration. Some people call that the expectation that system developers provide, or should provide, an "O_PONIES" option. In particular I get very saddened when people use "performance" to mean "speed", as the difference between the two is very great.
>
> As a general consideration, shrinking a large filetree online and in-place is an amazingly risky, difficult, slow operation and should be a last desperate resort (as apparently in this case), regardless of the filesystem type, and expecting otherwise is "optimistic".
>
> My guess is that very complex, risky, slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve.
>
>> I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command.
>
> That is actually a very good idea, because Btrfs multi-device is not quite as reliable as DM/LVM2 multi-device.
>
>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>     Total devices 1 FS bytes used 18.21TiB
>>     devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
>
> Maybe 'balance' should have been used a bit more.
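>
> For the list archives, the usual shape of such a shrink is roughly the following untested sketch. The mount point /srv/backy, the -dusage percentage and the intermediate size are made up for illustration, and the VG/LV names are only as implied by the device path above; the point is to shrink the filesystem below the target first and only reduce the LV afterwards, never the other way around:
>
>   # compact partially used data chunks so "used" drops closer to "FS bytes used"
>   btrfs balance start -dusage=50 /srv/backy
>   # shrink the filesystem somewhat below the eventual LV size
>   btrfs filesystem resize 1:19500G /srv/backy
>   # shrink the logical volume to the real target
>   lvreduce --size 20T vgsys/backy
>   # grow the filesystem back to fill the (now smaller) LV exactly
>   btrfs filesystem resize 1:max /srv/backy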
>
>> This has been running since last Thursday, so roughly 3.5 days now. The “used” number in devid 1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so. Does this sound fishy or normal to you?
>
> With consistent "optimism" this is a request to assess whether the "performance" of some operations is adequate on a filetree without telling us what the filetree contents look like, what the regular workload is, or what the storage layer looks like.
>
> Being one of the few system administrators crippled by lack of psychic powers :-), I rely on guesses and inferences here, having read the whole thread containing some belated details.
>
> From the ~22TB total capacity my guess is that the storage layer involves rotating hard disks, and from later details the filesystem contents seem to be heavily reflinked files of several GB in size, and the workload seems to be backups to those files from several source hosts. Considering the general level of "optimism" in the situation, my wild guess is that the storage layer is based on large, slow, cheap rotating disks in the 4TB-8TB range, with very low IOPS-per-TB.
>
>> Thanks for that info. The 1min per 1GiB is what I saw too - the “it can take longer” wasn’t really explainable to me.
>
> A contemporary rotating disk device can do around 0.5MB/s transfer rate with small random accesses with barriers, up to around 80-160MB/s in purely sequential access without barriers.
>
> 1GB/min of simultaneous read-write means around 16MB/s reads plus 16MB/s writes, which is fairly good *performance* (even if slow *speed*) considering that moving extents around, even across disks, involves quite a bit of randomish same-disk updates of metadata; it all usually depends on how much randomish metadata updating needs to be done, on any filesystem type, as those updates must be done with barriers.
>
>> As I’m not using snapshots: would large files (100+GB)
>
> Using 100GB sized VM virtual disks (never mind with COW) seems very unwise to me to start with, but of course a lot of other people know better :-). Just like a lot of other people know better that large single-pool storage systems are awesome in every respect :-): cost, reliability, speed, flexibility, maintenance, etc.
>
>> with long chains of CoW history (specifically reflink copies) also hurt?
>
> Oh yes... They are about one of the worst cases for using Btrfs. But it is also very "optimistic" to think that kind of stuff can work awesomely on *any* filesystem type.
>
>> Something I’d like to verify: does having traffic on the volume have the potential to delay this infinitely? [ ... ] it’s just slow and we’re looking forward to about 2 months worth of time shrinking this volume. (And then again on the next bigger server probably about 3-4 months).
>
> Those are pretty typical times for whole-filesystem operations like that on rotating disk media. There are some reports in the list and IRC channel archives of 'scrub' or 'balance' or 'check' times for filetrees of that size.
>
>> (Background info: we’re migrating large volumes from btrfs to xfs and can only do this step by step: copying some data, shrinking the btrfs volume, extending the xfs volume, rinse, repeat.
>
> That "extending the xfs volume" will have consequences too, but not too bad hopefully.
>
>> If someone should have any suggestions to speed this up and not having to think in terms of _months_ then I’m all ears.)
>
> High IOPS-per-TB enterprise SSDs with capacitor-backed caches :-).
>
>> One strategy that does come to mind: we’re converting our backup from a system that uses reflinks to a non-reflink based system. We can convert this in place so this would remove all the reflink stuff in the existing filesystem
>
> Do you have enough space to do that? Either your reflinks are pointless or they are saving a lot of storage. But I guess that you can do it one 100GB file at a time...
>
>> and then we maybe can do the FS conversion faster when this isn’t an issue any longer. I think I’ll
>
> I suspect the de-reflinking plus shrinking will take longer, but I am not totally sure.
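>
> If you do go the de-reflinking route, the crude way is to rewrite each image in full and rename it over the original, one file at a time. A hedged sketch with a made-up file name; it needs free space for one full copy of the largest image, and it relies on a plain 'cp' (no --reflink=always) doing a full data copy:
>
>   cp --sparse=always image0.img image0.img.flat   # full data copy, keep holes sparse
>   mv image0.img.flat image0.img                   # old reflinked extents are freed once nothing references them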
>
>> Right. This is an option we can do from a software perspective (our own solution - https://bitbucket.org/flyingcircus/backy)
>
> Many thanks for sharing your system, I'll have a look.
>
>> but our systems in use can’t hold all the data twice. Even though we’re migrating to a backend implementation that uses less data than before, I have to perform an “inplace” migration in some way. This is VM block device backup. So basically we migrate one VM with all its previous data and that works quite fine with a little headroom. However, migrating all VMs to a new “full” backup and then waiting for the old one to shrink would only work if we had a completely empty backup server in place, which we don’t.
>
>> Also: the idea of migrating on btrfs also has its downside - the performance of “mkdir” and “fsync” is abysmal at the moment.
>
> That *performance* is pretty good indeed; it is the *speed* that may be low, but that's obvious. Please consider looking at these entirely typical speeds:
>
> http://www.sabi.co.uk/blog/17-one.html?170302#170302
> http://www.sabi.co.uk/blog/17-one.html?170228#170228
>
>> I’m waiting for the current shrinking job to finish but this is likely limited to the “find free space” algorithm. We’re talking about a few megabytes converted per second. Sigh.
>
> Well, if the filetree is being actively used for COW backups while being shrunk, that involves a lot of randomish IO with barriers.
>
>>> I would only suggest that you reconsider XFS. You can't shrink XFS, therefore you won't have the flexibility to migrate in the same way to anything better that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not perform that much better than Ext4, and very importantly, Ext4 can be shrunk.
>
> ZFS too is a complicated mess with an intensely anisotropic performance envelope, and not necessarily that good for backup archival for various reasons. I would consider looking instead at using a collection of smaller "silo" JFS, F2FS, NILFS2 filetrees as well as XFS, and using MD RAID in RAID10 mode instead of DM/LVM2:
>
> http://www.sabi.co.uk/blog/16-two.html?161217#161217
> http://www.sabi.co.uk/blog/17-one.html?170107#170107
> http://www.sabi.co.uk/blog/12-fou.html?121223#121223
> http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
> http://www.sabi.co.uk/blog/12-fou.html?121218#121218
>
> and yes, Bcachefs looks promising, but I am sticking with Btrfs:
>
> https://lwn.net/Articles/717379
>
>> That is true. However, we have moved the expected feature set of the filesystem (i.e. cow)
>
> That feature set is arguably not appropriate for VM images, but lots of people know better :-).
>
>> down to “store files safely and reliably” and we’ve seen too much breakage with ext4 in the past.
>
> That is extremely unlikely unless your storage layer has unreliable barriers, and then you need a lot of "optimism".
>
>> Of course “persistence means you’ll have to say I’m sorry” and thus with either choice we may be faced with some issue in the future that we might have circumvented with another solution and yes flexibility is worth a great deal.
>
> Enterprise SSDs with high small-random-write IOPS-per-TB can give both excellent speed and high flexibility :-).
>
>> We’ve run XFS and ext4 on different (large and small) workloads in the last 2 years and I have to say I’m much happier with XFS even with the shrinking limitation.
>
> XFS and 'ext4' are essentially equivalent, except for the fixed-size inode table limitation of 'ext4' (and XFS reportedly has finer-grained locking). Btrfs is nearly as good as either on most workloads in single-device mode, without using the more complicated features (compression, qgroups, ...) and with appropriate use of the 'nocow' options, and it gives checksums on data too if needed.
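>
> For completeness: 'nocow' has to be applied before a file has any data, so the usual trick is to set it on a fresh directory and let new image files inherit it. A small sketch with a made-up path; note that NOCOW files also lose Btrfs data checksumming and compression:
>
>   mkdir /srv/vm-images
>   chattr +C /srv/vm-images        # new files created here inherit NOCOW
>   lsattr -d /srv/vm-images        # should show the 'C' attribute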
>
>> To us ext4 is prohibitive with its fsck performance and we do like the tight error checking in XFS.
>
> It is very pleasing to see someone care about the speed of whole-tree operations like 'fsck', a very often forgotten "little detail". But in my experience 'ext4' checking is quite competitive with XFS checking and repair, at least in recent years, as both have been hugely improved. XFS checking and repair still require a lot of RAM though.
>
>> Thanks for the reminder though - especially in the public archive making this tradeoff with flexibility known is wise to communicate. :-)
>
> "Flexibility" in filesystems, especially on rotating disk storage with extremely anisotropic performance envelopes, is very expensive, but of course lots of people know better :-).
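>
> PS: on the 'fsck' point above, both filesystems have no-modify check modes that are handy for gauging how long a repair of a filetree this size would take. The device names below are made up, and the volumes should be unmounted:
>
>   xfs_repair -n /dev/mapper/vgsys-xfsvol    # XFS check only, no changes written
>   e2fsck -f -n /dev/mapper/vgsys-ext4vol    # ext4 forced check, answers "no" to all fixes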