Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
[I'm combining the messages again, since I feel a bit bad, when I write so many mails to the list ;) ] But from my side, feel free to split up as much as you want (perhaps not single characters or so ;) ) On Thu, 2015-12-17 at 04:06 +, Duncan wrote: > Just to mention here, that I said "integrity management features", > which > includes more than checksumming. As Austin Hemmelgarn has been > pointing > out, DBs and some VMs do COW, some DBs do checksumming or at least > have > that option, and both VMs and DBs generally do at least some level > of > consistency checking as they load. Those are all "integrity > management > features" at some level. Okay... well, but the point of that whole thread was obviously data integrity protection in the sense of what data checksumming does in btrfs for CoWed data and for meta-data. In other words: checksums at some blockleve, which are verified upon every read. > As for bittorrent, I /think/ the checksums are in the torrent files > themselves (and if I'm not mistaken, much as git, the chunks within > the > file are actually IDed by checksum, not specific position, so as long > as > the torrent is active, uploading or downloading, these will by > definition > be retained). As long as those are retained, the checksums should > be > retained. And ideally, people will continue to torrent the files > long > after they've finished downloading them, in which case they'll still > need > the torrent files themselves, along with the checksums info. Well I guess we don't need to hook up ourselves so much on the p2p formats. They're just one examples, even if these would actually be integrity protected in the sense as described above, well, fine, but there are other major use cases left, for which this is not the case. Of course one can also always argue, that users can then manually move the files out of the no-CoWed area or manually create their own checksums as I do and store them in XATTRS. 
But all this is not real proper full checksum protection: there are gaps, where things are not protected and normal users may simply not do/know all this (and why shouldn't they still benefit from proper checksumming if we can make it for them). IMHO, even the argument that one could manually make checksums or move the file to CoWed area, while the e.g. downloaded files are still in cache doesn't count: that wouldn't work for VMs, DBs, and certainly not for torrent files larger than the memory. > Meanwhile, if they do it correctly there's no window without > protection, > as the torrent file can be used to double-verify the file once moved, > as > well, before deleting it. Again, would work only for torrent-like files, not for VM images, only partially for DBs... plus... why requiring users to make it manually, if the fs could take care of it. On Thu, 2015-12-17 at 05:07 +, Duncan wrote: > > In kinda curios, what free space fragmentation actually means here. > > > > Ist simply like this: > > +--+-+---++ > > > F| D | F |D | > > +--+-+---++ > > Where D is data (i.e. files/metadata) and F is free space. > > In other words, (F)ree space itself is not further subdivided and > > only > > fragmented by the (D)ata extents in between. > > > > Or is it more complex like this: > > +-++-+---++ > > > F | F | D | F |D | > > +-++-+---++ > > Where the (F)ree space itself is subdivided into "extents" (not > > necessarily of the same size), and btrfs couldn't use e.g. the > > first two > > F's as one contiguous amount of free space for a larger (D)ata > > extent > At the one level, I had the simpler f/d/f/d scheme in mind, but that > would be the case inside a single data chunk. At the higher file > level, > with files significant fractions of the size of a single data chunk > to > much larger than a single data chunk, the more complex and second > f/f/d/f/d case would apply, with the chunk boundary as the > separation > between the f/f. 
Okay, but that's only when there are data chunks that neighbour each other... since the data chunks are normally rather big (1GB), that shouldn't be such a big issue... so I guess the real world looks like this:

       DC#1      DC#2
...------+ +-----+-----+-----+-----+
...   F  | |  F  |  D  |  F  |  D  |
...------+ +-----+-----+-----+-----+

(with DC = data chunk)

but it could NOT look like this:

       DC#1      DC#2
...------+ +-----+-----+-----+-----+-----+
...   F  | |  F  |  F  |  D  |  F  |  D  |
...------+ +-----+-----+-----+-----+-----+

In other words, there could be 2 adjacent free space "extents", when these are actually parts of different neighbouring chunks, but there could NOT be >=2 adjacent free space "extents"
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On Wed, 2015-12-09 at 16:36 +, Duncan wrote: > But... as I've pointed out in other replies, in many cases including > this > specific one (bittorrent), applications have already had to develop > their > own integrity management features Well let's move discussion upon that into the "dear developers, can we have notdatacow + checksumming, plz?" where I showed in one of the more recent threads that bittorrent seems rather to be the only thing which does use that per default... while on the VM image front, nothing seems to support it, and on the DB front, some support it, but don't use it per default. > In the bittorrent case specifically, torrent chunks are already > checksummed, and if they don't verify upon download, the chunk is > thrown > away and redownloaded. I'm not a bittorrent expert, because I don't use it, but that sounds to be more like the edonkey model, where - while there are checksums - these are only used until the download completes. Then you have the complete file, any checksum info thrown away, and the file again being "at risk" (i.e. not checksum protected). > And after the download is complete and the file isn't being > constantly > rewritten, it's perfectly fine to copy it elsewhere, into a dir where > nocow doesn't apply. Sure, but again, nothing the user may automatically do, and there's still the gap between the final verification from the bt software, to the time it's copied over. Arguably, that may be very short, but I see no reasons to make any breaks in the everything-verified chain from the btrfs side. > With the copy, btrfs will create checksums, and if > you're paranoid you can hashcheck the original nocow copy against the > new > checksummed/cow copy, and after that, any on-media changes will be > caught > by the normal checksum verification mechanisms. As before... of course you're right that one can do this, but nothing that happens per default. And I think that's just one of the nice things btrfs would/should give us. 
That the filesystem assures that data is valid, at least in terms of storage device and bus errors (it cannot protect of course against memory errors or that like). > > Hmm doesn't seem really good to me if systemd would do that, cause > > it > > then excludes any such files from being snapshot. > > Of course if the directories are already present due to systemd > upgrading > from non-btrfs-aware versions, they'll remain as normal dirs, not > subvolumes. This is the case here. Well, even if not, because one starts from a fresh system... people may not want that. > And of course you can switch them around to dirs if you like, and/or > override the shipped tmpfiles.d config with your own. ... sure but, people may not even notice that. I don't think such a decision is up to systemd. Anyway, since we're btrfs here, not systemd, that shouldn't bother us ;) > > > and small ones such as the sqlite files generated by firefox and > > > various email clients are handled quite well by autodefrag, with > > > that > > > general desktop usage being its primary target. > > Which is however not yet the default... > Distro integration bug! =:^) Nah,... really not... I'm quite sure that most distros will generally decide against diverting from upstream in such choices. > > It feels a bit, if there should be some tools provided by btrfs, > > which > > tell the users which files are likely problematic and should be > > nodatacow'ed > And there very well might be such a tool... five or ten years down > the > road when btrfs is much more mature and generally stabilized, well > beyond > the "still maturing and stabilizing" status of the moment. 
Hmm, let's hope btrfs isn't finished only when the next-gen default fs arrives ;^)

> But it can be the case that as filesystem fragmentation levels rise,
> free-space itself is fragmented, to the point where files that would
> otherwise not be fragmented as they're created once and never touched
> again, end up fragmented, because there's simply no free-space extents
> big enough to create them in unfragmented, so a bunch of smaller
> free-space extents must be used where one larger one would have been
> used had it existed.

I'm kinda curious what free space fragmentation actually means here.

Is it simply like this:

+-----+-----+-----+-----+
|  F  |  D  |  F  |  D  |
+-----+-----+-----+-----+

Where D is data (i.e. files/metadata) and F is free space.
In other words, (F)ree space itself is not further subdivided and only fragmented by the (D)ata extents in between.

Or is it more complex like this:

+-----+-----+-----+-----+-----+
|  F  |  F  |  D  |  F  |  D  |
+-----+-----+-----+-----+-----+

Where the (F)ree space itself is subdivided into "extents" (not necessarily of the same size), and btrfs couldn't use e.g. the first two F's as one contiguous amount of free space for a larger (D)ata extent of that size:

+-----------+-----+-----+-----+
|     D     |  D  |  F  |  D  |
Re: btrfs: poor performance on deleting many large files
On Sun, 2015-12-13 at 07:10 +, Duncan wrote: > > So you basically mean that ro snapshots won't have their atime > > updated > > even without noatime? > > Well I guess that was anyway the recent behaviour of Linux > > filesystems, > > and only very old UNIX systems updated the atime even when the fs > > was > > set ro. > > I'd test it to be sure before relying on it (keeping in mind that my > own > use-case doesn't include subvolumes/snapshots so it's quite possible > I > could get fine details of this nature wrong), but that would be my > very > (_very_! see next) strong assumption, yes. > > Because read-only snapshots are used for btrfs-send among other > things, > with the idea being that the read-only will keep them from changing > in > the middle of the send, and ro snapshot atime updates would seem to > throw > that entirely out the window. So I can't imagine ro snapshots doing > atime > updates under any circumstance because I just can't see how send > could > rely on them then, but I'd still test it before counting on it. For those who haven't followed up the other threads: I've tried it out and yes, ro-snapshots (as well as ro mounted btrfs filesystem/subvolumes) don't have their atimes changed on e.g. read. > AFAIK, the general idea was to eventually have all the (possible, > some > are global-filesystem-scope) subvolume mount options exposed as > properties, it's just not implemented yet, but I'm not entirely sure > if > that was all /btrfs-specific/ mount options, or included the generic > ones > such as the *atime and no* (noexec/nodev/...) options as well. In > view > of that and the fact that noatime is generic, adding it as a specific > request still makes sense. Someone with more specific knowledge on > the > current plan can remove it if it's already covered. 
Not sure if I had already posted that here, but I did write some of these ideas up and added them to the wiki: https://btrfs.wiki.kernel.org/index.php?title=Project_ideas=historysubmit=29757=29743

Best wishes,
Chris.
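For anyone who wants to repeat the "test it before relying on it" step from this message, here is a minimal sketch in Python. The function name `read_changes_atime` is made up for illustration, and the check only compares the `st_atime_ns` value reported by `os.stat` before and after a one-byte read. Keep in mind that on a relatime mount a read may leave atime untouched even on a writable filesystem, so a `False` result on a ro snapshot only means something when compared against the same test on a rw mount with strictatime.

```python
import os
import time


def read_changes_atime(path):
    """Return True if reading one byte from `path` changed its access time."""
    before = os.stat(path).st_atime_ns
    # Sleep briefly so that an updated timestamp would differ from the old one.
    time.sleep(0.01)
    with open(path, "rb") as handle:
        handle.read(1)
    after = os.stat(path).st_atime_ns
    return after != before
```

Called on a file inside a read-only snapshot (the path is whatever your setup uses), this is expected to return `False` according to the result reported above.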
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On Mon, 2015-12-14 at 10:51 +, Duncan wrote: > > AFAIU, the one the get's fragmented then is the snapshot, right, > > and the > > "original" will stay in place where it was? (Which is of course > > good, > > because one probably marked it nodatacow, to avoid that > > fragmentation > > problem on internal writes). > > No. Or more precisely, keep in mind that from btrfs' perspective, in > terms of reflinks, once made, there's no "original" in terms of > special > treatment, all references to the extent are treated the same. Sure... you misunderstood me I guess.. > > What a snapshot actually does is create another reference (reflink) > to an > extent. [snip snap] > So in the > case of nocow, a cow1 (one-time-cow) exception must be made, > rewriting > the changed data to a new location, as the old location continues to > be > referenced by at least one other reflink. That's what I've meant. > So (with the fact that writable snapshots are available and thus it > can > be the snapshot that changed if it's what was written to) the one > that > gets the changed fragment written elsewhere, thus getting fragmented, > is > the one that changed, whether that's the working copy or the snapshot > of > that working copy. Yep,.. that's what I've suspected and asked for. The "original" file, in the sense of the file that first reflinked the contiguous blocks,... will continue to point to these continuous blocks. While the "new" file, i.e he CoW-1-ed snapshot's file, will partially reflink blocks form the contiguous range, and it's rewritten blocks will reflink somewhere else. Thus the "new" file is the one that gets fragmented. > > And one more: > > You both said, auto-defrag is generally recommended. > > Does that also apply for SSDs (where we want to avoid unnecessary > > writes)? > > It does seem to get enabled, when SSD mode is detected. > > What would it actually do on an SSD? 
> Did you mean it does _not_ seem to get (automatically) enabled, when
> SSD mode is detected, or that it _does_ seem to get enabled, when
> specifically included in the mount options, even on SSDs?

It does seem to get enabled when specifically included in the mount options (the ssd mount option is not used), i.e.:

/dev/mapper/system / btrfs subvol=/root,defaults,noatime,autodefrag 0 1

leads to:

[5.294205] BTRFS: device label foo devid 1 transid 13 /dev/disk/by-label/foo
[5.295957] BTRFS info (device sdb3): disk space caching is enabled
[5.296034] BTRFS: has skinny extents
[ 67.082702] BTRFS: device label system devid 1 transid 60710 /dev/mapper/system
[ 67.85] BTRFS info (device dm-0): disk space caching is enabled
[ 67.111267] BTRFS: has skinny extents
[ 67.305084] BTRFS: detected SSD devices, enabling SSD mode
[ 68.562084] BTRFS info (device dm-0): enabling auto defrag
[ 68.562150] BTRFS info (device dm-0): disk space caching is enabled

> Or did you actually mean it the way you wrote it, that it seems to be
> enabled (implying automatically, along with ssd), when ssd mode is
> detected?

No, sorry for being unclear. What I meant is that having the ssd detected doesn't auto-disable auto-defrag, which I thought might make sense, given that I didn't know exactly what it would do on SSDs...
IIRC, Hugo or Austin mentioned the point about it making for better IOPS, but I hadn't considered that to have enough impact... so I thought it could have made sense to ignore the "autodefrag" mount option in case an ssd was detected.

> There are three factors I'm aware of here as well, all favoring
> autodefrag, just as the two above favored leaving it off.
>
> 1) IOPS, Input/Output Operations Per Second. SSDs typically have both
> an IOPS and a throughput rating. And unlike spinning rust, where raw
> non-sequential-write IOPS are generally bottlenecked by seek times, on
> SSDs with their zero seek-times, IOPS can actually be the bottleneck.
Hmm it would be really nice to get someone who has found a way to make some sound analysis/benchmarking of that. > 2) SSD physical write and erase block sizes as multiples of the > logical/ > read block size. To the extent that extent sizes are multiples of > the > write and/or erase-block size, writing larger extents will reduce > write > amplification due to writing and blocks smaller than the write or > erase > block size. Hmm... okay I don't know the details of how btrfs does this, but I'd have expected that all extents are aligned to the underlying physical devices' block structure. Thus each extent should start at such write/erase block, and at most it shouldn't perfectly at the end of the extent. If the file is fragmented (i.e. more than one extent), I'd have even hoped that all but the last one fit perfectly. So what you basically mean, AFAIU, is that by having auto-defrag, you get larger extents (i.e. smaller ones collapsed into one) and by thus you get less cut off at the end of extents where these
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Am Wed, 9 Dec 2015 13:36:01 + (UTC) schrieb Duncan <1i5t5.dun...@cox.net>: > >> > 4) Duncan mentioned that defrag (and I guess that's also for > >> > auto- defrag) isn't ref-link aware... > >> > Isn't that somehow a complete showstopper? > > >> It is, but the one attempt at dealing with it caused massive data > >> corruption, and it was turned off again. > > IIRC, it wasn't data corruption so much, as massive scaling issues, > to the point where defrag was entirely useless, as it could take a > week or more for just one file. > > So the decision was made that a non-reflink-aware defrag that > actually worked in something like reasonable time even if it did > break reflinks and thus increase space usage, was of more use than a > defrag that basically didn't work at all, because it effectively took > an eternity. After all, you can always decide not to run it if you're > worried about the space effects it's going to have, but if it's going > to take a week or more for just one file, you effectively don't have > the choice to run it at all. > > > So... does this mean that it's still planned to be implemented some > > day or has it been given up forever? > > AFAIK it's still on the list. And the scaling issues are better, but > one big thing holding it up now is quota management. Quotas never > have worked correctly, but they were a big part (close to half, IIRC) > of the original snapshot-aware-defrag scaling issues, and thus must > be reliably working and in a generally stable state before a > snapshot-aware-defrag can be coded to work with them. And without > that, it's only half a solution that would have to be redone when > quotes stabilized anyway, so really, quota code /must/ be stabilized > to the point that it's not a moving target, before reimplementing > snapshot-aware-defrag makes any sense at all. > > But even at that point, while snapshot-aware-defrag is still on the > list, I'm not sure if it's ever going to be actually viable. 
It may > be that the scaling issues are just too big, and it simply can't be > made to work both correctly and in anything approaching practical > time. Time will tell, of course, but until then... I'd like to throw in an idea... Couldn't auto-defrag just be made "sort of reflink-aware" in a very simple fashion: Just let it ignore extents that are shared? That way you can still enjoy it benefits in a mixed-mode scenario where you are working with snapshots partly but other subvolumes are never taken snapshots of. Comments? -- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
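To make the "just ignore shared extents" idea above a bit more concrete, here is a toy model in Python. It is only a sketch of the selection rule, not btrfs code: the `Extent` class, its `refs` field, and the function `defrag_candidates` are all hypothetical names, and the 256 KiB threshold is simply borrowed from the defrag default mentioned elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    offset: int  # logical offset within the file, in bytes
    length: int  # extent length, in bytes
    refs: int    # number of references to this extent (1 = not shared)


def defrag_candidates(extents, max_len=256 * 1024):
    """Return the extents a 'skip shared extents' defrag pass would rewrite.

    An extent is a candidate only if nothing else references it (refs == 1)
    and it is smaller than max_len. Shared extents are left alone, so no
    reflink is broken and no extra space is consumed.
    """
    return [e for e in extents if e.refs == 1 and e.length < max_len]


file_extents = [
    Extent(offset=0, length=4096, refs=1),            # small, unshared
    Extent(offset=4096, length=8192, refs=2),         # small, but snapshotted
    Extent(offset=12288, length=1024 * 1024, refs=1),  # already large enough
]
for extent in defrag_candidates(file_extents):
    print(extent)
```

In this example only the first extent is selected. The obvious trade-off is that a file whose extents are mostly shared would stay fragmented.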
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: >> And there very well might be such a tool... five or ten years down the >> road when btrfs is much more mature and generally stabilized, well >> beyond the "still maturing and stabilizing" status of the moment. > Hmm let's hope btrfs isn't finished only when the next-gen default fs > arrives ;^) [Again, breaking into smaller point replies...] Well, given the development history for both zfs and btrfs to date, five to ten years down the line, with yet another even newer filesystem then already under development, is more being "real", than not. Also see the history in MS' attempt at a next-gen filesystem. The reality is these things take FAR longer than one might think. FWIW, on the wiki I see feature points and benchmarks for v0.14, introduced in April of 2008, and a link to an earlier btree filesystem on which btrfs apparently was based, dating to 2006, so while I don't have a precise beginning date, and to some extent such a thing would be rather arbitrary anyway, as Chris would certainly have done some major thinking, preliminary research and coding, before his first announcement, a project origin in late 2006 or sometime in 2007 has to be quite close. And (as I noted in a parenthetical at my discovery in a different thread), I switched to btrfs for my main filesystems when I bought my first SSDs, in June of 2013, so already a quarter decade ago. At the time btrfs was just starting to remove some of the more dire "experimental" warnings. Obviously it has stabilized quite a bit since then, but due to the oft-quoted 80/20 rule and extensions, where the last 20% of the progress takes 80% of the work, etc... It could well be another five years before btrfs is at a point I think most here would call stable. That would be 2020 or so, about 13 years for the project, and if you look at the similar projects mentioned above, that really isn't unrealistic at all. 
Ten years minimum, and that's with serious corporate level commitments and a lot more dedicated devs than btrfs has. 12 years not unusual at all, and a decade and a half still well within reasonable range, for a filesystem with this level of complexity, scope, and features. And realistically, by that time, yet another successor filesystem may indeed be in the early stages of development, say at the 20/80 point, 20% of required effort invested, possibly 80% of the features done, but not stabilized. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > In kinda curios, what free space fragmentation actually means here. > > Ist simply like this: > +--+-+---++ > | F | D | F | D | > +--+-+---++ > Where D is data (i.e. files/metadata) and F is free space. > In other words, (F)ree space itself is not further subdivided and only > fragmented by the (D)ata extents in between. > > Or is it more complex like this: > +-++-+---++ > | F | F | D | F | D | > +-++-+---++ > Where the (F)ree space itself is subdivided into "extents" (not > necessarily of the same size), and btrfs couldn't use e.g. the first two > F's as one contiguous amount of free space for a larger (D)ata extent [still breaking into smaller points for reply] At the one level, I had the simpler f/d/f/d scheme in mind, but that would be the case inside a single data chunk. At the higher file level, with files significant fractions of the size of a single data chunk to much larger than a single data chunk, the more complex and second f/f/d/f/d case would apply, with the chunk boundary as the separation between the f/f. IOW, files larger than data chunk size will always be fragmented into data chunk size fragments/extents, at the largest, because chunks are designed to be movable using balance, device remove, replace, etc. So (using the size numbers from a recent comment from Qu in a different thread), on a filesystem with under 100 GiB total space-effective (space- effective, space available, accounting for the replication type, raid1, etc, and I'm simplifying here...), data chunks should be 1 GiB, while above that, with striping, they might be upto 10 GiB. Using the 1 GiB nominal figure, files over 1 GiB would always be broken into 1 GiB maximum size extents, corresponding to 1 extent per chunk. 
But while 4 KiB extents are clearly tiny and inefficient at today's scale, in practice, efficiency gains break down at well under GiB scale, with AFAIK 128 MiB being the upper bound at which any efficiency gains could really be expected, and 1 MiB arguably being a reasonable point at which further increases in extent size likely won't have a whole lot of effect even on SSD erase-block (where 1 MiB is a nominal max), but that's that's still 256X the usual 4 KiB minimum data block size, 8X the 128 KiB btrfs compression-block size, and 4X the 256 KiB defrag default "don't bother with extents larger than this" size. Basically, the 256 KiB btrfs defrag "don't bother with anything larger than this" default is quite reasonable, tho for massive multi-gig VM images, the number of 256 KiB fragments will still look pretty big, so while technically a very reasonable choice, the "eye appeal" still isn't that great. But based on real reports posting before and after numbers from filefrag (on uncompressed btrfs), we do have cases where defrag can't find 256 KiB free-space blocks and thus can actually fragment a file worse than it was before, so free-space fragmentation is indeed a very real problem. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
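The size ratios quoted in this message are easy to check with a few lines of Python. This is just arithmetic on the figures given above; the 10 GiB VM image is a made-up example size, and `min_extents` is a hypothetical helper that gives the best-case extent count for a given maximum extent size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def min_extents(file_size, max_extent):
    """Smallest possible number of extents for a file (ceiling division)."""
    return -(-file_size // max_extent)


# 1 MiB compared with the smaller sizes mentioned above.
print(MIB // (4 * KIB))    # 256: vs. the 4 KiB minimum data block size
print(MIB // (128 * KIB))  # 8:   vs. the 128 KiB compression-block size
print(MIB // (256 * KIB))  # 4:   vs. the 256 KiB defrag default threshold

# A hypothetical 10 GiB VM image.
image_size = 10 * GIB
print(min_extents(image_size, 1 * GIB))    # 10 extents with 1 GiB data chunks
print(min_extents(image_size, 256 * KIB))  # 40960 extents of 256 KiB each
```

The last number shows why the fragment count of a large image still looks big even when every extent meets the 256 KiB threshold.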
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > I'm a bit unsure how to read filefrag's output... (even in the > uncompressed case). > What would it show me if there was fragmentation /path/to/file: 18 extents found It tells you the number of extents found. Nominally, each extent should be a fragment, but as has been discussed elsewhere, on btrfs compressed files it will interpret each 128 KiB btrfs compression block as its own extent, even if (as seen in verbose mode) the next one begins where the previous one ends so it's really just a single extent. Apparently on ext3/4, it's possible to have multi-gig files as a single extent, thus unfragmented, but as explained in an earlier reply to a point earlier in your post, on btrfs, extents of a GiB are nominally the best you can do as that's the nominal data chunk size, tho in limited circumstances larger extents are still possible on btrfs. In the case above, where I took the 18 extents result from a real file (tho obviously the posted path isn't real), it was 4 MiB in size (I think exactly, it's a 4 MiB BIOS image =:^), so doing the math, extents average 227 KiB. That's on a filesystem that is always mounted with autodefrag, but it's also always mounted with compress, so it's possible some of the reported extents are compressed. Actually, looking at filefrag -v output (which I've never used before but which someone noted could be used to check fragmentation on compressed files, tho it's not as straightforward as you might think), it looks like all but two of the listed extents are 32 blocks long (with 4096 byte blocks), which equates to 128 KiB, the btrfs compression-block size, and the two remaining extents are 224 blocks long or 896 KiB, an exact 7 multiple of 128 KiB, so this file would indeed appear to be compressed except for those two uncompressed extents. 
(As for figuring out how to interpret the full -v output to know whether the compressed blocks are actually single extents or not, as I said this is my first time trying -v, and I didn't bother going that far with it.)

--
Duncan
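For reference, the numbers in the filefrag discussion above can be reproduced with a short Python calculation. It uses only the figures given in the message (a 4 MiB file, 18 extents, 4096-byte blocks, extents of 32 and 224 blocks); the variable names are just for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
BLOCK_SIZE = 4096  # bytes per block, as reported by filefrag -v

file_size = 4 * MIB
extent_count = 18

# Average extent size in whole KiB.
average_kib = file_size // extent_count // KIB
print(average_kib)  # 227

# Extent lengths in blocks converted to KiB.
small_extent_kib = 32 * BLOCK_SIZE // KIB
large_extent_kib = 224 * BLOCK_SIZE // KIB
print(small_extent_kib)  # 128, the btrfs compression-block size
print(large_extent_kib)  # 896

# The larger extents are a whole multiple of the compression-block size.
print(large_extent_kib // small_extent_kib)  # 7
```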
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

>> he obviously didn't think thru the fact that compression MUST be a
>> rewrite, thereby breaking snapshot reflinks, even were normal
>> non-compression defrag to be snapshot aware, because compression
>> substantially changes the way the file is stored), that's _implied_,
>> not explicit.
> So you mean, even if ref-link aware defrag would return, it would still
> break them again when compressing/uncompressing/recompressing?
> I'd have hoped that then, all snapshots respectively other reflinks
> would simply also change to being compressed,

You're correct. I "obviously didn't think thru" that the whole way, myself. =:^(

But meanwhile, we don't have snapshot-aware-defrag, and in that case, the implication... and his result... remains.

--
Duncan
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted:

>> It's certainly in quite a few on-list posts over the years
> okay,.. in other words: no ;-)
> scatter over the years list posts don't count as documentation :P

=:^)

--
Duncan
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as excerpted: > On Wed, 2015-12-09 at 16:36 +, Duncan wrote: >> But... as I've pointed out in other replies, in many cases including >> this specific one (bittorrent), applications have already had to >> develop their own integrity management features > Well let's move discussion upon that into the "dear developers, can we > have notdatacow + checksumming, plz?" where I showed in one of the more > recent threads that bittorrent seems rather to be the only thing which > does use that per default... while on the VM image front, nothing seems > to support it, and on the DB front, some support it, but don't use it > per default. > >> In the bittorrent case specifically, torrent chunks are already >> checksummed, and if they don't verify upon download, the chunk is >> thrown away and redownloaded. > I'm not a bittorrent expert, because I don't use it, but that sounds to > be more like the edonkey model, where - while there are checksums - > these are only used until the download completes. Then you have the > complete file, any checksum info thrown away, and the file again being > "at risk" (i.e. not checksum protected). [I'm breaking this into smaller replies again.] Just to mention here, that I said "integrity management features", which includes more than checksumming. As Austin Hemmelgarn has been pointing out, DBs and some VMs do COW, some DBs do checksumming or at least have that option, and both VMs and DBs generally do at least some level of consistency checking as they load. Those are all "integrity management features" at some level. As for bittorrent, I /think/ the checksums are in the torrent files themselves (and if I'm not mistaken, much as git, the chunks within the file are actually IDed by checksum, not specific position, so as long as the torrent is active, uploading or downloading, these will by definition be retained). As long as those are retained, the checksums should be retained. 
And ideally, people will continue to torrent the files long after they've finished downloading them, in which case they'll still need the torrent files themselves, along with the checksums info. And for longer term storage, people really should be copying/moving their torrented files elsewhere, in such a way that they either eliminate the fragmentation if the files weren't nocowed, or eliminate the nocow attribute and get them checksum-protected as normal for files not intended to be constantly randomly rewritten, which will be the case once they're no longer being actively downloaded. Of course that's at the slightly technically oriented user level, but then, the whole nocow thing, or even caring about checksums and longer term file integrity in the first place, is also technically oriented user level. Normal users will just download without worrying about the nocow in the first place, and perhaps wonder why the disk is thrashing so, but not be inclined to do anything about it except perhaps switch back to their old filesystem, where it was faster and the disk didn't sound as bad. In doing so, they'll either automatically get the checksuming along with the worse performance, or go back to a filesystem without the checksumming, and think it's fine as they know no different. Meanwhile, if they do it correctly there's no window without protection, as the torrent file can be used to double-verify the file once moved, as well, before deleting it. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
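The re-verification step mentioned above (checking a moved file against the torrent's checksums before deleting the original) could look roughly like the following Python sketch. It is only an illustration. It assumes BitTorrent v1 style fixed-size pieces hashed with SHA-1, and that the expected digests have already been extracted from the .torrent file by some other means. `piece_hashes` and `verify_pieces` are hypothetical names, not part of any torrent client.

```python
import hashlib
import io


def piece_hashes(stream, piece_length):
    """Yield the SHA-1 digest of each fixed-size piece read from stream."""
    while True:
        piece = stream.read(piece_length)
        if not piece:
            break
        yield hashlib.sha1(piece).digest()


def verify_pieces(stream, piece_length, expected):
    """Return the indices of pieces that do not match the expected digests.

    `expected` is assumed to be a list of raw SHA-1 digests, one per piece,
    obtained elsewhere (for example from the torrent metadata).
    """
    actual = list(piece_hashes(stream, piece_length))
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # A length mismatch means pieces are missing or extra; report those too.
    if len(actual) != len(expected):
        low, high = sorted((len(actual), len(expected)))
        bad.extend(range(low, high))
    return bad


if __name__ == "__main__":
    data = b"a" * 10 + b"b" * 10 + b"c" * 5
    digests = list(piece_hashes(io.BytesIO(data), 10))
    print(verify_pieces(io.BytesIO(data), 10, digests))
```

An empty result would mean every piece of the copy matched, at which point the nocow original could be deleted without a window in which the data is unprotected.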
Re: btrfs: poor performance on deleting many large files
Lionel Bouton posted on Tue, 15 Dec 2015 03:38:33 +0100 as excerpted: > I just checked: this has only be made crystal-clear in the latest > man-pages version 4.03 released 10 days ago. > > The mount(8) page of Gentoo's current stable man-pages (4.02 release in > August) which is installed on my systems states for noatime: > "Do not update inode access times on this filesystem (e.g., for faster > access on the news spool to speed up news servers)." Hmm... I hadn't synced and updated in about that time, and sure enough, while I've just synced I've not yet updated, and still have man-pages 4.02 installed. But, the mount.8(.bz2 in my case as that's the compression I'm configured for, I had to use man -d mount to debug-dump what file it was actually loading) manpage actually belongs to util-linux, according to equery belongs, while equery files man-pages | grep mount only returns hits for mount.2(.bz2 and umount). So at least here, it's util-linux providing the mount (8) manpage, not man-pages. Tho I'm on ~amd64 and IIRC just updated util-linux in the last update, so the cross-ref to nodiratime in the noatime entry (saying it isn't necessary as noatime covers it) probably came from there, or a similar recent util-linux update. Let's see... My current util-linux (with the xref in both noatime and nodiratime to the other, saying nodiratime isn't needed if noatime is used) is 2.27.1. The oldest version I still have in my binpkg cache (tho I likely have older on the backup) is util-linux 2.24.2. For noatime it has the wording you mention, don't update inode access times, but for nodiratime, it specifically mentions directory inode access times. So from util-linux 2.24.2 at least, the information was there, but you had to read between the lines a bit more, because nodiratime mentions dir inodes, and noatime says don't update atime on inodes, so it's there but you have to be a reasonably astute reader to see it. 
In between those two I have other versions including 2.26.2 and 2.27. Looks like 2.27 added both the "implies nodiratime" wording to the noatime entry, and the nodiratime unneeded if noatime set notation to the nodiratime entry. If there was a util-linux 2.26.x beyond x=2, I apparently never installed it, so the wording likely changed with 2.27, but may have changed with late 2.26 versions as well, if there were any beyond 2.26.2. And on gentoo, 2.26.2 appears to be the latest stable-keyworded, so that's what stable users would have. But as I said, the info is there at least as of 2.24.2, you just have to note in the nodiratime entry that it says dir inodes, while the noatime entry simply says inodes, without excluding dir inodes. So it's there, you just have to be a somewhat astute reader to note it. Anywhere else, say on-the-net recommendations for nodiratime, /should/ mention that they aren't necessary if noatime is used as well, but of course not all of them will. (Tho I'd actually find it a bit strange to see discussion of nodiratime without discussion of noatime as well, as I'd guess any discussion of just one of the two would likely be on noatime, leaving nodiratime unmentioned if they're only covering one, as it shouldn't be necessary to mention, since it's already included in noatime.) But there's probably a bunch of folks who originally read coverage of noatime, then saw nodiratime later, and thought "Oh, that's separate? Well I want that too!" and simply enabled them both, without actually checking the manpage or other documentation including on-the-net discussion. I know here I originally saw noatime and decided I wanted it, then was confused when I saw nodiratime sometime later. But I don't just enable stuff without having some idea what I'm enabling, so I did my research, and saw noatime implied nodiratime as well, so the only reason nodiratime might be needed would be if you wanted atime in general, but not on dirs. 
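For anyone wanting to find leftover redundant entries, here is a small Python sketch of the rule discussed above: nodiratime adds nothing once noatime is present. It assumes the usual six-field fstab layout (device, mount point, type, options, dump, pass). The function names are made up for illustration and it reads a string rather than touching /etc/fstab itself.

```python
def redundant_atime_options(options):
    """Return options made redundant by noatime in a comma-separated string."""
    opts = [o.strip() for o in options.split(",") if o.strip()]
    if "noatime" in opts and "nodiratime" in opts:
        return ["nodiratime"]
    return []


def check_fstab(text):
    """Map mount point -> redundant options for each non-comment fstab line."""
    findings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or incomplete entry; skip it
        redundant = redundant_atime_options(fields[3])
        if redundant:
            findings[fields[1]] = redundant
    return findings


if __name__ == "__main__":
    sample = (
        "# example entries, not from a real system\n"
        "/dev/sda1 /     btrfs noatime,nodiratime,ssd 0 0\n"
        "/dev/sda2 /home btrfs nodiratime             0 0\n"
    )
    print(check_fstab(sample))
```

Note that nodiratime on its own (the second sample entry) is not flagged, since that is the one case where it still means something: atime in general, but not on directories.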
--
Duncan
Re: btrfs: poor performance on deleting many large files
Le 15/12/2015 02:49, Duncan a écrit : > Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as > excerpted: > >> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote: >> >>> I use noatime and nodiratime >> FYI: noatime implies nodiratime :-) > Was going to post that myself. Is there some reason you: > > a) use nodiratime when noatime is already enabled, despite the fact that > the latter already includes the former, or I don't (for some time). I didn't check for nodiratime on all the systems I admin so there could be some left around but as they are harmless I only remove them when I happen to stumble on them. > > b) didn't sufficiently research the option (at least the current mount > manpage documents that noatime includes nodiratime under both the noatime > and nodiratime options, I just checked: this has only be made crystal-clear in the latest man-pages version 4.03 released 10 days ago. The mount(8) page of Gentoo's current stable man-pages (4.02 release in August) which is installed on my systems states for noatime: "Do not update inode access times on this filesystem (e.g., for faster access on the news spool to speed up news servers)." This is prone to misinterpretation: directories are inodes but that may not be self-explanatory for everyone. At least it could leave me with a doubt if I wasn't absolutely certain of the behavior (see below): I'm not sure myself that there isn't a difference between a VFS inode (the in-memory structure) and an on-disk structure called inode which some filesystems may not have (I may have been mistaken but IIRC ReiserFS left me with the impression that it wasn't storing directory entries in inodes or it didn't call it that). In fact I remember that when I read statements about noatime implying nodiratime I had to check fs/inode.c after I found a random discussion on the subject mentioning the proof being in the code to make sure of the behavior. 
> and at least some hint of that has been in the > manpage for years as I recall reading it when I first read of nodiratime > and checked whether my noatime options included it) before standardizing > on it, or > > c) might have actually been talking in general, and there's some mounts > you don't actually choose to make noatime, but still want nodiratime, or I probably used this case for testing purposes (but don't remember a case where it was useful to me). The expression I used was not meant to describe the exact flags in fstab on my systems but the general idea of avoiding files and directories atime updates as by using noatime I'm implicitly using nodiratime too. Sorry for the confusion (I've been confused about the subject a long time which probably didn't help express myself clearly). Best regards, Lionel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as excerpted: > On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote: > >> I use noatime and nodiratime > FYI: noatime implies nodiratime :-) Was going to post that myself. Is there some reason you: a) use nodiratime when noatime is already enabled, despite the fact that the latter already includes the former, or b) didn't sufficiently research the option (at least the current mount manpage documents that noatime includes nodiratime under both the noatime and nodiratime options, and at least some hint of that has been in the manpage for years as I recall reading it when I first read of nodiratime and checked whether my noatime options included it) before standardizing on it, or c) might have actually been talking in general, and there's some mounts you don't actually choose to make noatime, but still want nodiratime, or d) chose that isn't otherwise reflected in the above? If so, please describe, as it could be a learning experience for me, and possibly others as well. >> Finally Linus Torvalds has been quite vocal and consistent on the >> general subject of the kernel not breaking user-space APIs no matter >> what so I wouldn't have much hope for default kernel mount options >> changes... > He surely is right in general,... but when the point has been reached, > where only a minority actually requires the feature... and the minority > actually starts to suffer from that... it may change. Generally speaking, the practical rule is that you don't break userspace, but that a break that isn't noticed and reported by someone within a few release cycles is considered OK, as obviously nobody who actually cares enough about the possibility of old userspace breaking on new kernels enough to test for it was (still) using that functionality anyway. (This is sometimes known as the "if a tree falls in the forest and there's nobody around to hear it, did it actually fall", rule. 
=:^)

But if it's noticed and reported before the new behavior itself is locked into place by other userspace relying on it, the change in behavior must be reverted. (There have actually been a few cases over the years where they went to rather exceptional lengths to make two otherwise incompatible userspace-exposed behaviors both continue to work for the userspace that expected each behavior, without actually coding in such obvious hacks as executable-name conditionals or the like, as others have been known to do at times. Sometimes these fixes do end up bending the rules a bit, particularly the no-policy-in-the-kernel rule, but they do reinforce the no-userspace-breakage rule.)

The possible workarounds include the handful of kernel compatibility options that, when enabled, retain old behavior whose removal would otherwise break userspace, such as old kernel-API procfs files and the like.

That practical rule does in effect make it possible to do userspace-breaking changes, if you wait around long enough that nobody who would complain is still actually using the old behavior.

--
Duncan
Re: btrfs: poor performance on deleting many large files
Austin S. Hemmelgarn posted on Mon, 14 Dec 2015 15:27:11 -0500 as excerpted: > FWIW, both Duncan and I have our own copy of the sources patched to > default to noatime, and I know a number of embedded Linux developers who > do likewise, and I've even heard talk in the past of some distributions > possibly using such patches themselves (although it always ends up not > happening, because of Mutt). And FWIW, while I was reasonably conservative with my original patch and simply defaulted to noatime, turning it off if any of the atime-enabling options were found, I'm beginning to think I might as well simply hard- code noatime, removing the conditions. This is due to initr* behavior that ends up not disabling atime for early, mostly virtual/memory-based filesystems like procfs, sysfs, devfs, tmp-on-tmpfs, etc, but could extend to initial initr* mount of the root filesystem as well, if I decide to make it rw on the kernel commandline or some such. Of course atime on a memory-based-fs isn't normally a huge problem since its all memory-based anyway, and it would enable stuff like atime based tmpwatch since I do a tmpfs based tmp, so I've not worried about it much. But at the same time, I'm now assuming noatime on my systems, and anything that breaks that assumption could trigger hard to trace down bugs, and hardcoding the noatime assumption would bring a consistency that I don't have ATM. If/when I change my patch in that regard, I may look into adding other conditional options, perhaps defaulting to autodefrag if it's btrfs, for instance, if my limited sysadmin-not-developer-level patching/coding skills allow it. I'd have to see... 
But if I did patch in autodefrag, I'd certainly make it a default rather than hard-coding it. I don't currently have large VM images and the like to worry about, where autodefrag can be a performance bottleneck, but I'd like to keep the option of turning it off available for the future.

--
Duncan
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Mon, 14 Dec 2015 02:44:55 +0100 as excerpted: > Two more on these: > > On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote: >> 3) When I would actually disable datacow for e.g. a subvolume that >> > holds VMs or DBs... what are all the implications? >> After snapshotting, modifications are CoWed precisely once, and >> then it reverts to nodatacow again. This means that making a snapshot >> of a nodatacow object will cause it to fragment as writes are made to >> it. > AFAIU, the one the get's fragmented then is the snapshot, right, and the > "original" will stay in place where it was? (Which is of course good, > because one probably marked it nodatacow, to avoid that fragmentation > problem on internal writes). No. Or more precisely, keep in mind that from btrfs' perspective, in terms of reflinks, once made, there's no "original" in terms of special treatment, all references to the extent are treated the same. What a snapshot actually does is create another reference (reflink) to an extent. What btrfs normally does on change as a cow-based filesystem is of course copy-on-write the change. What nocow does, in the absence of other references to that extent, is rewrite the change in-place. But if there's another reference to that extent, the change can't be in- place because that would change the file reached by that other reference as well, and the change was only to be made to one of them. So in the case of nocow, a cow1 (one-time-cow) exception must be made, rewriting the changed data to a new location, as the old location continues to be referenced by at least one other reflink. So (with the fact that writable snapshots are available and thus it can be the snapshot that changed if it's what was written to) the one that gets the changed fragment written elsewhere, thus getting fragmented, is the one that changed, whether that's the working copy or the snapshot of that working copy. > I'd assume the same happens when I do a reflink cp. 
Yes. It's the same reflinking mechanism, after all. If there's other reflinks to the extent, snapshot or otherwise, changes must be written elsewhere, even if they'd otherwise be nocow. > Can one make a copy, where one still has atomicity (which I guess > implies CoW) but where the destination file isn't heavily fragmented > afterwards,... i.e. there's some pre-allocation, and then cp really does > copy each block (just everything's at the state of time where I stared > cp, not including any other internal changes made on the source in > between). The way that's handled is via ro snapshots which are then copied, which of course is what btrfs send does (at least in non-incremental mode, and incremental mode still uses the ro snapshot part to get atomicity), in effect. > And one more: > You both said, auto-defrag is generally recommended. > Does that also apply for SSDs (where we want to avoid unnecessary > writes)? > It does seem to get enabled, when SSD mode is detected. > What would it actually do on an SSD? Did you mean it does _not_ seem to get (automatically) enabled, when SSD mode is detected, or that it _does_ seem to get enabled, when specifically included in the mount options, even on SSDs? Or did you actually mean it the way you wrote it, that it seems to be enabled (implying automatically, along with ssd), when ssd mode is detected? Because the latter would be a shock to me, as that behavior hasn't been documented anywhere, but I can't imagine it's actually doing it and that you actually meant what you actually wrote. If you look waaayyy back to shortly before I did my first more or less permanent deployment (I had initially posted some questions and did an initial experimental deployment several months earlier, but it didn't last long, because $reasons), you'll see a post I made to the list with pretty much the same general question, autodefrag on ssd, or not. 
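To make the cow1 (one-time-cow) behaviour described earlier in this message concrete, here is a deliberately tiny Python model. It is not how btrfs is implemented. It tracks a single extent per file, and every name in it (`ToyFs`, `ToyFile`, `Extent`) is invented for the illustration. The only rule it encodes is the one stated above: a nocow file is rewritten in place only while nothing else references the extent.

```python
class Extent:
    def __init__(self, location):
        self.location = location  # pretend on-disk address
        self.refs = 1             # number of files/snapshots pointing here


class ToyFile:
    def __init__(self, extent, nocow=False):
        self.extent = extent
        self.nocow = nocow


class ToyFs:
    def __init__(self):
        self.next_location = 0

    def _allocate(self):
        extent = Extent(self.next_location)
        self.next_location += 1
        return extent

    def create(self, nocow=False):
        return ToyFile(self._allocate(), nocow)

    def snapshot(self, f):
        """Add another reference (reflink) to the same extent."""
        f.extent.refs += 1
        return ToyFile(f.extent, f.nocow)

    def write(self, f):
        """Return 'in-place' or 'cow' depending on how the write is handled."""
        if f.nocow and f.extent.refs == 1:
            return "in-place"
        # Shared (or ordinary cow) extent: the writer gets a new location,
        # and the old extent stays put for whoever still references it.
        f.extent.refs -= 1
        f.extent = self._allocate()
        return "cow"


if __name__ == "__main__":
    fs = ToyFs()
    vm_image = fs.create(nocow=True)
    print(fs.write(vm_image))   # no other references yet
    snap = fs.snapshot(vm_image)
    print(fs.write(vm_image))   # extent is shared with the snapshot
    print(fs.write(vm_image))   # the new extent is exclusive again
```

Whichever side is written to first after the snapshot is the one that moves, which is the point made above: there is no privileged "original".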
I believe the most accurate short answer is that the benefit of autodefrag on SSD is fuzzy, and thus left to local choice/policy, without an official recommendation either way. There are two points that we know for certain: (1) the zero-seek-time of SSD effectively nullifies the biggest and most direct cost associated with fragmentation on spinning rust, thereby lessening the advantage of autodefrag as seen on spinning rust by an equally large degree, and (2) autodefrag will without question lead to a relatively limited number of near-time additional writes, as the rewrite is queued and eventually processed. To the extent that an admin considers these undisputed factors alone, or weighs them less heavily than the more controversial factors below, they're likely to consider autodefrag on ssd a net negative and leave it off. But I was persuaded by the discussion when I asked the question, to enable autodefrag on my all-ssd btrfs deployment here. Why? Those other, less direct and arguably less directly
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Mon, 14 Dec 2015 03:46:01 +0100 as excerpted:

>> Same here. In fact, my most anticipated feature is N-way-mirroring,
> Hmm ... not totally sure about that...
> AFAIU, N-way-mirroring is what the currently wrongly called RAID1 is
> in btrfs, i.e. having N replicas of everything on M devices, right?
> In other words, not being a N-parity-RAID and not guaranteeing that
> *any* N disks could fail, right?

No. N-way-mirroring, at least in its simplest form (as in md/raid1), is N replicas on N devices, so loss of N-1 devices is permitted without loss of data. Normally the best thing about this is that unlike parity, once the general support is in, you can increase redundancy at will, with guaranteed device-loss protection of as many devices as you care to insure against. At one point, with somewhat old devices that I didn't particularly trust any more and because I had them from a previous raid6 setup, I was running 4-way-md/raid1.

Of course with md/raid1, the problem is the lack of any sort of data integrity assurance. Even scrubbing just arbitrarily chooses one copy and, in the case of a difference, simply copies that to the others, without even a plurality vote to pick the most authoritative version.

With btrfs checksumming, the value of N-way-mirroring is increased dramatically, since it allows individual block verification and fallback, as opposed to only protecting against whole-device loss.

While my own sweet-spot balance will tend to be three-way, avoiding the "if one copy is bad (perhaps because of a device that's known failing/failed), you better /hope/ your only remaining copy is good" problem of the present two-way-only solution, I could easily see people finding value in 4/5/6-way mirroring as well.

And of course if that is extended to raid10, three-way-mirroring with two-way-striping on six total devices would be my preference over the three-way-striped, two-way-mirrored layout that's the only current choice for six-device btrfs raid10.

--
Duncan
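The arithmetic behind the mirroring comparison above is simple enough to put in a few lines of Python. This is a sketch of the worst-case guarantee only, under the assumption that every block has exactly `copies` replicas on distinct devices. The function names are illustrative and nothing here reflects actual btrfs or md code.

```python
def mirror_tolerated_losses(copies):
    """Device losses guaranteed survivable with `copies` replicas per block."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return copies - 1


def raid10_layout(devices, copies):
    """Return (stripe_width, guaranteed_tolerated_losses) for striped mirrors."""
    if copies < 1 or devices < copies or devices % copies:
        raise ValueError("devices must be a positive multiple of copies")
    return devices // copies, copies - 1


if __name__ == "__main__":
    print(mirror_tolerated_losses(4))       # the 4-way md/raid1 case
    print(raid10_layout(6, 2))              # two-way-mirrored, three-way-striped
    print(raid10_layout(6, 3))              # three-way-mirrored, two-way-striped
```

On six devices the three-copy layout gives up one stripe of width in exchange for surviving any two device losses instead of any one.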
Re: btrfs: poor performance on deleting many large files
On Mon, Dec 14, 2015 at 7:24 AM, Austin S. Hemmelgarn wrote:
>
> If you have software that actually depends on atimes, then that
> software is broken (and yes, I even feel this way about Mutt). The way
> atimes are implemented on most systems breaks the semantics that almost
> everyone expects from them, because they get updated for anything that
> even looks sideways at the inode from across the room. Most software
> that uses them expects them to answer the question 'When were the
> contents of this file last read?', but they can get updated even for
> stuff like calculating file sizes, listing directory contents, or
> modifying the file's metadata.

This Jonathan Corbet article still applies:
http://lwn.net/Articles/397442/

What a mess!

Hey. The 5 year anniversary was in July. Wanna bring it up again, Austin? Haha.
http://thread.gmane.org/gmane.linux.kernel.cifs/294

Users want file creation time. Specifically, an immutable time for that file that persists across file system copies. The time of its first occurrence on a particular volume is not useful information. Getting that requires what seems to be an unlikely consensus.

--
Chris Murphy
Re: btrfs: poor performance on deleting many large files
On 2015-12-12 17:15, Christoph Anton Mitterer wrote:
> On Sat, 2015-11-28 at 06:49 +, Duncan wrote:
>> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
>> excerpted:
>>> Still, specifically for snapshots that's a bit unhandy, as one
>>> typically doesn't mount each of them... one rather mount e.g. the
>>> top level subvol and has a subdir snapshots there...
>>> So perhaps the idea of having snapshots that are per se noatime is
>>> still not too bad.
>> Read-only snapshots?
> So you basically mean that ro snapshots won't have their atime updated
> even without noatime?
> Well I guess that was anyway the recent behaviour of Linux filesystems,
> and only very old UNIX systems updated the atime even when the fs was
> set ro.

Unless things have changed very recently, even many modern systems update atime on read-only filesystems, unless the media itself is read-only. This is part of the reason for some of the forensics tools out there that drop write commands to the block devices connected to them.

>> That'd do it, and of course you can toggle the read-only property
>> (see btrfs property and its btrfs-property manpage).
> Sure, but then it would still be nice for rw snapshots.
> I guess what I probably actually want is the ability to set noatime as
> a property.
> I'll add that in a "feature request" on the project ideas wiki.
>> Alternatively, mount the toplevel subvol read-only or noatime on one
>> mountpoint, and bind-mount it read-write or whatever other appropriate
> Well it's of course somehow possible... but that seems a bit ugly to
> me... the best IMHO, would really be if one could set a property on
> snapshots that marks them noatime.

If you have software that actually depends on atimes, then that software is broken (and yes, I even feel this way about Mutt). The way atimes are implemented on most systems breaks the semantics that almost everyone expects from them, because they get updated for anything that even looks sideways at the inode from across the room. Most software that uses them expects them to answer the question 'When were the contents of this file last read?', but they can get updated even for stuff like calculating file sizes, listing directory contents, or modifying the file's metadata.
Re: btrfs: poor performance on deleting many large files
Le 14/12/2015 21:27, Austin S. Hemmelgarn a écrit : > AFAIUI, the _only_ reason that that is still the default is because of > Mutt, and that won't change as long as some of the kernel developers > are using Mutt for e-mail and the Mutt developers don't realize that > what they are doing is absolutely stupid. > Mutt is often used as an example but tmpwatch uses atime by default too and it's quite useful. If you have a local cache of remote files for which you want a good hit ratio and don't care too much about its exact size (you should have Nagios/Zabbix/... alerting you when a filesystem reaches a %free limit if you value your system's availability anyway), using tmpwatch with cron to maintain it is only one single line away and does the job. For an example of this particular case, on Gentoo the /usr/portage/distfiles directory is used in one of the tasks you can uncomment to activate in the cron.daily file provided when installing tmpwatch. Using tmpwatch/cron is far more convenient than using a dedicated cache (which might get tricky if the remote isn't HTTP-based, like an rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for example). Some http frameworks put sessions in /tmp: in this case if you want sessions to expire based on usage and not creation time, using tmpwatch or similar with atime is the only way to clean these files. This can even become a performance requirement: I've seen some servers slowing down with tens/hundreds of thousands of session files in /tmp because it was only cleaned at boot and the systems were almost never rebooted... I use noatime and nodiratime on some BTRFS filesystems for performance reasons: Ceph OSDs, heavily snapshotted first-level backup servers and filesystems dedicated to database server files (in addition to nodatacow) come to mind, but the cases where these options are really useful even with BTRFS doesn't seem to be the common ones. 
Finally Linus Torvalds has been quite vocal and consistent on the general subject of the kernel not breaking user-space APIs no matter what so I wouldn't have much hope for default kernel mount options changes...

Lionel
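As an illustration of the tmpwatch-style cache cleanup Lionel describes, the Python sketch below lists files whose access time is older than a given age. It is not tmpwatch and does not reproduce its options. `stale_files` is a hypothetical helper, and it only reports candidates rather than deleting anything.

```python
import os
import time


def stale_files(directory, max_age_hours, now=None):
    """Return sorted paths under directory not accessed for max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # lstat so that symlinks are judged by their own timestamps
            if os.lstat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Example: report anything in /tmp untouched for 30 days.
    for path in stale_files("/tmp", 30 * 24):
        print(path)
```

This also shows why the approach depends on atime updates: on a filesystem mounted noatime, the access time of a cache file never advances after it is created or explicitly set, so every file eventually looks stale regardless of how often it is read.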
Re: btrfs: poor performance on deleting many large files
On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
>> Unless things have changed very recently, even many modern systems
>> update atime on read-only filesystems, unless the media itself is
>> read-only.
> Seriously? Oh... *sigh*...
> You mean as in Linux, ext*, xfs?

Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD do it, but I'm not sure about Linux.

>> If you have software that actually depends on atimes, then that
>> software is broken (and yes, I even feel this way about Mutt).
> I don't disagree here :D
>
>> The way atimes are implemented on most systems breaks the semantics
>> that almost everyone expects from them, because they get updated for
>> anything that even looks sideways at the inode from across the room.
>> Most software that uses them expects them to answer the question
>> 'When were the contents of this file last read?', but they can get
>> updated even for stuff like calculating file sizes, listing directory
>> contents, or modifying the file's metadata.
> Sure... my point here again was, that I try to look every now and then
> at the whole thing from the pure-end-user side:
> For them, the default is relatime, and they likely may not want to
> change that because they have no clue on how much further effects this
> may have (or not).
> So as long as Linux doesn't change its defaults to noatime, leaving
> things up to broken software (i.e. to get fixed), I think it would be
> nice for the end-user, to have e.g. snapshots be "safe" (from the
> write-amplification on read) out of the box.

AFAIUI, the _only_ reason that that is still the default is because of Mutt, and that won't change as long as some of the kernel developers are using Mutt for e-mail and the Mutt developers don't realize that what they are doing is absolutely stupid. FWIW, both Duncan and I have our own copy of the sources patched to default to noatime, and I know a number of embedded Linux developers who do likewise, and I've even heard talk in the past of some distributions possibly using such patches themselves (although it always ends up not happening, because of Mutt).

> My idea would be basically, that having a noatime btrfs-property,
> which is perhaps even set automatically, would be an elegant way of
> doing that.
> I just haven't had time to properly write that up and add it as a
> "feature request" to the project ideas wiki page.

I like this idea.

> Cheers,
> Chris.
Re: btrfs: poor performance on deleting many large files
On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> Unless things have changed very recently, even many modern systems
> update atime on read-only filesystems, unless the media itself is
> read-only.
Seriously? Oh... *sigh*...
You mean as in Linux, ext*, xfs?

> If you have software that actually depends on atimes, then that
> software is broken (and yes, I even feel this way about Mutt).
I don't disagree here :D

> The way atimes are implemented on most systems breaks the semantics
> that almost everyone expects from them, because they get updated for
> anything that even looks sideways at the inode from across the room.
> Most software that uses them expects them to answer the question
> 'When were the contents of this file last read?', but they can get
> updated even for stuff like calculating file sizes, listing directory
> contents, or modifying the file's metadata.
Sure... my point here again was, that I try to look every now and then at the whole thing from the pure-end-user side:
For them, the default is relatime, and they likely may not want to change that because they have no clue on how much further effects this may have (or not).
So as long as Linux doesn't change its defaults to noatime, leaving things up to broken software (i.e. to get fixed), I think it would be nice for the end-user, to have e.g. snapshots be "safe" (from the write-amplification on read) out of the box.

My idea would be basically, that having a noatime btrfs-property, which is perhaps even set automatically, would be an elegant way of doing that.
I just haven't had time to properly write that up and add it as a "feature request" to the project ideas wiki page.

Cheers,
Chris.
Re: btrfs: poor performance on deleting many large files
On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> > > Unless things have changed very recently, even many modern
> > > systems update atime on read-only filesystems, unless the media
> > > itself is read-only.
> > Seriously? Oh... *sigh*...
> > You mean as in Linux, ext*, xfs?
> Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD
> do it, but I'm not sure about Linux.

I've just checked it via a loopback image and strictatime:
- ro snapshot doesn't get atimes updated
- rw snapshot does get atimes updated
- ro mounted fs (top level subvol) doesn't get atimes updated (neither
  in subvols)
- rw mounted fs (top level subvol) does get atimes updated

Cheers,
Chris.
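
A check like the one reported above can be scripted. The following is only a minimal sketch, not the procedure actually used in the test: the function name and the `settle` parameter are made up for illustration, and it assumes the file is non-empty and that the filesystem is mounted with strictatime (under relatime an unchanged atime does not prove much).

```python
import os
import time


def read_updates_atime(path, settle=0.05):
    """Return True if reading `path` changed its access time.

    Hypothetical helper: it only compares st_atime before and after a
    read, so it works on read-only mounts and snapshots as well.
    """
    before = os.stat(path).st_atime_ns
    # Wait briefly so an update cannot land on the same timestamp tick.
    time.sleep(settle)
    with open(path, "rb") as handle:
        handle.read(1)
    after = os.stat(path).st_atime_ns
    return after != before
```

Running it against a file inside an ro snapshot and then inside an rw snapshot of the same subvolume would be one way to repeat the comparison in the list above.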
project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files)
Just FYI:

On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> > My idea would be basically, that having a noatime btrfs-property,
> > which is perhaps even set automatically, would be an elegant way of
> > doing that.
> > I just haven't had time to properly write that up and add it as a
> > "feature request" to the project ideas wiki page.
> I like this idea.

I've just compiled some thoughts and ideas into:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-object_default_mount-options_.2F_btrfs-properties_.2F_chattr.281.29_attributes_and_reasonable_userland_defaults

As usual, this is mostly from my admin/end-user side, i.e. what I could
imagine would ease the maintenance of large/complex (in terms of
subvols, nesting, snapshots) btrfs filesystems...

And of course, any developer or more expert user than me is happily
invited to comment on or remove any (possibly stupid) ideas of mine
therein, or summon the inquisition for my heresy ;)

Cheers,
Chris.
Re: btrfs: poor performance on deleting many large files
On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> Mutt is often used as an example but tmpwatch uses atime by default
> too and it's quite useful.

Hmm, one could probably argue that these few cases justify the use of
separate filesystems (or btrfs subvols ;) ), so that the majority could
benefit from noatime.

> If you have a local cache of remote files for which you want a good
> hit ratio and don't care too much about its exact size (you should
> have Nagios/Zabbix/... alerting you when a filesystem reaches a %free
> limit if you value your system's availability anyway), using tmpwatch
> with cron to maintain it is only one single line away and does the
> job. For an example of this particular case, on Gentoo the
> /usr/portage/distfiles directory is used in one of the tasks you can
> uncomment to activate in the cron.daily file provided when installing
> tmpwatch.
> Using tmpwatch/cron is far more convenient than using a dedicated
> cache (which might get tricky if the remote isn't HTTP-based, like an
> rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for
> example).
> Some http frameworks put sessions in /tmp: in this case if you want
> sessions to expire based on usage and not creation time, using
> tmpwatch or similar with atime is the only way to clean these files.
> This can even become a performance requirement: I've seen some
> servers slowing down with tens/hundreds of thousands of session files
> in /tmp because it was only cleaned at boot and the systems were
> almost never rebooted...

Okay, there are probably some use cases... The session cleaning,
however, I'd rather consider a bug in the respective software,
especially if it really depends on it to expire the session (what if
for some reason tmpwatch gets broken, uninstalled, etc.?)

> I use noatime and nodiratime

FYI: noatime implies nodiratime :-)

> Finally Linus Torvalds has been quite vocal and consistent on the
> general subject of the kernel not breaking user-space APIs no matter
> what so I wouldn't have much hope for default kernel mount options
> changes...

He surely is right in general... but when the point has been reached
where only a minority actually requires the feature... and the minority
actually starts to suffer from that... it may change.

Cheers,
Chris.
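
To make the atime dependency discussed above concrete, here is a minimal sketch of an atime-based sweep in the spirit of the tmpwatch use case. It is not tmpwatch's actual implementation; the function name and parameters are hypothetical. It only lists candidates and deletes nothing.

```python
import os
import stat
import time


def stale_files(root, max_idle_seconds, now=None):
    """Yield regular files under `root` not accessed for `max_idle_seconds`.

    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
            if stat.S_ISREG(info.st_mode) and now - info.st_atime > max_idle_seconds:
                yield path
```

On a filesystem mounted noatime the recorded atimes stop advancing on reads, so a sweep like this would eventually treat files that are still in use as stale, which is why such tools are a counter-argument to a noatime default.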
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On Wed, 2015-12-09 at 13:36 +, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be skeptical.
> In general, btrfs is cowed, and that's the focus. To the extent that
> nocow is necessary for fragmentation/performance reasons, etc, the
> idea is to try to make cow work better in those cases, for example by
> working on autodefrag to make it better at handling large files
> without the scaling issues it currently has above half a gig or so,
> and thus to confine nocow to a smaller and smaller niche use-case,
> rather than focusing on making nocow better.
>
> Of course it remains to be seen how much better they can do with
> autodefrag, etc, but at this point, there's way more project
> possibilities than people to develop them, so even if they do find
> they can't make cow work much better for these cases, actually
> working on nocow would still be rather far down the list, because
> there's so many other improvement and feature opportunities that will
> get the focus first. Which in practice probably puts it in "it'd be
> nice, but it's low enough priority that we're talking five years out
> or more, unless of course someone else qualified steps up and that's
> their personal itch they want to scratch", territory.

I guess I'll split out my answer on that into a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there
:-)

I think however, again with my naive understanding of how CoW works and
what it inherently implies, that there cannot be a really good solution
to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for nodatacow
as well, even if it's not perfect, is definitely worth it.

> As for the updated checksum after modification, the problem with that
> is that in the mean time, the checksum wouldn't verify,

Well, one could implement some locking... but I don't see the general
problem here... if the block is still being written (and I count
updating the meta-data, including the checksum, as part of that) it
cannot be read anyway, can it? It may be only half written and the data
returned would be garbage.

> and while btrfs could of course keep status in memory during normal
> operations, that's not the problem, the problem is what happens if
> there's a crash and in-memory state vaporizes. In that case, when
> btrfs remounted, it'd have no way of knowing why the checksum didn't
> match, just that it didn't, and would then refuse access to that
> block in the file, because for all it knows, it /is/ a block error.

And this would only happen in the rare cases that anything crashes,
where it's anyway quite likely that this no-CoWed block will be
garbage.

I'll talk about that more in the separate thread... so let's move
things there.

> Same here. In fact, my most anticipated feature is N-way-mirroring,

Hmm... not totally sure about that...
AFAIU, N-way mirroring is what the currently wrongly named RAID1 in
btrfs is, i.e. having N replicas of everything on M devices, right?
In other words, not being an N-parity RAID and not guaranteeing that
*any* N disks could fail, right?

Hmm, I guess that would definitely be nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably rather important for those scenarios where either
resilience matters a lot... and/or those where write speed doesn't but
read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run... that would be rather uninteresting.
We typically have storage nodes (and many of them) of say 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah I know, RAID5 is faster, but
even with hot spares in place, in practice it led too often to lost
RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have an
N-parity RAID (where any N disks can fail)... of course performance
considerations may kill that desire again ;)

> It is a big and basic feature, but turning it off isn't the end of
> the world, because then it's still the same level of reliability
> other solutions such as raid generally provide.

Sure... I never meant it as "loss compared to what we already have in
other systems"... but as "loss compared to how awesome[0] btrfs could
be ;-)"

> But as it happens, both VM image management and databases tend to
> come with their own integrity management, in part precisely because
> the filesystem could never provide that sort of service.

Well, that's only partially true, to my knowledge.
a) I wouldn't know that hypervisors do that at all.
b) DBs have of course their journal, but that protects only against
   crashes... not against bad blocks, nor does it help you to decide
   which block is good when you have multiple.

> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Two more on these:

On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
> After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.

AFAIU, the one that gets fragmented then is the snapshot, right, and
the "original" will stay in place where it was? (Which is of course
good, because one probably marked it nodatacow to avoid that
fragmentation problem on internal writes.)

I'd assume the same happens when I do a reflink cp.

Can one make a copy where one still has atomicity (which I guess
implies CoW) but where the destination file isn't heavily fragmented
afterwards... i.e. there's some pre-allocation, and then cp really does
copy each block (just everything's at the state of the time when I
started cp, not including any other internal changes made on the source
in between)?

And one more:
You both said auto-defrag is generally recommended. Does that also
apply to SSDs (where we want to avoid unnecessary writes)?
It does seem to get enabled when SSD mode is detected.
What would it actually do on an SSD?

Cheers,
Chris.
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Sat, 12 Dec 2015 23:15:38 +0100 as
excerpted:

> On Sat, 2015-11-28 at 06:49 +, Duncan wrote:
>> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
>> excerpted:
>> > Still, specifically for snapshots that's a bit unhandy, as one
>> > typically doesn't mount each of them... one rather mounts e.g. the
>> > top level subvol and has a subdir snapshots there...
>> > So perhaps the idea of having snapshots that are per se noatime is
>> > still not too bad.
>> Read-only snapshots?
> So you basically mean that ro snapshots won't have their atime updated
> even without noatime?
> Well I guess that was anyway the recent behaviour of Linux filesystems,
> and only very old UNIX systems updated the atime even when the fs was
> set ro.

I'd test it to be sure before relying on it (keeping in mind that my
own use-case doesn't include subvolumes/snapshots so it's quite possible
I could get fine details of this nature wrong), but that would be my
very (_very_! see next) strong assumption, yes.

Because read-only snapshots are used for btrfs-send among other things,
with the idea being that the read-only will keep them from changing in
the middle of the send, and ro snapshot atime updates would seem to
throw that entirely out the window. So I can't imagine ro snapshots
doing atime updates under any circumstance because I just can't see how
send could rely on them then, but I'd still test it before counting on
it.

>> That'd do it, and of course you can toggle the read-only property
>> (see btrfs property and its btrfs-property manpage).
> Sure, but then it would still be nice for rw snapshots.
>
> I guess what I probably actually want is the ability to set noatime as
> a property.
> I'll add that in a "feature request" on the project ideas wiki.

AFAIK, the general idea was to eventually have all the (possible, some
are global-filesystem-scope) subvolume mount options exposed as
properties, it's just not implemented yet, but I'm not entirely sure if
that was all /btrfs-specific/ mount options, or included the generic
ones such as the *atime and no* (noexec/nodev/...) options as well.

In view of that and the fact that noatime is generic, adding it as a
specific request still makes sense. Someone with more specific
knowledge on the current plan can remove it if it's already covered.

>> Alternatively, mount the toplevel subvol read-only or noatime on one
>> mountpoint, and bind-mount it read-write or whatever other appropriate
> Well it's of course somehow possible... but that seems a bit ugly to
> me... the best IMHO, would really be if one could set a property on
> snapshots that marks them noatime.

Yes. Possible is good, but "just works", as one would hope the
properties solution to eventually be, is still better than "possible by
jumping thru mount-bind hoops", the current "possibility method". =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs: poor performance on deleting many large files
On Sat, 2015-11-28 at 06:49 +, Duncan wrote:
> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
> excerpted:
> > Still, specifically for snapshots that's a bit unhandy, as one
> > typically doesn't mount each of them... one rather mounts e.g. the
> > top level subvol and has a subdir snapshots there...
> > So perhaps the idea of having snapshots that are per se noatime is
> > still not too bad.
> Read-only snapshots?

So you basically mean that ro snapshots won't have their atime updated
even without noatime?
Well I guess that was anyway the recent behaviour of Linux filesystems,
and only very old UNIX systems updated the atime even when the fs was
set ro.

> That'd do it, and of course you can toggle the read-only property
> (see btrfs property and its btrfs-property manpage).

Sure, but then it would still be nice for rw snapshots.

I guess what I probably actually want is the ability to set noatime as
a property.
I'll add that in a "feature request" on the project ideas wiki.

> Alternatively, mount the toplevel subvol read-only or noatime on one
> mountpoint, and bind-mount it read-write or whatever other
> appropriate

Well it's of course somehow possible... but that seems a bit ugly to
me... the best, IMHO, would really be if one could set a property on
snapshots that marks them noatime.

Cheers,
Chris.
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 09 Dec 2015 06:43:01 +0100 as
excerpted:

> Hey Hugo,
>
> On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
>
>> The issue is that nodatacow bypasses the transactional nature of
>> the FS, making changes to live data immediately. This then means that
>> if you modify a nodatacow file, the csum for that modified section is
>> out of date, and won't be back in sync again until the latest
>> transaction is committed. So you can end up with an inconsistent
>> filesystem if there's a crash between the two events.
> Sure,... (and btw: is there some kind of journal planned for
> nodatacow'ed files?),... but why not simply trying to write an updated
> checksum after the modified section has been flushed to disk... of
> course there's no guarantee that both are consistent in case of crash
> (but that's also the case without any checksum)... but at least one
> would have the csum protection against everything else (blockerrors
> and that like) in case no crash occurs?

Answering the BTW first, not to my knowledge, and I'd be skeptical. In
general, btrfs is cowed, and that's the focus. To the extent that nocow
is necessary for fragmentation/performance reasons, etc, the idea is to
try to make cow work better in those cases, for example by working on
autodefrag to make it better at handling large files without the scaling
issues it currently has above half a gig or so, and thus to confine
nocow to a smaller and smaller niche use-case, rather than focusing on
making nocow better.

Of course it remains to be seen how much better they can do with
autodefrag, etc, but at this point, there's way more project
possibilities than people to develop them, so even if they do find they
can't make cow work much better for these cases, actually working on
nocow would still be rather far down the list, because there's so many
other improvement and feature opportunities that will get the focus
first. Which in practice probably puts it in "it'd be nice, but it's low
enough priority that we're talking five years out or more, unless of
course someone else qualified steps up and that's their personal itch
they want to scratch", territory.

As for the updated checksum after modification, the problem with that is
that in the mean time, the checksum wouldn't verify, and while btrfs
could of course keep status in memory during normal operations, that's
not the problem, the problem is what happens if there's a crash and
in-memory state vaporizes. In that case, when btrfs remounted, it'd have
no way of knowing why the checksum didn't match, just that it didn't,
and would then refuse access to that block in the file, because for all
it knows, it /is/ a block error.

And there's already a mechanism for telling btrfs to ignore checksums,
and nocow already activates it, so... there's really nothing more to be
done.

>> > For me the checksumming is actually the most important part of btrfs
>> > (not that I wouldn't like its other features as well)... so turning
>> > it off is something I really would want to avoid.

Same here. In fact, my most anticipated feature is N-way-mirroring,
since that will allow three copies (or more, but three is my sweet spot
balance between the space and reliability factors) instead of the
current limit of two. It just disturbs me that in the event of one copy
being bad, the other copy /better/ be good, because there's no further
fallback! With a third copy, there'd be that one further fallback, and
the chances of all three copies failing checksum verification are remote
enough I'm willing to risk it, given the incremental cost of additional
copies.

>> > Plus it opens questions like: When there are no checksums, how can it
>> > (in the RAID cases) decide which block is the good one in case of
>> > corruptions?
>> It doesn't decide -- both copies look equally good, because
>> there's no checksum, so if you read the data, the FS will return
>> whatever data was on the copy it happened to pick.
> Hmm I see... so one gets basically the behaviour of RAID.
> Isn't that kind of a big loss? I always considered the guarantee against
> block errors and that like one of the big and basic features of btrfs.

It is a big and basic feature, but turning it off isn't the end of the
world, because then it's still the same level of reliability other
solutions such as raid generally provide.

And the choice to turn it off is just that, a choice, tho it's currently
the recommended one in some cases, such as with large VM images, etc.

But as it happens, both VM image management and databases tend to come
with their own integrity management, in part precisely because the
filesystem could never provide that sort of service. So to the extent
that btrfs must turn off its integrity management features when dealing
with that sort of file, it's no bigger deal than it would be on any
other filesystem, it's simply returning what's normally a
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Wed, 09 Dec 2015 06:45:47 +0100 as
excerpted:

> On 2015-11-27 00:08, Duncan wrote:
>> Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
>> excerpted:
>>> 1) AFAIU, the fragmentation problem exists especially for those files
>>> that see many random writes, especially, but not limited to, big
>>> files. Now that databases and VMs are affected by this, is probably
>>> broadly known in the meantime (well at least by people on that list).
>>> But I'd guess there are n other cases where such IO patterns can
>>> happen which one simply never notices, while the btrfs continues to
>>> degrade.
>>
>> The two other known cases are:
>>
>> 1) Bittorrent download files, where the full file size is preallocated
>> (and I think fsynced), then the torrent client downloads into it a
>> chunk at a time.
> Okay, sounds obvious.
>
>> The more general case would be any time a file of some size is
>> preallocated and then written into more or less randomly, the problem
>> being the preallocation, which on traditional rewrite-in-place
>> filesystems helps avoid fragmentation (as well as ensuring space to
>> save the full file), but on COW-based filesystems like btrfs, triggers
>> exactly the fragmentation it was trying to avoid.
> Is it really just the case when the file storage *is* actually fully
> pre-allocated?
> Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g.
> qcow2, or raw images when these are sparse files).
> Or is it rather any case where, in larger file, many random (file
> internal) writes occur?

It's the second case, or rather, the reverse of the first case, since
preallocation and fsync, then write into it, is one specific subset case
of the broader case of random rewrites into existing files. VM images
and database files are two other specific subset cases of the same
broader case superset.

>> arranging to have the client write into a dir with the nocow attribute
>> set, so newly created torrent files inherit it and do rewrite-in-place,
>> is highly recommended.
> At the IMHO pretty high expense of losing the checksumming :-(
> Basically losing half of the main functionalities that make btrfs
> interesting for me.

But... as I've pointed out in other replies, in many cases including
this specific one (bittorrent), applications have already had to develop
their own integrity management features, because other filesystems
didn't supply them and the apps simply didn't work reliably without
those features.

In the bittorrent case specifically, torrent chunks are already
checksummed, and if they don't verify upon download, the chunk is thrown
away and redownloaded.

And after the download is complete and the file isn't being constantly
rewritten, it's perfectly fine to copy it elsewhere, into a dir where
nocow doesn't apply. With the copy, btrfs will create checksums, and if
you're paranoid you can hashcheck the original nocow copy against the
new checksummed/cow copy, and after that, any on-media changes will be
caught by the normal checksum verification mechanisms.

Further, at least some bittorrent clients make preallocation an option.
Here, on btrfs I'd simply turn off that option, rather than bothering
with nocow in the first place. That should already reduce fragmentation
significantly due to the 30-second by default commit frequency, tho
there will likely still be some fragmentation due to the out-of-order
downloading. But either autodefrag or the previously mentioned
post-download recopy should deal with that.

> For databases, will e.g. the vacuuming maintenance tasks solve the
> fragmentation issues (cause I guess at least when doing full vacuuming,
> it will rewrite the files).

If it does full rewrite, it should, provided the freespace itself isn't
so fragmented that it's impossible to find sufficiently large extents to
avoid fragmentation. Of course there's also autodefrag, if the database
isn't so busy and/or the database files are small enough that the
defragging rewrites don't trigger bottlenecking, the primary downside
risk with autodefrag.

>> The problem is much reduced in newer systemd, which is btrfs aware and
>> in fact uses btrfs-specific features such as subvolumes in a number of
>> cases (creating subvolumes rather than directories where it makes sense
>> in some shipped tmpfiles.d config files, for instance), if it's running
>> on btrfs.
> Hmm doesn't seem really good to me if systemd would do that, cause it
> then excludes any such files from being snapshot.

Of course if the directories are already present due to systemd
upgrading from non-btrfs-aware versions, they'll remain as normal dirs,
not subvolumes. This is the case here.

And of course you can switch them around to dirs if you like, and/or
override the shipped tmpfiles.d config with your own.

Meanwhile, distros that both ship systemd and offer btrfs as a
filesystem option (or use it by default), should
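
The "hashcheck the original nocow copy against the new checksummed/cow copy" step mentioned in the message above could look roughly like the following. This is only a sketch under assumptions: SHA-256 is an arbitrary choice, and the function names are made up for illustration rather than taken from any btrfs or torrent tool.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """Compare the nocow original with its copy (hypothetical helper)."""
    return file_digest(original) == file_digest(copy)
```

A matching digest only says that the two copies agree at the time of the check; corruption that happened to the nocow original before the copy was made would be carried over unnoticed, which is the gap the torrent client's own chunk verification is meant to cover.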
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Hey Hugo,

On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
> Answering the second part first, no, it can't.

Thanks so far :)

> The issue is that nodatacow bypasses the transactional nature of
> the FS, making changes to live data immediately. This then means that
> if you modify a nodatacow file, the csum for that modified section is
> out of date, and won't be back in sync again until the latest
> transaction is committed. So you can end up with an inconsistent
> filesystem if there's a crash between the two events.

Sure... (and btw: is there some kind of journal planned for
nodatacow'ed files?)... but why not simply try to write an updated
checksum after the modified section has been flushed to disk... of
course there's no guarantee that both are consistent in case of a crash
(but that's also the case without any checksum)... but at least one
would have the csum protection against everything else (block errors
and the like) in case no crash occurs?

> > For me the checksumming is actually the most important part of
> > btrfs (not that I wouldn't like its other features as well)... so
> > turning it off is something I really would want to avoid.
> >
> > Plus it opens questions like: When there are no checksums, how can
> > it (in the RAID cases) decide which block is the good one in case
> > of corruptions?
> It doesn't decide -- both copies look equally good, because there's
> no checksum, so if you read the data, the FS will return whatever
> data was on the copy it happened to pick.

Hmm I see... so one gets basically the behaviour of RAID.
Isn't that kind of a big loss? I always considered the guarantee
against block errors and the like one of the big and basic features of
btrfs.

It seems that for certain (not too unimportant) cases, DBs and VMs, one
has to choose between two evils: losing the guaranteed consistency via
checksums... or basically running into severe troubles (like Mitch's
reported fragmentation issues).

> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>
> After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.

I see... something that should possibly go into some advanced admin
documentation (if not already in).
It means basically that one must ensure that any such files (VM images,
DB data dirs) are already created with nodatacow (perhaps on a
subvolume which is mounted as such).

> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
> > defrag) isn't ref-link aware...
> > Isn't that somehow a complete showstopper?
> It is, but the one attempt at dealing with it caused massive data
> corruption, and it was turned off again.

So... does this mean that it's still planned to be implemented some day
or has it been given up forever?
And is it (hopefully) also planned to be implemented for reflinks when
compression is added/changed/removed?

Given that you (or Duncan?... sorry, I sometimes mix up which of you
said exactly what, since both of you are notoriously helpful :-) )
mentioned that autodefrag basically fails with larger files... and
given that it seems to be quite important for btrfs to not be
fragmented too heavily, it sounds a bit as if anything that uses
(multiple) reflinks (e.g. snapshots) cannot really be used very well.

> autodefrag, however, has always been snapshot aware and snapshot
> safe, and would be the recommended approach here.

Ahhh... so autodefrag *is* snapshot aware, and that's basically why the
suggestion is (AFAIU) that it's turned on, right?
So, I'm afraid O:-), that triggers a follow-up question:
Why isn't it the default? Or in other words, what are its drawbacks
(e.g. other cases where ref-links would be broken up... or issues with
compression)?

And also, when I now activate it on an already populated fs, will it
also defrag any old files (even if they're not rewritten or so)?

I tried to look for some general (rather "for dummies" than for core
developers) description of how defrag and autodefrag work... but
couldn't find anything in the usual places... :-(

btw: The wiki
(https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_I_defragment_many_files.3F)
doesn't mention that auto-defrag doesn't suffer from that problem.

> (Actually, it was broken in the same incident I just described -- but
> fixed again when the broken patches were reverted).

So it just couldn't be fixed (hopefully: yet) for the (manual) online
defragmentation?!

> > 5) Especially keeping (4) in mind but also the other comments in
> > from Duncan and Austin...
> > Is auto-defrag now recommended to be generally used?
>
> Absolutely, yes.

I see... well, I'll probably wait
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On 2015-11-27 00:08, Duncan wrote: > Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as > excerpted: >> 1) AFAIU, the fragmentation problem exists especially for those files >> that see many random writes, especially, but not limited to, big files. >> Now that databases and VMs are affected by this, is probably broadly >> known in the meantime (well at least by people on that list). >> But I'd guess there are n other cases where such IO patterns can happen >> which one simply never notices, while the btrfs continues to degrade. > > The two other known cases are: > > 1) Bittorrent download files, where the full file size is preallocated > (and I think fsynced), then the torrent client downloads into it a chunk > at a time. Okay, sounds obvious. > The more general case would be any time a file of some size is > preallocated and then written into more or less randomly, the problem > being the preallocation, which on traditional rewrite-in-place > filesystems helps avoid fragmentation (as well as ensuring space to save > the full file), but on COW-based filesystems like btrfs, triggers exactly > the fragmentation it was trying to avoid. Is it really just the case when the file storage *is* actually fully pre-allocated? Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g. qcow2, or raw images when these are sparse files). Or is it rather any case where, in larger file, many random (file internal) writes occur? > arranging to > have the client write into a dir with the nocow attribute set, so newly > created torrent files inherit it and do rewrite-in-place, is highly > recommended. At the IMHO pretty high expense of loosing the checksumming :-( Basically loosing half of the main functionalities that make btrfs interesting for me. > It's also worth noting that once the download is complete, the files > aren't going to be rewritten any further, and thus can be moved out of > the nocow-set download dir and treated normally. Sure... 
but this requires manual intervention.

For databases, will e.g. the vacuuming maintenance tasks solve the fragmentation issues? (I guess that at least a full vacuum will rewrite the files.)

> The problem is much reduced in newer systemd, which is btrfs aware and in
> fact uses btrfs-specific features such as subvolumes in a number of cases
> (creating subvolumes rather than directories where it makes sense in some
> shipped tmpfiles.d config files, for instance), if it's running on
> btrfs.

Hmm, it doesn't seem really good to me if systemd does that, because it then excludes any such files from being snapshotted.

> For the journal, I /think/ (see the next paragraph) that it now
> sets the journal files nocow, and puts them in a dedicated subvolume so
> snapshots of the parent won't snapshot the journals, thereby helping to
> avoid the snapshot-triggered cow1 issue.

The same here, kinda disturbing if systemd decides that on its own, i.e. excluding files from being checksum protected...

>> So is there any general approach towards this?
> The general case is that for normal desktop users, it doesn't tend to be
> a problem, as they don't do either large VMs or large databases,

Well, that depends a bit on how one defines the "normal desktop user",... for e.g. developers or more "power users" it's probably not so unlikely that they do run local VMs for testing or whatever.

> and
> small ones such as the sqlite files generated by firefox and various
> email clients are handled quite well by autodefrag, with that general
> desktop usage being its primary target.

Which is however not yet the default...
> For server usage and the more technically inclined workstation users who > are running VMs and larger databases, the general feeling seems to be > that those adminning such systems are, or should be, technically inclined > enough to do their research and know when measures such as nocow and > limited snapshotting along with manual defrags where necessary, are > called for. mhh... well it's perhaps simple to expect that knowledge for few things like VMs, DBs and that like... but there are countless of software systems, many of them being more or less like a black box, at least with respect to their internals. It feels a bit, if there should be some tools provided by btrfs, which tell the users which files are likely problematic and should be nodatacow'ed > And if they don't originally, they find out when they start > researching why performance isn't what they expected and what to do about > it. =:^) Which can take quite a while to be found out... >> And what are the actual possible consequences? Is it just that fs gets >> slower (due to the fragmentation) or may I even run into other issues to >> the point the space is eaten up or the fs becomes basically unusable? > It's primarily a performance issue, tho in severe cases it can also be a > scaling issue, to the point that maintenance tasks such
Re: btrfs: poor performance on deleting many large files
On Fri, 2015-11-27 at 03:38 +, Duncan wrote:
> AFAIK, per-subvolume *atime mounts should already be working.

Ah, I see. :)

Still, specifically for snapshots that's a bit unwieldy, as one typically doesn't mount each of them... one rather mounts e.g. the top-level subvol and has a snapshots subdir there...
So perhaps the idea of having snapshots that are per se noatime is still not too bad.

Cheers,
Chris

smime.p7s Description: S/MIME cryptographic signature
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as excerpted:

> On Fri, 2015-11-27 at 03:38 +, Duncan wrote:
>> AFAIK, per-subvolume *atime mounts should already be working.
> Ah I see. :)
>
> Still, specifically for snapshots that's a bit unhandy, as one typically
> doesn't mount each of them... one rather mount e.g. the top level subvol
> and has a subdir snapshots there...
> So perhaps the idea of having snapshots that are per se noatime is still
> not too bad.

Read-only snapshots? That'd do it, and of course you can toggle the read-only property (see btrfs property and its btrfs-property manpage).

Alternatively, mount the toplevel subvol read-only or noatime on one mountpoint, and bind-mount it read-write or with whatever other *atime option is appropriate elsewhere (or the reverse, if more appropriate). Then use the noatime or read-only one unless you specifically want atimes updated.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
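Toggling the read-only property mentioned above could be scripted roughly as follows. This is only a sketch: the helper names and the snapshot path are hypothetical, the command form is the `btrfs property set -ts <path> ro <true|false>` usage from btrfs-progs, and by default nothing is executed (a real run needs root and an actual btrfs subvolume).

```python
import subprocess
from typing import List


def snapshot_ro_cmd(snapshot_path: str, read_only: bool) -> List[str]:
    """Build the btrfs-progs argv that sets a subvolume's 'ro' property."""
    value = "true" if read_only else "false"
    # -ts: treat the object as a subvolume; "ro" is the property name.
    return ["btrfs", "property", "set", "-ts", snapshot_path, "ro", value]


def set_snapshot_ro(snapshot_path: str, read_only: bool,
                    dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = snapshot_ro_cmd(snapshot_path, read_only)
    if not dry_run:
        # Needs root privileges and a real btrfs subvolume at snapshot_path.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical snapshot path, printed as a dry run only.
print(" ".join(set_snapshot_ro("/mnt/top/snapshots/2015-11-27", True)))
```

A read-only snapshot cannot have its atimes updated, which is why this gives the "noatime per snapshot" effect without mounting each snapshot separately.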
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Fri, 27 Nov 2015 01:06:45 +0100 as excerpted:

> And additionally, allow people to mount subvols with different
> noatime/relatime/atime settings (unless that's already working)... that
> way, they could enable it for things where they want/need it,... and
> disable it where not.

AFAIK, per-subvolume *atime mounts should already be working. The *atime mount options are filesystem-generic (aka Linux vfs level), and while my own use-case doesn't involve subvolumes, the wiki says they should be working (wrapped link I'm not bothering to jump thru the hoops to properly unwrap):

https://btrfs.wiki.kernel.org/index.php/FAQ
#Can_I_mount_subvolumes_with_different_mount_options.3F

So while personally untested, per-subvolume *atime mount options /should/ "just work".

Meanwhile, I've simply grown to hate atime as an inefficient and mostly useless drain on resources, so I pretty much just noatime everything, which is why I decided to bother patching my kernel to make that the default, instead of having yet another option that I use everywhere anyway clogging up the options field in my fstab.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
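Since the per-subvolume *atime behaviour is described above as untested, one way to check what is actually in effect is to read the mount options back from /proc/mounts. The sketch below assumes the usual six-field /proc/mounts layout (device, mountpoint, fstype, options, dump, pass); the function name and the sample lines are made up for illustration.

```python
from typing import Dict

ATIME_OPTIONS = ("noatime", "relatime", "strictatime")

# Illustrative sample only; on a real system read /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda2 / btrfs rw,noatime,ssd,subvolid=257,subvol=/root 0 0
/dev/sda2 /home btrfs rw,relatime,ssd,subvolid=258,subvol=/home 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
"""


def btrfs_atime_policies(mounts_text: str) -> Dict[str, str]:
    """Map each btrfs mountpoint to the *atime option listed for it."""
    policies = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue  # skip malformed lines and non-btrfs filesystems
        mountpoint, options = fields[1], fields[3].split(",")
        # Report the first recognised option, or "unspecified" if none is listed.
        policies[mountpoint] = next(
            (opt for opt in ATIME_OPTIONS if opt in options), "unspecified")
    return policies


for mountpoint, policy in btrfs_atime_policies(SAMPLE_MOUNTS).items():
    print(f"{mountpoint}: {policy}")
```

On a live system the same function can be fed the contents of /proc/mounts; if two subvolumes of one filesystem show different options there, the per-mount settings took effect.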
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as excerpted: > Hey. > > I've worried before about the topics Mitch has raised. > Some questions. > > 1) AFAIU, the fragmentation problem exists especially for those files > that see many random writes, especially, but not limited to, big files. > Now that databases and VMs are affected by this, is probably broadly > known in the meantime (well at least by people on that list). > But I'd guess there are n other cases where such IO patterns can happen > which one simply never notices, while the btrfs continues to degrade. The two other known cases are: 1) Bittorrent download files, where the full file size is preallocated (and I think fsynced), then the torrent client downloads into it a chunk at a time. The more general case would be any time a file of some size is preallocated and then written into more or less randomly, the problem being the preallocation, which on traditional rewrite-in-place filesystems helps avoid fragmentation (as well as ensuring space to save the full file), but on COW-based filesystems like btrfs, triggers exactly the fragmentation it was trying to avoid. At least some torrent clients (ktorrent at least) have an option to turn off that preallocation, however, and that would be recommended where possible. Where disabling the preallocation isn't possible, arranging to have the client write into a dir with the nocow attribute set, so newly created torrent files inherit it and do rewrite-in-place, is highly recommended. It's also worth noting that once the download is complete, the files aren't going to be rewritten any further, and thus can be moved out of the nocow-set download dir and treated normally. For those who will continue to seed the files for some time, this could be done, provided the client can seed from a directory different than the download dir. 
2) As a subcase of the database file case that people may not think about, systemd journal files are known to have had the internal-rewrite- pattern problem in the past. Apparently, while they're mostly append- only in general, they do have an index at the beginning of the file that gets rewritten quite a bit. The problem is much reduced in newer systemd, which is btrfs aware and in fact uses btrfs-specific features such as subvolumes in a number of cases (creating subvolumes rather than directories where it makes sense in some shipped tmpfiles.d config files, for instance), if it's running on btrfs. For the journal, I /think/ (see the next paragraph) that it now sets the journal files nocow, and puts them in a dedicated subvolume so snapshots of the parent won't snapshot the journals, thereby helping to avoid the snapshot-triggered cow1 issue. On my own systems, however, I've configured journald to only use the volatile tmpfs journals in /run, not the permanent /var location, tweaking the size of the tmpfs mounted on /run and the journald config so it normally stores a full boot session, but of course doesn't store journals from previous sessions as they're wiped along with the tmpfs at reboot. I run syslog-ng as well, configured to work with journald, and thus have its more traditional append-only plain-text syslogs for previous boot sessions. For my usage that actually seems the best of both worlds as I get journald benefits such as service status reports showing the last 10 log entries for that service, etc, with those benefits mostly applying to the current session only, while I still have the traditional plain-text greppable, etc, syslogs, from both the current and previous sessions, back as far as my log rotation policy keeps them. It also keeps the journals entirely off of btrfs, so that's one particular problem I don't have to worry about at all, the reason I'm a bit fuzzy on the exact details of systemd's solution to the journal on btrfs issue. 
> So is there any general approach towards this? The general case is that for normal desktop users, it doesn't tend to be a problem, as they don't do either large VMs or large databases, and small ones such as the sqlite files generated by firefox and various email clients are handled quite well by autodefrag, with that general desktop usage being its primary target. For server usage and the more technically inclined workstation users who are running VMs and larger databases, the general feeling seems to be that those adminning such systems are, or should be, technically inclined enough to do their research and know when measures such as nocow and limited snapshotting along with manual defrags where necessary, are called for. And if they don't originally, they find out when they start researching why performance isn't what they expected and what to do about it. =:^) > And what are the actual possible consequences? Is it just that fs gets > slower (due to the fragmentation) or may I even run into other issues to > the point the space is
Re: btrfs: poor performance on deleting many large files
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 19:25:47 +0100 as excerpted: > On Thu, 2015-11-26 at 16:52 +, Duncan wrote: >> For people doing snapshotting in particular, atime updates can be a big >> part of the differences between snapshots, so it's particularly >> important to set noatime if you're snapshotting. > What everything happens when that is left at relatime? > > I'd guess that obviously everytime the atime is updated there will be > some CoW, but only on meta-data blocks, right? Yes. > Does this then lead to fragmentation problems in the meta-data block > groups? I don't believe so. I think individual metadata elements tend to be small enough that several fit in a metadata node (16 KiB by default these days, IIRC), so there's no "metadata fragmentation" to speak of. > And how serious are the effects on space that is eaten up... say I have > n snapshots and access all of their files... then I'd probably get n > times the metadata, right? Which would sound quite dramatic... > > Or is just parts of the metadate copied with new atimes? I think it's whole 4 KiB blocks and possibly whole metadata nodes (16 KiB), copy-on-write, and these would be relatively small changes triggering cow of the entire block/node, aka write amplification. While not too large in themselves, it's the number of them that becomes a problem. IIRC relatime updates once a day on access. If you're doing daily snapshots, updating metadata blocks for all files accessed in the last 24 hours... Again, individual snapshots aren't so much of a problem, and if you're thinning to the 250 snapshots per subvolume or less as I recommend, the problem will remain controlled, but at 250, starting at daily snapshots so they all have atime changes for at least all files accessed during that 24 hours, that's still a sizable set of unnecessarily modified and thus space-taking snapshotted metadata. 
But I wouldn't worry about it too much if you're doing, say, monthly snapshots and only keeping a year's worth or less, 12-13 snapshots per subvolume total.

In my case, I'm on SSDs with their limited write cycles, so while the snapshot thing doesn't affect me since my use-case doesn't involve snapshots, the SSD write cycle count thing certainly does, and noatime is worth it to me for that alone.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
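The space cost of atime updates under snapshots can be put into rough numbers. The following is a back-of-the-envelope sketch, not a description of btrfs internals: it assumes each atime update dirties one metadata leaf of the 16 KiB node size mentioned above, that `inodes_per_leaf` (a hypothetical packing factor) accessed inodes share a leaf, and it ignores the interior tree nodes that get copied as well.

```python
import math

NODE_SIZE = 16 * 1024  # metadata node size in bytes (16 KiB, as quoted above)


def atime_cow_bytes(files_accessed: int, snapshots: int,
                    inodes_per_leaf: int = 1,
                    node_size: int = NODE_SIZE) -> int:
    """Estimate the leaf bytes un-shared by atime updates across snapshots.

    files_accessed:  files read between two consecutive snapshots
    snapshots:       number of snapshots kept
    inodes_per_leaf: assumed number of accessed inodes sharing one leaf
                     (1 is the pessimistic case: every inode in its own leaf)
    """
    leaves_per_snapshot = math.ceil(files_accessed / inodes_per_leaf)
    return leaves_per_snapshot * node_size * snapshots


worst = atime_cow_bytes(10_000, 250)                       # one inode per leaf
packed = atime_cow_bytes(10_000, 250, inodes_per_leaf=50)  # 50 inodes per leaf
print(f"worst case: {worst / 1e9:.1f} GB, packed: {packed / 1e9:.2f} GB")
```

With 10,000 files accessed between each of 250 daily snapshots, this gives about 41 GB in the pessimistic case and roughly 0.8 GB if 50 accessed inodes share a leaf. Real figures depend on how the inodes are laid out, but the estimate illustrates why it is the number of copies, not their individual size, that matters.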
Re: btrfs: poor performance on deleting many large files
On Thu, 2015-11-26 at 23:29 +, Duncan wrote:
> > but only on meta-data blocks, right?
> Yes.

Okay... so at most it'll get the whole meta-data for a snapshot separately and not shared anymore...
And when these are chained as in ZFS,.. it probably amplifies... i.e. a change deep down in the tree changes all the upper elements as well?
Which shouldn't be too big a problem unless I have a lot of snapshots or extremely many files.

> I think it's whole 4 KiB blocks and possibly whole metadata nodes (16
> KiB), copy-on-write, and these would be relatively small changes
> triggering cow of the entire block/node, aka write
> amplification. While
> not too large in themselves, it's the number of them that becomes a
> problem.

Ah... there you say it already =)
But still it's always only meta-data that is copied, never the data, right?!

> IIRC relatime updates once a day on access. If you're doing daily
> snapshots, updating metadata blocks for all files accessed in the
> last 24
> hours...

Yes...
Wouldn't it be a way to handle that problem if btrfs allowed creating snapshots for which the atime never gets updated, regardless of any mount option?

And additionally, allow people to mount subvols with different noatime/relatime/atime settings (unless that's already working)... that way, they could enable it for things where they want/need it,... and disable it where not.

> In my case, I'm on SSD with their limited write cycles, so while the
> snapshot thing doesn't affect me since my use-case doesn't involve
> snapshots, the SSD write cycle count thing certainly does, and
> noatime is
> worth it to me for that alone.

I'm always a bit unsure about that... I used to do it as well, for the wear... but is that really necessary?
With relatime, atime updates happen at most once a day... so at worst you rewrite... what... some 100 MB (at least in the ext2/3/4 case)... and SSDs seem to bear many more write cycles than advertised.

Cheers,
Chris.
Re: btrfs: poor performance on deleting many large files
Mitchell Fossen wrote on 2015/11/25 15:49 -0600: On Mon, 2015-11-23 at 06:29 +, Duncan wrote: Using subvolumes was the first recommendation I was going to make, too, so you're on the right track. =:^) Also, in case you are using it (you didn't say, but this has been demonstrated to solve similar issues for others so it's worth mentioning), try turning btrfs quota functionality off. While the devs are working very hard on that feature for btrfs, the fact is that it's simply still buggy and doesn't work reliably anyway, in addition to triggering scaling issues before they'd otherwise occur. So my recommendation has been, and remains, unless you're working directly with the devs to fix quota issues (in which case, thanks!), if you actually NEED quota functionality, use a filesystem where it works reliably, while if you don't, just turn it off and avoid the scaling and other issues that currently still come with it. I did indeed have quotas turned on for the home directories! Since they were mostly to calculate space used by everyone (since du -hs is so slow) and not actually needed to limit people, I disabled them. [[About quota]] Personally speaking, I'd like to have some comparison between quota enabled and disabled, to help locate if it's quota causing the problem. If you can find a good and reliable reproducer, it would be very helpful for developers to improve btrfs. BTW, it's also a good idea to us ps to locate what process is running at the time your btrfs hangs. If it's kernel thread named btrfs-transaction, then it may be related to quota. As for defrag, that's quite a topic of its own, with complications related to snapshots and the nocow file attribute. Very briefly, if you haven't been running it regularly or using the autodefrag mount option by default, chances are your available free space is rather fragmented as well, and while defrag may help, it may not reduce fragmentation to the degree you'd like. 
(I'd suggest using filefrag to check fragmentation, but it doesn't know how to deal with btrfs compression, and will report heavy fragmentation for compressed files even if they're fine. Since you use compression, that kind of eliminates using filefrag to actually see what your fragmentation is.) Additionally, defrag isn't snapshot aware (they tried it for a few kernels a couple years ago but it simply didn't scale), so if you're using snapshots (as I believe Ubuntu does by default on btrfs, at least taking snapshots for upgrade-in-place), so using defrag on files that exist in the snapshots as well can dramatically increase space usage, since defrag will break the reflinks to the snapshotted extents and create new extents for defragged files. Meanwhile, the absolute worst-case fragmentation on btrfs occurs with random-internal-rewrite-pattern files (as opposed to never changed, or append-only). Common examples are database files and VM images. For /relatively/ small files, to say 256 MiB, the autodefrag mount option is a reasonably effective solution, but it tends to have scaling issues with files over half a GiB so you can call this a negative recommendation for trying that option with half-gig-plus internal-random-rewrite-pattern files. There are other mitigation strategies that can be used, but here the subject gets complex so I'll not detail them. Suffice it to say that if the filesystem in question is used with large VM images or database files and you haven't taken specific fragmentation avoidance measures, that's very likely a good part of your problem right there, and you can call this a hint that further research is called for. If your half-gig-plus files are mostly write-once, for example most media files unless you're doing heavy media editing, however, then autodefrag could be a good option in general, as it deals well with such files and with random-internal-rewrite-pattern files under a quarter gig or so. 
Be aware, however, that if it's enabled on an already heavily fragmented filesystem (as yours likely is), it's likely to actually make performance worse until it gets things under control. Your best bet in that case, if you have spare devices available to do so, is probably to create a fresh btrfs and consistently use autodefrag as you populate it from the existing heavily fragmented btrfs. That way, it'll never have a chance for the fragmentation to build up in the first place, and autodefrag used as a routine mount option should keep it from getting bad in normal use. Thanks for explaining that! Most of these files are written once and then read from for the rest of their "lifetime" until the simulations are done and they get archived/deleted. I'll try leaving autodefrag on and defragging directories over the holiday weekend when no one is using the server. There is some database usage, but I turned off COW for its folder and it only gets used sporadically and shouldn't be a huge factor in day-to-day usage. Also, is there a
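The suggestion earlier in this message to use ps to see what is running when btrfs hangs can also be done by reading /proc directly. The sketch below lists tasks in uninterruptible sleep (state D); the function names are hypothetical, and a task being in D state at one instant is only a hint, not proof of a hang. Note that the kernel truncates thread names to 15 characters, so the transaction thread shows up as `btrfs-transacti`.

```python
import os
from typing import List, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Extract (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            blocked.append((pid, comm))
    return blocked


# Parsing a made-up stat line for a blocked transaction kernel thread:
print(parse_stat("1234 (btrfs-transacti) D 2 0 0 0 -1 2129984"))
```

Calling `uninterruptible_tasks()` a few times during a stall and seeing `btrfs-transacti` repeatedly would match the quota-related pattern described above.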
Re: btrfs: poor performance on deleting many large files
Mitchell Fossen posted on Wed, 25 Nov 2015 15:49:58 -0600 as excerpted:

> Also, is there a recommendation for relatime vs noatime mount options? I
> don't believe anything that runs on the server needs to use file access
> times, so if it can help with performance/disk usage I'm fine with
> setting it to noatime.

FWIW I got tired enough of always setting noatime (for over a decade, since kernel 2.4 and my standardizing on reiserfs back then) that I finally found the spot in the kernel where the relatime default is set, and patched it to be noatime by default. My kernel scripts apply that on top of my git kernel pulls, now.

For people doing snapshotting in particular, atime updates can be a big part of the differences between snapshots, so it's particularly important to set noatime if you're snapshotting. If you're not doing snapshots, it's somewhat less important, but IIRC it was still somewhat more of a performance issue than with ext*. I don't remember the details, but I'd guess it has to do with COWing the metadata triggering metadata fragmentation.

Bottom line, use noatime unless you have something that needs atime. It's not going to hurt for sure, and should improve performance at least somewhat even on ext*.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs: poor performance on deleting many large files
On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
> For people doing snapshotting in particular, atime updates can be a
> big
> part of the differences between snapshots, so it's particularly
> important
> to set noatime if you're snapshotting.

What exactly happens when that is left at relatime?

I'd guess that obviously every time the atime is updated there will be some CoW, but only on meta-data blocks, right?

Does this then lead to fragmentation problems in the meta-data block groups?

And how serious are the effects on space that is eaten up... say I have n snapshots and access all of their files... then I'd probably get n times the metadata, right? Which would sound quite dramatic...

Or are just parts of the metadata copied with new atimes?

Thanks,
Chris.
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On Thu, Nov 26, 2015 at 01:23:59AM +0100, Christoph Anton Mitterer wrote:
> 2) Why does notdatacow imply nodatasum and can that ever be decoupled?

Answering the second part first, no, it can't.

The issue is that nodatacow bypasses the transactional nature of the FS, making changes to live data immediately. This then means that if you modify a nodatacow file, the csum for that modified section is out of date, and won't be back in sync again until the latest transaction is committed. So you can end up with an inconsistent filesystem if there's a crash between the two events.

> For me the checksumming is actually the most important part of btrfs
> (not that I wouldn't like its other features as well)... so turning it
> off is something I really would want to avoid.
>
> Plus it opens questions like: When there are no checksums, how can it
> (in the RAID cases) decide which block is the good one in case of
> corruptions?

It doesn't decide -- both copies look equally good, because there's no checksum, so if you read the data, the FS will return whatever data was on the copy it happened to pick.

> 3) When I would actually disable datacow for e.g. a subvolume that
> holds VMs or DBs... what are all the implications?
> Obviously no checksumming, but what happens if I snapshot such a
> subvolume or if I send/receive it?

After snapshotting, modifications are CoWed precisely once, and then it reverts to nodatacow again. This means that making a snapshot of a nodatacow object will cause it to fragment as writes are made to it.

> I'd expect that then some kind of CoW needs to take place or does that
> simply not work?
>
>
> 4) Duncan mentioned that defrag (and I guess that's also for auto-
> defrag) isn't ref-link aware...
> Isn't that somehow a complete showstopper?

It is, but the one attempt at dealing with it caused massive data corruption, and it was turned off again.
autodefrag, however, has always been snapshot aware and snapshot safe, and would be the recommended approach here. (Actually, it was broken in the same incident I just described -- but fixed again when the broken patches were reverted). > As soon as one uses snapshot, and would defrag or auto defrag any of > them, space usage would just explode, perhaps to the extent of ENOSPC, > and rendering the fs effectively useless. > > That sounds to me like, either I can't use ref-links, which are crucial > not only to snapshots but every file I copy with cp --reflink auto ... > or I can't defrag... which however will sooner or later cause quite > some fragmentation issues on btrfs? > > > 5) Especially keeping (4) in mind but also the other comments in from > Duncan and Austin... > Is auto-defrag now recommended to be generally used? Absolutely, yes. It's late for me, and this email was longer than I suspected, so I'm going to stop here, but I'll try to pick it up again and answer your other questions tomorrow. Hugo. > Are both auto-defrag and defrag considered stable to be used? Or are > there other implications, like when I use compression > > > 6) Does defragmentation work with compression? Or is it just filefrag > which can't cope with it? > > Any other combinations or things with the typicaly btrfs technologies > (cow/nowcow, compression, snapshots, subvols, compressions, defrag, > balance) that one can do but which lead to unexpected problems (I, for > example, wouldn't have expected that defragmentation isn't ref-link > aware... still kinda shocked ;) ) > > For example, when I do a balance and change the compression, and I have > multiple snaphots or files within one subvol that share their blocks... > would that also lead to copies being made and the space growing > possibly dramatically? > > > 7) How das free-space defragmentation happen (or is there even such a > thing)? 
> For example, when I have my big qemu images, *not* using nodatacow, and > I copy the image e.g. with qemu-img old.img new.img ... and delete the > old then. > Then I'd expect that the new.img is more or less not fragmented,... but > will my free space (from the removed old.img) still be completely > messed up sooner or later driving me into problems? > > > 8) why does a balance not also defragment? Since everything is anyway > copied... why not defragmenting it? > I somehow would have hoped that a balance cleans up all kinds of > things,... like free space issues and also fragmentation. > > > Given all these issues,... fragmentation, situations in which space may > grow dramatically where the end-user/admin may not necessarily expect > it (e.g. the defrag or the balance+compression case?)... btrfs seem to > require much more in-depth knowledge and especially care (that even > depends on the type of data) on the end-user/admin side than the > traditional filesystems. > Are there for example any general recommendations what to regularly to > do keep the fs in a clean and proper shape (and I don't count "start > with a fresh one and
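The answer given earlier in this message about mirrored copies without checksums ("both copies look equally good") can be illustrated with a toy model. This is not btrfs code: `zlib.crc32` stands in for the crc32c that btrfs uses by default, the function name is made up, and "copies" are just byte strings.

```python
import zlib
from typing import Optional, Sequence


def read_mirrored(copies: Sequence[bytes], stored_csum: Optional[int],
                  pick: int = 0) -> bytes:
    """Return the data a read would see from a set of mirrored copies.

    stored_csum is None for nodatacow/nodatasum data: there is nothing to
    verify against, so whichever copy was picked is returned as-is.
    """
    if stored_csum is None:
        return copies[pick]
    # With a checksum, try the picked copy first, then fall back to the others.
    order = [pick] + [i for i in range(len(copies)) if i != pick]
    for i in order:
        if zlib.crc32(copies[i]) == stored_csum:
            return copies[i]
    raise IOError("all copies fail checksum verification")


good, corrupted = b"database page", b"databasE page"
csum = zlib.crc32(good)

print(read_mirrored([corrupted, good], csum))  # checksum steers to the good copy
print(read_mirrored([corrupted, good], None))  # no checksum: corruption returned
```

In the second call the reader has no way to know the first copy is bad, which is the practical cost of nodatacow implying nodatasum.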
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Hey.

I've worried before about the topics Mitch has raised. Some questions.

1) AFAIU, the fragmentation problem exists especially for those files that see many random writes, especially, but not limited to, big files. That databases and VMs are affected by this is probably broadly known by now (well, at least by people on this list). But I'd guess there are n other cases where such IO patterns can happen which one simply never notices, while the btrfs continues to degrade.

So is there any general approach towards this? And what are the actual possible consequences? Is it just that the fs gets slower (due to the fragmentation), or may I even run into other issues, to the point that the space is eaten up or the fs becomes basically unusable?

This is especially important for me, because for some VMs and even DBs I wouldn't want to use nodatacow, because I want to have the checksumming. (i.e. those cases where data integrity is much more important than security)

2) Why does nodatacow imply nodatasum, and can that ever be decoupled? For me the checksumming is actually the most important part of btrfs (not that I wouldn't like its other features as well)... so turning it off is something I really would want to avoid.

Plus it opens questions like: when there are no checksums, how can it (in the RAID cases) decide which block is the good one in case of corruptions?

3) When I would actually disable datacow for e.g. a subvolume that holds VMs or DBs... what are all the implications? Obviously no checksumming, but what happens if I snapshot such a subvolume or if I send/receive it? I'd expect that then some kind of CoW needs to take place, or does that simply not work?

4) Duncan mentioned that defrag (and I guess that's also for auto-defrag) isn't ref-link aware... Isn't that somehow a complete showstopper?

As soon as one uses snapshots and defrags or auto-defrags any of them, space usage would just explode, perhaps to the extent of ENOSPC, rendering the fs effectively useless.

That sounds to me like either I can't use ref-links, which are crucial not only to snapshots but to every file I copy with cp --reflink=auto ... or I can't defrag... which however will sooner or later cause quite some fragmentation issues on btrfs?

5) Especially keeping (4) in mind, but also the other comments from Duncan and Austin... Is auto-defrag now recommended to be generally used? Are both auto-defrag and defrag considered stable to be used? Or are there other implications, like when I use compression?

6) Does defragmentation work with compression? Or is it just filefrag which can't cope with it?

Are there any other combinations or things with the typical btrfs technologies (cow/nocow, compression, snapshots, subvols, defrag, balance) that one can do but which lead to unexpected problems? (I, for example, wouldn't have expected that defragmentation isn't ref-link aware... still kinda shocked ;) )

For example, when I do a balance and change the compression, and I have multiple snapshots or files within one subvol that share their blocks... would that also lead to copies being made and the space growing possibly dramatically?

7) How does free-space defragmentation happen (or is there even such a thing)? For example, when I have my big qemu images, *not* using nodatacow, and I copy the image e.g. with qemu-img convert old.img new.img ... and then delete the old one. Then I'd expect that the new.img is more or less not fragmented,... but will my free space (from the removed old.img) still be completely messed up, sooner or later driving me into problems?

8) Why does a balance not also defragment? Since everything is copied anyway... why not defragment it? I somehow would have hoped that a balance cleans up all kinds of things,... like free space issues and also fragmentation.

Given all these issues,... fragmentation, situations in which space may grow dramatically where the end-user/admin may not necessarily expect it (e.g. the defrag or the balance+compression case?)... btrfs seems to require much more in-depth knowledge and especially care (that even depends on the type of data) on the end-user/admin side than the traditional filesystems.

Are there, for example, any general recommendations on what to do regularly to keep the fs in a clean and proper shape (and I don't count "start with a fresh one and copy the data over" as a valid way)?

Thanks,
Chris.
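The worry in question 4 (space usage growing once a reflink-unaware defrag touches snapshotted files) can be illustrated with a toy accounting model. This is only a sketch under simplifying assumptions: "extents" here are plain integer IDs with sizes, all function names are made up for illustration, and nothing below reflects the real btrfs on-disk structures or the actual defrag implementation.

```python
# Toy model only: extents are integer IDs mapped to sizes in MiB.
# None of these names correspond to real btrfs APIs.
from itertools import count

_next_id = count(1)


def new_extents(n, size_mib):
    """Allocate n fresh extents of size_mib each; returns {extent_id: size}."""
    return {next(_next_id): size_mib for _ in range(n)}


def used_space(files, extent_sizes):
    """Space used is the total size of the distinct extents any file references."""
    referenced = set()
    for extents in files.values():
        referenced.update(extents)
    return sum(extent_sizes[e] for e in referenced)


def snapshot(files, src, dst):
    """A snapshot or reflink copy references the same extents; no data is copied."""
    files[dst] = list(files[src])


def defrag_unaware(files, extent_sizes, name):
    """Reflink-unaware defrag: rewrite one file into a single new extent.

    Other files that shared the old extents keep referencing them,
    so the old extents cannot be freed.
    """
    total = sum(extent_sizes[e] for e in files[name])
    fresh = new_extents(1, total)
    extent_sizes.update(fresh)
    files[name] = list(fresh)


extent_sizes = new_extents(100, 10)          # a file in 100 fragments of 10 MiB
files = {"vm.img": list(extent_sizes)}

before = used_space(files, extent_sizes)
snapshot(files, "vm.img", "snap/vm.img")
after_snapshot = used_space(files, extent_sizes)
defrag_unaware(files, extent_sizes, "vm.img")
after_defrag = used_space(files, extent_sizes)

print(before, after_snapshot, after_defrag)  # 1000 1000 2000
```

In this model the snapshot is free, but defragmenting the live file doubles the space in use, because the snapshot still pins every old extent. With several snapshots defragmented separately, each one would end up with its own copy.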
Re: btrfs: poor performance on deleting many large files
On Mon, 2015-11-23 at 06:29 +, Duncan wrote:
> Using subvolumes was the first recommendation I was going to make, too,
> so you're on the right track. =:^)
>
> Also, in case you are using it (you didn't say, but this has been
> demonstrated to solve similar issues for others so it's worth
> mentioning), try turning btrfs quota functionality off. While the devs
> are working very hard on that feature for btrfs, the fact is that it's
> simply still buggy and doesn't work reliably anyway, in addition to
> triggering scaling issues before they'd otherwise occur. So my
> recommendation has been, and remains, unless you're working directly with
> the devs to fix quota issues (in which case, thanks!), if you actually
> NEED quota functionality, use a filesystem where it works reliably, while
> if you don't, just turn it off and avoid the scaling and other issues
> that currently still come with it.
>

I did indeed have quotas turned on for the home directories! Since they were mostly to calculate space used by everyone (since du -hs is so slow) and not actually needed to limit people, I disabled them.

> As for defrag, that's quite a topic of its own, with complications
> related to snapshots and the nocow file attribute. Very briefly, if you
> haven't been running it regularly or using the autodefrag mount option by
> default, chances are your available free space is rather fragmented as
> well, and while defrag may help, it may not reduce fragmentation to the
> degree you'd like. (I'd suggest using filefrag to check fragmentation,
> but it doesn't know how to deal with btrfs compression, and will report
> heavy fragmentation for compressed files even if they're fine. Since you
> use compression, that kind of eliminates using filefrag to actually see
> what your fragmentation is.)
>
> Additionally, defrag isn't snapshot aware (they tried it for a few
> kernels a couple years ago but it simply didn't scale), so if you're
> using snapshots (as I believe Ubuntu does by default on btrfs, at least
> taking snapshots for upgrade-in-place), using defrag on files that
> exist in the snapshots as well can dramatically increase space usage,
> since defrag will break the reflinks to the snapshotted extents and
> create new extents for defragged files.
>
> Meanwhile, the absolute worst-case fragmentation on btrfs occurs with
> random-internal-rewrite-pattern files (as opposed to never changed, or
> append-only). Common examples are database files and VM images. For
> /relatively/ small files, to say 256 MiB, the autodefrag mount option is
> a reasonably effective solution, but it tends to have scaling issues with
> files over half a GiB so you can call this a negative recommendation for
> trying that option with half-gig-plus internal-random-rewrite-pattern
> files. There are other mitigation strategies that can be used, but here
> the subject gets complex so I'll not detail them. Suffice it to say that
> if the filesystem in question is used with large VM images or database
> files and you haven't taken specific fragmentation avoidance measures,
> that's very likely a good part of your problem right there, and you can
> call this a hint that further research is called for.
>
> If your half-gig-plus files are mostly write-once, for example most media
> files unless you're doing heavy media editing, however, then autodefrag
> could be a good option in general, as it deals well with such files and
> with random-internal-rewrite-pattern files under a quarter gig or so. Be
> aware, however, that if it's enabled on an already heavily fragmented
> filesystem (as yours likely is), it's likely to actually make performance
> worse until it gets things under control. Your best bet in that case, if
> you have spare devices available to do so, is probably to create a fresh
> btrfs and consistently use autodefrag as you populate it from the
> existing heavily fragmented btrfs. That way, it'll never have a chance
> for the fragmentation to build up in the first place, and autodefrag used
> as a routine mount option should keep it from getting bad in normal use.

Thanks for explaining that! Most of these files are written once and then read from for the rest of their "lifetime" until the simulations are done and they get archived/deleted. I'll try leaving autodefrag on and defragging directories over the holiday weekend when no one is using the server. There is some database usage, but I turned off COW for its folder and it only gets used sporadically and shouldn't be a huge factor in day-to-day usage.

Also, is there a recommendation for relatime vs noatime mount options? I don't believe anything that runs on the server needs to use file access times, so if it can help with performance/disk usage I'm fine with setting it to noatime.

I just tried copying a 70GB folder and then rm -rf it and it didn't appear to impact performance, and I plan to try some larger
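Since the discussion above turns on which mount options are actually in effect (autodefrag, compress=lzo, relatime vs noatime), here is a small sketch for checking them. It assumes the usual /proc/mounts layout of "device mountpoint fstype options dump pass"; the function name and the sample line (including the device name) are invented for illustration and are not taken from Mitch's system.

```python
def mount_options(mounts_text, mount_point):
    """Return the options of the first entry for mount_point as a dict.

    Flag options map to True, key=value options map to the value string.
    Returns None if the mount point is not listed.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            opts = {}
            for item in fields[3].split(","):
                key, _, value = item.partition("=")
                opts[key] = value or True
            return opts
    return None


# Hypothetical sample line; on a real system read the text from /proc/mounts.
SAMPLE = "/dev/sdb /home btrfs rw,relatime,compress=lzo,space_cache 0 0\n"

opts = mount_options(SAMPLE, "/home")
atime_mode = next(
    (m for m in ("noatime", "relatime", "strictatime") if m in opts),
    "unknown",
)
print("autodefrag enabled:", "autodefrag" in opts)
print("compression:", opts.get("compress", "none"))
print("atime mode:", atime_mode)
```

To check a live system, the sample string could be replaced with the contents of /proc/mounts, for example via open("/proc/mounts").read().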
Re: btrfs: poor performance on deleting many large files
On 2015-11-22 20:43, Mitch Fossen wrote:
> Hi all,
>
> I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu 15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD also running btrfs.
>
> About 6 people use it via ssh and run simulations. One of these simulations generates a lot of intermediate data that can be discarded after it is run, it usually ends up being around 100GB to 300GB spread across dozens of files 500M to 5GB apiece.
>
> The problem is that, when it comes time to do a "rm -rf ~/working_directory" the entire machine locks up and sporadically allows other IO requests to go through, with a 5 to 10 minute delay before other requests seem to be served. It can end up taking half an hour or more to fully remove the offending directory, with the hangs happening frequently enough to be frustrating. This didn't seem to happen when the system was using ext4 on LVM.

Based on this description, this sounds to me like an issue with fragmentation.

> Is there a way to fix this performance issue or at least mitigate it? Would using ionice and the CFQ scheduler help? As far as I know Ubuntu uses deadline by default which ignores ionice values.

This depends on a number of factors. If you are on a new enough kernel, you may actually be using the blk-mq code instead of one of the traditional I/O schedulers, which does honor ionice values, and is generally a lot better than CFQ or deadline at actual fairness and performance. If you aren't running on that code path, then whether deadline or CFQ is better is pretty hard to determine. In general, CFQ needs some serious effort and benchmarking to get reasonable performance out of it. CFQ can beat deadline in performance when properly tuned to the workload (except if you have really small rotational media (smaller than 32G or so), or if you absolutely need deterministic scheduling), but when you don't take the time to tune CFQ, deadline is usually better (except on SSDs, where CFQ is generally better than deadline even without performance tuning).

> Alternatively, would balancing and defragging data more often help? The current mount options are compress=lzo and space_cache, and I will try it with autodefrag enabled as well to see if that helps.

Balance is not likely to help much, but defragmentation might. I would suggest running the defrag when nobody has any other data on the filesystem, as it will likely cause a severe drop in performance the first time it's run. Autodefrag might help, but it may also make performance worse while writing the files in the first place. You might also try with compress=none; depending on your storage hardware, using in-line compression can actually make things go significantly slower (I see this a lot with SSDs, and also with some high-end storage controllers, and especially when dealing with large data-sets that aren't very compressible).

> For now I think I'll recommend that everyone use subvolumes for these runs and then enable user_subvol_rm_allowed.

As Duncan said, this is probably the best option short term. It is worth noting, however, that removing a subvolume still has some overhead (which appears to scale linearly with the amount of data in the subvolume). This overhead isn't likely to be an issue unless a bunch of subvolumes get removed in bulk.
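To find out which I/O scheduler a disk is actually using before deciding between deadline and CFQ, the sysfs file /sys/block/<device>/queue/scheduler can be read; the kernel marks the active scheduler with square brackets, e.g. "noop [deadline] cfq". The following is a minimal sketch, not a tested tool: the function names are made up here, and the fallback for text without brackets rests on the assumption that some blk-mq kernels of that era report just "none".

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Extract the active scheduler from the contents of queue/scheduler.

    The active entry is shown in square brackets, e.g. "noop [deadline] cfq".
    If there are no brackets, return the stripped text as-is (assumption:
    some blk-mq kernels report just "none"), or None for empty input.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    if match:
        return match.group(1)
    stripped = text.strip()
    return stripped or None


def scheduler_for(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler file for a block device such as 'sda'."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


print(active_scheduler("noop [deadline] cfq\n"))  # deadline
print(active_scheduler("none\n"))                 # none
```

On a real machine, scheduler_for("sda") would report the scheduler for /dev/sda, assuming that device exists.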
btrfs: poor performance on deleting many large files
Hi all,

I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu 15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD also running btrfs.

About 6 people use it via ssh and run simulations. One of these simulations generates a lot of intermediate data that can be discarded after it is run, it usually ends up being around 100GB to 300GB spread across dozens of files 500M to 5GB apiece.

The problem is that, when it comes time to do a "rm -rf ~/working_directory" the entire machine locks up and sporadically allows other IO requests to go through, with a 5 to 10 minute delay before other requests seem to be served. It can end up taking half an hour or more to fully remove the offending directory, with the hangs happening frequently enough to be frustrating. This didn't seem to happen when the system was using ext4 on LVM.

Is there a way to fix this performance issue or at least mitigate it? Would using ionice and the CFQ scheduler help? As far as I know Ubuntu uses deadline by default which ignores ionice values.

Alternatively, would balancing and defragging data more often help? The current mount options are compress=lzo and space_cache, and I will try it with autodefrag enabled as well to see if that helps.

For now I think I'll recommend that everyone use subvolumes for these runs and then enable user_subvol_rm_allowed.

Regards,
Mitch Fossen
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
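The subvolume-per-run idea from the last paragraph of the message above could be wrapped roughly as follows. This is a hedged sketch: the helper names and the example path are hypothetical, only the "btrfs subvolume create" and "btrfs subvolume delete" commands themselves are real, and deletion by a non-root user is assumed to require the user_subvol_rm_allowed mount option mentioned in the message. The helper defaults to a dry run, so nothing is executed unless explicitly requested.

```python
import subprocess


def create_subvolume_cmd(path):
    """Command line to create a btrfs subvolume at path."""
    return ["btrfs", "subvolume", "create", path]


def delete_subvolume_cmd(path):
    """Command line to delete the btrfs subvolume at path."""
    return ["btrfs", "subvolume", "delete", path]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical working directory for one simulation run.
workdir = "/home/alice/working_directory"

run(create_subvolume_cmd(workdir))   # before the simulation starts
# ... simulation writes its intermediate data into workdir ...
run(delete_subvolume_cmd(workdir))   # instead of rm -rf workdir
```

The point of the design is that dropping a whole subvolume replaces the file-by-file removal that rm -rf performs.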
Re: btrfs: poor performance on deleting many large files
Mitch Fossen posted on Sun, 22 Nov 2015 19:43:28 -0600 as excerpted:

> Hi all,
>
> I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu
> 15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD also
> running btrfs.
>
> About 6 people use it via ssh and run simulations. One of these
> simulations generates a lot of intermediate data that can be discarded
> after it is run, it usually ends up being around 100GB to 300GB spread
> across dozens of files 500M to 5GB apiece.
>
> The problem is that, when it comes time to do a "rm -rf
> ~/working_directory" the entire machine locks up and sporadically allows
> other IO requests to go through, with a 5 to 10 minute delay before
> other requests seem to be served. It can end up taking half an hour or
> more to fully remove the offending directory, with the hangs happening
> frequently enough to be frustrating. This didn't seem to happen when the
> system was using ext4 on LVM.
>
> Is there a way to fix this performance issue or at least mitigate it?
> Would using ionice and the CFQ scheduler help? As far as I know Ubuntu
> uses deadline by default which ignores ionice values.
>
> Alternatively, would balancing and defragging data more often help? The
> current mount options are compress=lzo and space_cache, and I will try
> it with autodefrag enabled as well to see if that helps.
>
> For now I think I'll recommend that everyone use subvolumes for these
> runs and then enable user_subvol_rm_allowed.

Using subvolumes was the first recommendation I was going to make, too, so you're on the right track. =:^)

Also, in case you are using it (you didn't say, but this has been demonstrated to solve similar issues for others so it's worth mentioning), try turning btrfs quota functionality off. While the devs are working very hard on that feature for btrfs, the fact is that it's simply still buggy and doesn't work reliably anyway, in addition to triggering scaling issues before they'd otherwise occur. So my recommendation has been, and remains, unless you're working directly with the devs to fix quota issues (in which case, thanks!), if you actually NEED quota functionality, use a filesystem where it works reliably, while if you don't, just turn it off and avoid the scaling and other issues that currently still come with it.

As for defrag, that's quite a topic of its own, with complications related to snapshots and the nocow file attribute. Very briefly, if you haven't been running it regularly or using the autodefrag mount option by default, chances are your available free space is rather fragmented as well, and while defrag may help, it may not reduce fragmentation to the degree you'd like. (I'd suggest using filefrag to check fragmentation, but it doesn't know how to deal with btrfs compression, and will report heavy fragmentation for compressed files even if they're fine. Since you use compression, that kind of eliminates using filefrag to actually see what your fragmentation is.)

Additionally, defrag isn't snapshot aware (they tried it for a few kernels a couple years ago but it simply didn't scale), so if you're using snapshots (as I believe Ubuntu does by default on btrfs, at least taking snapshots for upgrade-in-place), using defrag on files that exist in the snapshots as well can dramatically increase space usage, since defrag will break the reflinks to the snapshotted extents and create new extents for defragged files.

Meanwhile, the absolute worst-case fragmentation on btrfs occurs with random-internal-rewrite-pattern files (as opposed to never changed, or append-only). Common examples are database files and VM images. For /relatively/ small files, to say 256 MiB, the autodefrag mount option is a reasonably effective solution, but it tends to have scaling issues with files over half a GiB so you can call this a negative recommendation for trying that option with half-gig-plus internal-random-rewrite-pattern files. There are other mitigation strategies that can be used, but here the subject gets complex so I'll not detail them. Suffice it to say that if the filesystem in question is used with large VM images or database files and you haven't taken specific fragmentation avoidance measures, that's very likely a good part of your problem right there, and you can call this a hint that further research is called for.

If your half-gig-plus files are mostly write-once, for example most media files unless you're doing heavy media editing, however, then autodefrag could be a good option in general, as it deals well with such files and with random-internal-rewrite-pattern files under a quarter gig or so. Be aware, however, that if it's enabled on an already heavily fragmented filesystem (as yours likely is), it's likely to actually make performance worse until it gets things under control. Your best bet in that case, if you have spare devices available to do so, is