James Cook posted on Mon, 28 Sep 2015 22:51:05 -0700 as excerpted:

> The context of these three questions is that I'm experiencing occasional
> hangs for several seconds while btrfs-transacti works, and very long
> boot times. General comments welcome. System info at bottom,
> end part of dmesg.log attached.
>
> Q1:
>
> I keep a lot of read-only snapshots (> 300 total) of all of my
> subvolumes and haven't deleted any so far. Is this in itself a problem
> or unanticipated use of btrfs?
Very large numbers of snapshots do slow things down, but ~300 isn't what I'd call "very large" -- we're talking thousands to tens of thousands.

My general recommendation is to keep it to ~250ish (under 300) per snapshotted subvolume, preferably under 2000 (and if possible 1000) total. That's easy enough to do even with automated frequent snapshotting (on the order of an hour apart, initially), as long as an equally automated snapshot-thinning program is also established. At ~250 per subvolume, 1000 is four subvolumes' worth, and 2000 is eight.

A bit over 300, assuming they're all of the same subvolume, is getting a bit high, but it shouldn't be causing a lot of trouble yet. It's just time to start thinking about a thinning program.

There's one exception: quotas. Quotas continue to be an issue on btrfs; they're on their third rewrite now, and while the devs believe it will work this time, there are still some serious bugs that will take a couple more kernels to work out. And besides not working right, they dramatically increase scalability issues. So my recommendation, unless you're directly working with the devs to test, report problems with, and bug-trace various quota issues, is just don't run them on btrfs at this time. If you need quotas, use a filesystem where they're mature and work. If you don't, use btrfs without them.

Really. I've seen at least two confirmed cases posted where people running quotas turned them off and their scaling issues disappeared. So if you have them on, that could well be your problem, right there.

> Q2:
>
> I have some files that remain heavily fragmented (according to filefrag)
> even after I run btrfs fi defragment. I think this happens because btrfs
> doesn't want to unlink parts of the files from their snapshotted copies.
> Can I tell btrfs to defragment them anyway, and not worry about wasting
> space? And can I make the autodefrag mount option do this?
>
> For example (not all output shown):
>
> # filefrag *
> ...
> system@1973a03e3af1449ba5dd93362953fd5f-0000000000000001-00051f9377f11af6.journal:
> 553 extents found
> ...
>
> # btrfs fi defragment -rf .
>
> # filefrag *
> ...
> system@1973a03e3af1449ba5dd93362953fd5f-0000000000000001-00051f9377f11af6.journal:
> 331 extents found
> ...

Several points to note, here:

1) Filefrag doesn't understand btrfs compression. If you don't use btrfs compression this doesn't apply, but for btrfs-compressed files, filefrag reports huge numbers of extents -- generally one per btrfs compression block (128 KiB), so 8 per MiB, 8192 per GiB of file size (before compression; not that btrfs gives you a way to see post-compression file size anyway). But unless you run compress-force you won't see it everywhere, because btrfs only compresses some files.

2) Btrfs defrag isn't snapshot-aware, and will only defrag the files it's directly pointed at, using more space as it breaks the reflinks to the snapshotted copy. Around 3.9, snapshot-aware defrag was introduced, but it turned out to have *severe* scalability issues, so it was rolled back and snapshot-aware defrag was turned off again in, IIRC, 3.12 (thus well before what you're running). So worrying about breaking snapshot reflinks while defragging isn't going to be your problem; that, per se, is simply not an issue.

3) What /can/ be an issue is dealt with using defrag's -t parameter. I don't remember what the default target extent size is, but it's somewhat smaller than you might expect, well under a gig. Extent sizes larger than this are considered to be already defragged and aren't touched. (While this does touch on #2 above as well, not unnecessarily breaking reflinks to extents shared with other snapshots, the mechanism is one of extent size, not whether the extent is shared with another snapshot. So even if it's a new file not yet snapshotted, extents over this size won't be touched.)

It's worth keeping in mind that btrfs' nominal data chunk size is 1 GiB.
As such, that's the nominal largest extent size as well, altho in some cases (data chunks created on nearly empty TiB-scale filesystems) data chunk size can be larger, multiple GiB, in which case extent size can be larger as well. Regardless, extent sizes over 1 GiB really aren't going to be a performance issue anyway, so while using the -t 1G or -t 2G option is a good idea and should reduce fragmentation further for extents between the default size and your -t size, going above that isn't likely to do any good.

Which means for multi-gig files, ideal minimum fragmentation will be a number of extents equal to their size in GiB, perhaps plus 2, assuming the first extent is in a partially used data chunk and thus under a gig, with the last being under a gig as well. You are extremely unlikely to get better than that, so for file sizes over a GiB, you /will/ see multiple extents. (Again, keeping in mind the exception where a data chunk is multi-GiB sized itself.) So using defrag's -t 1G is likely to get you somewhat better results, but don't expect multi-gig files to ever be a single extent.

4) Recently (3.17?), btrfs got a new feature that allows it to automatically delete chunk allocations with zero usage. This solved a serious problem: before this, btrfs could allocate new chunks but couldn't deallocate them, and over time and many normal file-creation/deletion cycles, most people ended up with a huge number of empty data chunks, often filling their filesystem until no new metadata chunks could be allocated, resulting in ENOSPC errors despite df reporting tens or occasionally even hundreds of GiB free. (Also possible, but less common, was the reverse case: all otherwise free space allocated to empty metadata chunks, with no space left to allocate new data chunks.) A btrfs balance start -dusage=0 -musage=0 could quickly get rid of them, but before this feature it had to be run manually.
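Points 3 and 4 in command form, a sketch only -- the path and mountpoint are placeholders, and the -t value follows the reasoning above:

```shell
# Defrag with a 1 GiB target extent size (point 3).  Extents already at
# or above the -t size are considered defragged and left alone.
btrfs filesystem defragment -v -r -t 1G /path/to/files

# Manually reclaim zero-usage chunks (point 4) -- only needed on kernels
# without the automatic empty-chunk deletion:
btrfs balance start -dusage=0 -musage=0 /mountpoint
```

Both need root, and the defrag will break reflinks to snapshotted copies as discussed in point 2, so expect space usage to grow accordingly.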
Unfortunately, it appears defrag's previously great strategy of trying to use the space in existing data chunks before allocating new ones hasn't yet been adjusted for this: it still first tries to fill up holes in existing data chunks before allocating new ones.

With btrfs automatically deleting empty chunks now, that means defrag will be working with partially full chunks with free space, but where unfortunately that free space is itself fragmented. So for a large multi-gig file, instead of allocating new GiB-sized chunks and defragging to whole-chunk GiB-sized extents, it's likely to be trying to work with much smaller free-space extents in existing chunks. Obviously this is going to rather negatively affect the results of a defrag, and post-defrag filefrag results will still show higher extent counts than desired. =:^(

This is a known bug, a result of the fairly recent automated deletion of empty data chunks and defrag's reluctance to allocate new ones, that will no doubt be fixed in time, but meanwhile, we deal with the code we have.

Here's an *entirely* *untested* idea for a workaround that I just came up with as I was typing the explanation above. If you'd try it and report back whether it works, we might well have at least some way to work around the issue.

Before your defrag, do a btrfs fi df (or usage), and see what the numbers for data are. (If you haven't run a balance recently and you see a big spread, tens of GiB, between data size and data used, try something like btrfs balance start -dusage=20, which will only try to balance data chunks under 20% usage. This will go much faster than balancing the whole filesystem, and often gets you quite a bit of space returned to unallocated from partially empty data chunks. If it doesn't have the desired effect, try -dusage=50, or even higher, but be aware that above 50 you're dealing with mostly full chunks already, which will take far longer to balance, with much less return for the time taken.
If you do such a balance, check the btrfs fi df results again and see how you did.)

Then do a truncate -s 100G some-not-yet-existing-file (adjusting the 100G as appropriate for your level of unallocated space; see either btrfs fi show or btrfs fi usage), and a sync, to ensure it's allocated in the filesystem. Do another fi df and you should see the data size has increased accordingly. Now delete that file, sync again, and do a fi df to double-check that you now have a bunch of empty data space (a spread between data size and used of near the file's size). Since I think btrfs doesn't delete those empty data chunks immediately, this /should/ give defrag a bunch of empty data chunks to use, and post-defrag filefrag results should end up much better (accounting for the first three points above, of course)! =:^)

You might have to play with the idea a bit, perhaps using fallocate instead of truncate, etc, but as long as you get something to allocate a bunch of otherwise empty data chunks, delete it to empty them, and do the defrag before btrfs has deleted those empty chunks, I think it should work. =:^)

5) Judging by the filename of your example file, you run systemd, and it's a systemd journal file. These files (along with database and VM-image files) have a non-append-only rewrite pattern that is known to be problematic on COW-based filesystems such as btrfs, triggering very high levels of fragmentation.

What version of systemd are you running? 221 and later (219 first tried, but there were bugs in the initial 219 implementation that remained in 220 and weren't worked out until 221) are more btrfs-friendly than previous versions. For reasons I'll explain in a moment, while I run btrfs and systemd (226 ATM), I don't know how well systemd's efforts at btrfs-friendliness turned out, at least not as pertains to the journal, but AFAIK they consisted of two parts, which together should definitely help with the problem.
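Before going on: gathering point 4's *entirely untested* workaround into one sketch, for anyone who wants to try it. The mountpoint and sizes are examples only, and I've written it with fallocate rather than truncate, since fallocate actually allocates the space (the "play with the idea" caveat above):

```shell
MNT=/mnt/btrfs                          # example mountpoint, adjust
btrfs filesystem df "$MNT"              # note data total vs. used spread

# Force allocation of fresh data chunks...
fallocate -l 100G "$MNT/defrag-helper"  # size to suit unallocated space
sync                                    # make sure it hits the filesystem
btrfs filesystem df "$MNT"              # data total should have grown

# ...then free them again immediately before the defrag, so defrag can
# place large extents in the still-allocated, now-empty chunks.
rm "$MNT/defrag-helper"
sync

btrfs filesystem defragment -v -r -t 1G "$MNT"
```

Again, untested; the whole idea depends on btrfs not having deleted the emptied chunks before the defrag runs.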
However, the one part won't trigger on older systems updated to the newer btrfs-friendly systemd without some manual intervention. I'll explain...

First of all, it can be noted that if you're running an older systemd, or if you're on a newer systemd now but it was updated from an older one, you can take these steps manually as well. I'll outline them that way, tho newer systemd would take these steps automatically if installed fresh, not updated from an older installation.

a) With systemd running in emergency/rescue mode, or at least with journaling shut down so it won't interfere...

b) Move/rename the existing journal directories (see the notes under (c); you might wish to do this with all of /var, or only with /var/log or /var/log/journal, the latter being what I'd do) to something else.

c) Create a new btrfs /subvolume/ in place of the old /directory/ that you just moved, so it can be named the same. The reason for this is that btrfs snapshots stop at subvolume boundaries. What we're doing here is taking the journal dir out of future / (or /var) snapshots, by making it a subvolume of its own.

It should be noted that systemd's auto-create mechanism is the systemd-tmpfiles-setup service, as configured in the various tmpfiles.d locations, namely /usr/lib/tmpfiles.d/*, specifically the var.conf file, here. If you look at that file, you'll see that systemd (226 at least) actually creates a directory (d) for /var/log, with the subvolume creation (v) as /var itself. Of course, only if they don't already exist (and v/subvolume-creation gracefully degrades to the d/dir-creation behavior it was before 219, on non-btrfs).

If I were doing it manually, however, I'd create the subvolume as /var/log/journal, keeping /var either as its own subvolume (as systemd now creates it) or as a directory on /, depending on whether I wanted the rest of /var to be snapshotted separately from / or not.
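A minimal sketch of steps (b) and (c) for the /var/log/journal variant I'd do -- run with journald stopped as per (a), and the .old name is just an example:

```shell
# Step (b): move the existing journal directory aside.
mv /var/log/journal /var/log/journal.old

# Step (c): create a same-named btrfs subvolume in its place, so
# snapshots of / (or of /var) now stop at its boundary.
btrfs subvolume create /var/log/journal
```

The old journals in journal.old come back in at step (e), with the caveats described there.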
d) Set /var/log/journal (creating it if you just created /var as its own subvolume) nocow.

Systemd/tmpfiles.d has this (an h line, setting +C) in /usr/lib/tmpfiles.d/journal-nocow.conf. To do it manually, use chattr +C. Do the same for subdirs (the remote subdir and individual machine subdirs), if applicable.

What nocow does is turn off copy-on-write for the file(s) in question, making it/them rewrite-in-place instead of cow. As a (btrfs) side effect, it turns off compression (if otherwise on) and checksumming for the file as well. On a directory, nocow does nothing, /except/ that newly created files (and subdirs) in that directory now inherit the nocow attribute. The idea is that our journal files should be nocow, and with the +C set on the containing subdir, /newly/ /created/ journals should now be just that.

e) Now copy your old journals back in from the renamed backup, but there's a caveat...

On btrfs, for existing files that have nocow set after they already have content (are not zero size), when they actually start behaving as nocow isn't defined. Thus, the files have to be created new in the target directory, in order to ensure that the (inherited) nocow takes effect immediately.

The easiest way to ensure this is to copy them in from a different filesystem (not subvolume). Assuming your memory is sufficiently large, the easiest way to do /that/ is to copy the files to tmpfs, then copy/move them to their new location under the nocow dir. Because it's a cross-filesystem copy/move, that will ensure that they are actually created in the new filesystem, not just reflinked or some such.

(AFAIK, current cp without the reflink option does the right thing anyway, but there has been quite some discussion about making reflink copies the default, in which case you'd have to set --reflink=no to turn it off.
Mv will always take a shortcut and simply create a reflink at the new location, deleting the old one, if it can, which it normally can if both the source and destination are on the same filesystem, so it'd definitely be a problem when moving within the same filesystem. So there are ways to do it without using a separate filesystem as an intermediate, but the easiest way to be sure it's right is to just use the separate filesystem as an intermediate, and not have to worry about whether it's actually a newly created file at the destination, because you /know/ it is due to the cross-filesystem copy/move.)

OK, so why go to all this trouble?

First of all, nocow means the file is (normally) updated in place, so fragmentation isn't an issue -- as long as that remains true. The problem is that snapshots depend on cow, because they lock in place the existing version. With the existing version locked in place, obviously new writes must be cowed elsewhere, killing the intent of nocow.

What btrfs actually does with nocow files when a write comes in after a snapshot is use what some call cow1. The first write to a (4k) block cows it elsewhere, as it normally would, but the file's nocow attribute remains, and further writes to the now-new block write in place to it... until the next snapshot locks it /too/ in place. So frequent snapshots pretty well disable nocow and trigger fragmentation just as if the file wasn't nocow... depending, of course, on how relatively frequent and widely spread out those writes into a file are, vs the frequency of snapshots locking the existing version in place.

So, what systemd did is ensure that newly created journal files are nocow, by setting the attribute on the existing dirs containing them so they'll inherit the attribute at creation. That's half the solution.

Unfortunately, systemd didn't get the other half of the solution, the subvolume side, quite right. They created a subvolume at the /var level.
Which means /var will not be snapshotted with /, OK, but what if you want to snapshot other files in /var? Now you'll set up a snapshotting schedule for the /var subvol, but because /var/log/journal is a directory, not a subvolume of its own, it'll be included in the /var snapshot, triggering the very same cow1 problems on the nocow journals (tho at a different frequency, if /var is snapshotted at a different frequency) as they'd have if they were still part of the main / subvolume!

Which is why I recommend setting up /var/log/journal as its own subvolume, to exclude it from the snapshots above (either /var if it's its own subvol, or / if not), and not snapshotting the journal subvolume at all, thus avoiding snapshot-triggered cow1 fragmentation with the journal files.

Of course not snapshotting the journals is its own tradeoff. If you want to snapshot them, you can do so, but try to do it at as low a frequency as possible, and consider running defrag on the files. Depending on your frequency of snapshotting and how frequently your journal files are cycled based on your journal configuration, if they're actually rotated out within a week or so and you're snapshotting either every week or perhaps every couple days, then fragmentation shouldn't get bad enough due to the cow1 to be worth worrying about defrag.

6) Meanwhile, one final point to note, as hinted above: my own journal configuration, which bypasses all this. Basically, I did this:

a) When I converted to systemd, I kept my old syslog (syslog-ng), only now built with the build-time-optional systemd/journal integration. (FWIW I'm on gentoo, so toggling the systemd integration was as simple as setting the systemd USE flag and rebuilding syslog-ng. However, binary-based distros based on systemd will probably already have such integration enabled in the sysloggers they still ship.)
b) Because systemd's journal has some very nice current-session features (and because it's difficult/impossible to entirely disable in any case, but the features meant I didn't want to), I configured it to handle current-session stuff only, basically by setting things up in journald.conf so it only uses the /run/log/journal location, which being on tmpfs by default is memory-only and thus blown away on reboot or when /run is unmounted. This way I was able to keep the systemctl status last-ten-messages-from-that-service feature, since the journal was still operational, but only to tmpfs. =:^)

c) I then configured my normal syslogger (syslog-ng in my case) to pull from systemd/journald, and to continue to log to the normally configured files in /var/log, as it normally did. And logrotate and various scripts continue to rotate those logs as they normally would.

** Critically, because the normal syslogger writes in append-only mode, its files aren't subject to the fragmentation that rewrite-all-over-the-file files get. Thus, journals are subject to the problem, but the only journals I have are in tmpfs, which is memory-only so access is fast no matter what it does, while the normal logs in (the btrfs) /var/log are append-only and thus not subject to the problem.

So for current-session investigations, I can use the journal with its best features, or the normal syslog files. If I'm investigating something that happened in an earlier session/boot, I don't have the journal to use (but also don't have the problems that writing journals to btrfs brings), but the normal syslog files are still there and usable, just as they always were. IOW, I've lost none of the old log-style features, while still being able to use the new journal features in the current session, where they're of most use anyway.
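The current-session features kept in (b) work as usual even with a volatile journal, since the journal simply lives under /run -- for example (the unit name here is arbitrary):

```shell
# Last few journal lines for a unit, current session:
systemctl status systemd-journald.service

# Or query the journal for the current boot directly:
journalctl -b -u systemd-journald.service -n 10
```

Either draws from the tmpfs journal in /run/log/journal; only earlier boots are missing.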
(As an additional benefit: despite scouring the journald documentation, I found absolutely no way to filter out "noise" entries so they didn't hit the journal at all -- it's possible to filter them on the journal-read side, but not before they're written to the journal. With syslog-ng, and presumably with other syslogs as well, filtering out such "noise" entries before they're written to the logs at all is simple, once the config file format is understood, at least. For noise that happens several times a minute, that's a lot of permanent filesystem writes that I'm avoiding! Of course they're still stored in the tmpfs journals, using memory, but luckily the journal's binary format does compression, so 10K identical noise entries don't actually take up that much more room than just one does.)

d) I did find that with journald journaling a whole session to the /run tmpfs, instead of switching to /var/log/journal once it was available, I did have to increase the size of the /run tmpfs some. However, before systemd, my old init system (the sysvinit-based openrc) only used /run for a few PID files, etc, so I had it clamped down rather tight, to something like 20 MiB max. Now it's 512 MiB, half a GiB, reasonable on my 16 GiB memory system. I then had to tweak the journald.conf settings a bit, so it'd use most of that half-GiB for itself (IIRC it uses only 50% by default, for safety, then stops journaling), while still leaving a bit of room for the standard PID files and the like.

FWIW, these are my non-default journald.conf settings; others remain at default:

ForwardToSyslog=yes
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M
RuntimeMaxUse=448M
Storage=volatile
TTYPath=/dev/tty12

Storage=volatile is the critical one for confining journals to tmpfs. ForwardToSyslog=yes simply tells journald that I have a syslogger too, and I want it to get the logs. The Runtime* settings determine journald's usage on that tmpfs, and TTYPath is more or less unrelated.
(Gentoo defaults to printing the messages log to tty12 when it's syslog, and here I just decided to let journald do that, giving me an easy way to check on the "noise" that syslog-ng is filtering if I want to, instead of having syslog-ng print to that tty. Obviously there's a corresponding non-default setting in syslog-ng's config turning off printing to the tty, there.)

So there you have it: why I don't have to worry about journal behavior on btrfs at all, should you want to do similar. Otherwise, simply follow the recommendations in point #5 (as well as the others), and your problems should go down dramatically, tho obviously I prefer the method I described here in point #6, thus not having to deal with the problem, for journal files at least, at all.

> Q3:
>
> What's the best way to tell which files are causing the hangs? My
> current method is to make an educated guess (e.g. think of programs that
> store large database files) then use filefrag to see if there's
> fragmentation. I'm not confident I've found all the sources of bad
> performance.

That one I don't really have a good answer for... *except* to remind you that systemd, which you do appear to be using, has reasonably good boot-time timing reports available. If the problem's primarily at boot time, or it at least exists there as well, as seems to be the case, you should be able to use those to at least get some idea of which services are taking the time, so you can focus on them.

One more thing for completeness, tho based on the below it's unlikely to apply to you. People report that when their filesystem has many devices (more than four, so not applying to your two), the mount will sometimes take so long it times out, tho they can mount the filesystem fine at the command prompt. One workaround here is to simply lengthen the timeout for that mount service. IIRC some people had ideas for shortening that mount time, but I don't remember what they were.

...
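For those boot-time reports, systemd ships the systemd-analyze tool; something like:

```shell
systemd-analyze time            # kernel vs. userspace startup split
systemd-analyze blame           # per-unit startup time, slowest first
systemd-analyze critical-chain  # units on the critical path to boot-up
```

blame is the quickest way to spot a service dominating boot; critical-chain shows whether it's actually gating anything else.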
And a few more (short) comments below the system info...

> System info:
>
> [root@angel-nixos:~]# uname -a
> Linux angel-nixos 3.18.20 #1-NixOS SMP Thu Jan 1 00:00:01 UTC 1970 x86_64 GNU/Linux
>
> [root@angel-nixos:~]# btrfs --version
> btrfs-progs v4.2
>
> [root@angel-nixos:~]# btrfs fi show
> Label: 'AngelBtrfs' uuid: 7f4b4b5d-1ba5-46cc-b782-938e3600a427
>     Total devices 2 FS bytes used 1.06TiB
>     devid 5 size 2.00TiB used 1.07TiB path /dev/mapper/[snip]
>     devid 6 size 2.00TiB used 1.07TiB path /dev/mapper/[snip]
>
> btrfs-progs v4.2
>
> [root@angel-nixos:~]# btrfs fi df /
> Data, RAID1: total=1.01TiB, used=1.01TiB
> System, RAID1: total=32.00MiB, used=192.00KiB
> Metadata, RAID1: total=60.00GiB, used=58.65GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

3.18 series kernel, second-to-last LTS series, so you're good there. 4.2 userspace, current AFAIK.

Btrfs is raid1 with two devices, and seems a bit more than half full (2 TiB devices, just over 1 TiB data plus 60 GiB metadata on each). The fi df says data is basically full, with metadata close enough as well, but as I implied, nearly a full TiB is unallocated, so it appears the empty-chunk removal is functioning fine, and the filesystem is healthy in terms of space available. All around, pretty healthy! =:^)

The only thing I'd suggest in general is to set up a schedule of snapshot thinning, before it becomes a problem. But at ~300 snapshots, it shouldn't be anything like a problem yet.

And as I said above, disable btrfs quotas if they're turned on (unless you're specifically helping to fix them), as right now they're simply more trouble than they're worth, and we have posts demonstrating that very point.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html