Re: is BTRFS_IOC_DEFRAG behavior optimal?

Chris Murphy Wed, 10 Feb 2021 16:21:51 -0800

On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli <kreij...@inwind.it> wrote:
>
> Hi Chris,
>
> it seems that systemd-journald is more smart/complex than I thought:
>
> 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it
> closes the files, it mark again these as COW then defrag [1]


Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015.

But archived journals are still all nocow for me on systemd 247. Is it
because the enclosing directory has file attribute 'C' ?

Another example:

Active journal "system.journal" INODE_ITEM contains
        sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC)

7 day old archived journal "systemd.journal" INODE_ITEM shows:
        sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC)

So if it ever was COW, it flipped to NOCOW before the defrag. Is it expected?


and also this archived file's INODE_ITEM shows
        generation 1748644 transid 1760983 size 16777216 nbytes 16777216

with EXTENT_ITEMs show
        generation 1755533 type 1 (regular)
        generation 1753668 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1753989 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1753526 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1755531 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1755531 type 2 (prealloc)

file tree output for this file
https://pastebin.com/6uDFNDdd


> 2) looking at the code, I suspect that systemd-journald closes the
> file asynchronously [2]. This means that looking at the "live" journal
> is not sufficient. In fact:
>
> /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> [...]
> --------------------- 
> user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> --------------------- 
> system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> --------------------- 
> user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> ---------------C----- 
> system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> ---------------C----- 
> user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> ---------------C----- 
> system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> ---------------C----- 
> user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> ---------------C----- user-1000.journal
> ---------------C----- system.journal
>
> The output above means that the last 6 files are "pending" for a 
> de-fragmentation. When these will be
> "closed", the NOCOW flag will be removed and a defragmentation will start.
>
> Now my journals have few (2 or 3 extents). But I saw cases where the extents
> of the more recent files are hundreds, but after few "journalct --rotate" the 
> older files become less
> fragmented.

Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple and just
dirties extents it considers too small, and they end up just going
through the normal write path, along with anything else pending. And
also that fsync() will set the extents on disk so that the defrag
ioctl know what to dirty, but that ordinarily it's not required and
might have to do with the interleaving write pattern for the journals.

I'm not sure what this ioctl considers big enough that it's worth just
leaving alone. But in any case it sounds like the current write
workload at the time of defrag could affect the allocation, unlike
BTRFS_IOC_DEFRAG_RANGE which has a few knobs to control the outcome.
Or maybe the knobs just influence the outcome. Not sure.

If the device is HDD, it might be nice if the nodatacow journals are
datacow again so they could be compressed. But my evaluation shows
that nodatacow journals stick to an 8MB extent pattern, correlating to
fallocated append as they grow. It's not significantly fragmented to
start out with, whether HDD or SSD.



-- 
Chris Murphy

Re: is BTRFS_IOC_DEFRAG behavior optimal?

Reply via email to