On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli <kreij...@inwind.it> wrote:
>
> Hi Chris,
>
> it seems that systemd-journald is smarter and more complex than I thought:
>
> 1) systemd-journald sets the "live" journal as NOCOW; *when* (see below) it
> closes the files, it marks them as COW again and then defrags them [1]

Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015.
But archived journals are still all nocow for me on systemd 247. Is it
because the enclosing directory has file attribute 'C'?

Another example:

Active journal "system.journal" INODE_ITEM contains
        sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC)

7 day old archived journal "systemd.journal" INODE_ITEM shows:
        sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC)

So if it ever was COW, it flipped back to NOCOW before the defrag. Is
that expected?

Also, this archived file's INODE_ITEM shows
        generation 1748644 transid 1760983 size 16777216 nbytes 16777216

with EXTENT_ITEMs showing
        generation 1755533 type 1 (regular)
        generation 1753668 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1753989 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1753526 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1755531 type 1 (regular)
        generation 1755533 type 1 (regular)
        generation 1755531 type 2 (prealloc)

File tree output for this file:
https://pastebin.com/6uDFNDdd

> 2) looking at the code, I suspect that systemd-journald closes the
> file asynchronously [2]. This means that looking at the "live" journal
> is not sufficient. In fact:
>
> /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> [...]
> --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> ---------------C----- user-1000.journal
> ---------------C----- system.journal
>
> The output above means that the last 6 files are "pending" for a
> de-fragmentation. When these are "closed", the NOCOW flag will be
> removed and a defragmentation will start.
>
> Now my journals have few extents (2 or 3). But I saw cases where the
> more recent files have hundreds of extents, but after a few
> "journalctl --rotate" the older files become less fragmented.

Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple: it just
dirties extents it considers too small, and they end up going through
the normal write path along with anything else pending. He also said
that fsync() will set the extents on disk so that the defrag ioctl knows
what to dirty, but that ordinarily it's not required, and that it might
have to do with the interleaving write pattern of the journals.

I'm not sure what size this ioctl considers big enough to be worth
leaving alone. In any case, it sounds like the write workload at the
time of the defrag could affect the allocation, unlike
BTRFS_IOC_DEFRAG_RANGE, which has a few knobs to control the outcome. Or
maybe the knobs just influence the outcome. Not sure.

If the device is an HDD, it might be nice if the nodatacow journals were
made datacow again so they could be compressed. But my evaluation shows
that nodatacow journals stick to an 8MB extent pattern, which correlates
with fallocated appends as they grow. They are not significantly
fragmented to start with, whether on HDD or SSD.

--
Chris Murphy