Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Vito Caputo
On Thu, Feb 11, 2021 at 09:19:07AM -0500, Phillip Susi wrote:
> 
> Phillip Susi writes:
> 
> > Wait, what do you mean the inode nr changes?  I thought the whole point
> > of the block donating thing was that you get a contiguous set of blocks
> > in the new file, then transfer those blocks back to the old inode so
> > that the inode number and timestamps of the file don't change.
> 
> I just tested this with e4defrag and the inode nr does not change.
> Oddly, it refused to improve my archived journals which had 12-15
> fragments.  I finally found /var/log/btmp.1 which despite being less
> than 8mb had several hundred fragments.  e4defrag got it down to 1
> fragment, but for some reason, it is still described by 3 separate
> entries in the extent tree.
> 
> Looking at the archived journals though, I wonder why I am seeing so
> many unwritten areas?  Just the last extent of this file has nearly 4 mb
> that were never written to.  This system has never had an unexpected
> shutdown.  Attached is the extent map.
> 


The mid-journal unwritten areas are likely entry arrays.  They grow
exponentially and get filled in as more entries containing their
respective objects are appended.  If you're unfamiliar with the
format, there's a chain of entry arrays constructed per recurring data
object.

At the end of the journal, some unwritten space is currently expected
due to the 8MiB fallocate granularity.  A future version will likely
truncate this off when archiving.

I added a journal object layout introspection feature to jio [0],
which you might find useful for correlating the extent list with the
application-level object list.

You can access the feature by running `jio report layout`; it produces
a .layout file in the cwd for every journal it opens.  Here's a
sample:

---8<---8<---8<---8<
Layout for "user-1000.journal"
Legend:
? OBJECT_UNUSED
d OBJECT_DATA
f OBJECT_FIELD
e OBJECT_ENTRY
D OBJECT_DATA_HASH_TABLE
F OBJECT_FIELD_HASH_TABLE
A OBJECT_ENTRY_ARRAY
t OBJECT_TAG

|N|object spans N page boundaries (page size used=4096)
|  single page boundary
+N N bytes of alignment padding
+  single byte of alignment padding

F|5344 D|448|1834896 d81+7 f50+6 d74+6 f48 d82+6 f55+ d84+4 f57+7 d79+ f50+6 
d104 f47+ d73+7 f44+4 d73+7 f44+4 d73+7 f44+4 d72 f45+3 d76+4 f44+4 d75+5 f48 
d90+6 f54+2 d80 f54+2 d84+4 f55+ d123+5 f55+ d82+6 f56 d87+ f58+6 d93+3 f53+3 
d|94+2 f54+2 d91+5 f59+5 d119+ f62+2 d107+5 f66+6 d105+7 f48 d108+4 f51+5 d82+6 
f49+7 e480 A56 d97+7 d107+5 e480 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 
A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 A56 d142+2 d70+2 d107+5 e|480 
d74+6 d148+4 d107+5 e480 A56 d78+2 d122+6 d72 d107+5 e480 A88 d79+ d73+7 d107+5 
e480 A88 A88 A88 A56 A88 A88 A88 A88 A88 A88 A88 A88 A|88 A88 A88 A88 A88 A88 
A88 A88 A88 d97+7 d107+5 e480 A88 A56 A56 d107+5 e480 A56 d107+5 e480 A56 A56 
d107+5 e480 A88 A56 
---8<---8<---8<---8<
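
If it helps, the extent side of that correlation can come from filefrag's
verbose output (assuming e2fsprogs' filefrag, which works on btrfs too via
FIEMAP), e.g.:

  filefrag -v user-1000.journal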

Regards,
Vito Caputo

[0] git://git.pengaru.com/jio   (clone recursively w/--recursive)


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi


Colin Guthrie writes:

> I think the defaults are more complex than just "each journal file can 
> grow to 128M" no?

Not as far as I can see.

> I mean there is SystemMaxUse= which defaults to 10% of the partition on 
> which journal files live (this is for all journal files, not just the 
> SystemMaxFileSize= which refers to just one file).

That controls when to delete old journals, not when to rotate a
journal.  It looks like you can manually request a rotation, and you can
set a time-based rotation, but it defaults to off, so that leaves
rotating once the file reaches the max size (128M).
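
For reference, a minimal sketch of the knobs involved (the values here are
examples, not the defaults; journald.conf(5) has the authoritative ones):

  [Journal]
  SystemMaxUse=1G         # total disk budget; old archived journals get deleted to honor it
  SystemMaxFileSize=128M  # per-file cap; reaching it triggers rotation
  MaxFileSec=1month       # optional time-based rotation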


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Colin Guthrie

Phillip Susi wrote on 11/02/2021 16:29:


Colin Guthrie writes:


Are those journal files suffixed with a ~?  Only ~-suffixed journals
represent a dirty journal file (i.e. from an unexpected shutdown).


Nope.


Journals rotate for other reasons too (e.g. user request, overall space
requirements etc.) which might explain this wasted space?


I've made no requests to rotate and my config is default, which AFAICS
means it only rotates when the log hits the max size of 128MB.  Thus I
wouldn't expect to really see any holes in the log, especially in the
middle.


I think the defaults are more complex than just "each journal file can 
grow to 128M" no?


I mean there is SystemMaxUse= which defaults to 10% of the partition on 
which journal files live (this is for all journal files, not just the 
SystemMaxFileSize= which refers to just one file).


The default semantics are described in the journald.conf(5) man page.

Again, this could be a red herring; it's just my first thought.

Col



--

Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi


Colin Guthrie writes:

> Are those journal files suffixed with a ~?  Only ~-suffixed journals
> represent a dirty journal file (i.e. from an unexpected shutdown).

Nope.

> Journals rotate for other reasons too (e.g. user request, overall space
> requirements etc.) which might explain this wasted space?

I've made no requests to rotate and my config is default, which AFAICS
means it only rotates when the log hits the max size of 128MB.  Thus I
wouldn't expect to really see any holes in the log, especially in the
middle.



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Colin Guthrie

Phillip Susi wrote on 11/02/2021 14:19:

Looking at the archived journals though, I wonder why I am seeing so
many unwritten areas?  Just the last extent of this file has nearly 4 mb
that were never written to.  This system has never had an unexpected
shutdown.  Attached is the extent map.


Are those journal files suffixed with a ~?  Only ~-suffixed journals
represent a dirty journal file (i.e. from an unexpected shutdown).


Journals rotate for other reasons too (e.g. user request, overall space
requirements etc.) which might explain this wasted space?


Just a thought.

Col


--

Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi

Phillip Susi writes:

> Wait, what do you mean the inode nr changes?  I thought the whole point
> of the block donating thing was that you get a contiguous set of blocks
> in the new file, then transfer those blocks back to the old inode so
> that the inode number and timestamps of the file don't change.

I just tested this with e4defrag and the inode nr does not change.
Oddly, it refused to improve my archived journals which had 12-15
fragments.  I finally found /var/log/btmp.1 which despite being less
than 8mb had several hundred fragments.  e4defrag got it down to 1
fragment, but for some reason, it is still described by 3 separate
entries in the extent tree.

Looking at the archived journals though, I wonder why I am seeing so
many unwritten areas?  Just the last extent of this file has nearly 4 mb
that were never written to.  This system has never had an unexpected
shutdown.  Attached is the extent map.

Filesystem type is: ef53
File size of 
system@13a67b4b418d4869b37247eda6ebe494-00151338-0005b9ee46d7d4a9.journal
 is 117440512 (28672 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0:1712667..   1712667:  1:
   1:1..2047:1591168..   1593214:   2047:1712668:
   2: 2048..2132:3012608..   3012692: 85:1593215:
   3: 2133..2139:3012693..   3012699:  7: unwritten
   4: 2140..4095:3012700..   3014655:   1956:
   5: 4096..6143:3041280..   3043327:   2048:3014656:
   6: 6144..8191:3010560..   3012607:   2048:3043328:
   7: 8192..9011:3002368..   3003187:820:3012608:
   8: 9012..9013:3003188..   3003189:  2: unwritten
   9: 9014..   10239:3003190..   3004415:   1226:
  10:10240..   11255:3024896..   3025911:   1016:3004416:
  11:11256..   11268:3025912..   3025924: 13: unwritten
  12:11269..   11348:3025925..   3026004: 80:
  13:11349..   11352:3026005..   3026008:  4: unwritten
  14:11353..   11360:3026009..   3026016:  8:
  15:11361..   11364:3026017..   3026020:  4: unwritten
  16:11365..   11373:3026021..   3026029:  9:
  17:11374..   11376:3026030..   3026032:  3: unwritten
  18:11377..   11642:3026033..   3026298:266:
  19:11643..   11688:3026299..   3026344: 46: unwritten
  20:11689..   11961:3026345..   3026617:273:
  21:11962..   11962:3026618..   3026618:  1: unwritten
  22:11963..   12287:3026619..   3026943:325:
  23:12288..   12347:3033088..   3033147: 60:3026944:
  24:12348..   12381:3033148..   3033181: 34: unwritten
  25:12382..   12466:3033182..   3033266: 85:
  26:12467..   12503:3033267..   3033303: 37: unwritten
  27:12504..   13007:3033304..   3033807:504:
  28:13008..   13024:3033808..   3033824: 17: unwritten
  29:13025..   13044:3033825..   3033844: 20:
  30:13045..   13061:3033845..   3033861: 17: unwritten
  31:13062..   13081:3033862..   3033881: 20:
  32:13082..   13098:3033882..   3033898: 17: unwritten
  33:13099..   13642:3033899..   3034442:544:
  34:13643..   13648:3034443..   3034448:  6: unwritten
  35:13649..   13655:3034449..   3034455:  7:
  36:13656..   13660:3034456..   3034460:  5: unwritten
  37:13661..   13667:3034461..   3034467:  7:
  38:13668..   13673:3034468..   3034473:  6: unwritten
  39:13674..   13680:3034474..   3034480:  7:
  40:13681..   13685:3034481..   3034485:  5: unwritten
  41:13686..   13692:3034486..   3034492:  7:
  42:13693..   13698:3034493..   3034498:  6: unwritten
  43:13699..   14276:3034499..   3035076:578:
  44:14277..   14277:3035077..   3035077:  1: unwritten
  45:14278..   14458:3035078..   3035258:181:
  46:14459..   14529:3035259..   3035329: 71: unwritten
  47:14530..   14570:3035330..   3035370: 41:
  48:14571..   14641:3035371..   3035441: 71: unwritten
  49:14642..   14928:3035442..   3035728:287:
  50:14929..   15002:3035729..   3035802: 74: unwritten
  51:15003..   15837:3035803..   

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-10 Thread Phillip Susi


Lennart Poettering writes:

> inode, and then donate the old blocks over. This means the inode nr
> changes, which is something I don't like. Semantically it's only
> marginally better than just creating a new file from scratch.

Wait, what do you mean the inode nr changes?  I thought the whole point
of the block donating thing was that you get a contiguous set of blocks
in the new file, then transfer those blocks back to the old inode so
that the inode number and timestamps of the file don't change.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-10 Thread Lennart Poettering
On Di, 09.02.21 10:17, Phillip Susi (ph...@thesusis.net) wrote:

>
> Chris Murphy writes:
>
> > And I agree 8MB isn't a big deal. Does anyone complain about journal
> > fragmentation on ext4 or xfs? If not, then we come full circle to my
> > second email in the thread which is don't defragment when nodatacow,
> > only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
> > specify 8MB length. That does seem to consistently no op on nodatacow
> > journals which have 8MB extents.
>
> Ok, I agree there.
>
> > The reason I'm dismissive is because the nodatacow fragment case is
> > the same as ext4 and XFS; the datacow fragment case is both
> > spectacular and non-deterministic. The workload will matter where
>
> Your argument seems to be that it's no worse than ext4 and so if we
> don't defrag there, why on btrfs?  Lennart seems to be arguing that the
> only reason systemd doesn't defrag on ext4 is because the ioctl is
> harder to use.

It's not just harder to use, it's uglier: you have to create a new
inode, and then donate the old blocks over. This means the inode nr
changes, which is something I don't like. Semantically it's only
marginally better than just creating a new file from scratch.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-10 Thread Lennart Poettering
On Mo, 08.02.21 22:13, Chris Murphy (li...@colorremedies.com) wrote:

> On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi  wrote:
> >
> >
> > Chris Murphy writes:
> >
> > >> It sounds like you are arguing that it is better to do the wrong thing
> > >> on all SSDs rather than do the right thing on ones that aren't broken.
> > >
> > > No I'm suggesting there isn't currently a way to isolate
> > > defragmentation to just HDDs.
> >
> > Yes, but it sounded like you were suggesting that we shouldn't even try,
> > not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
> > and report that they are rotational, but most aren't stupid, so it's a
> > good idea to disable the defragmentation on drives that report that they
> > are non rotational.
>
> So far I've seen, all USB devices report rotational. All USB flash
> drives, and any SSD in an enclosure.
>
> Maybe some way of estimating rotational based on latency standard
> deviation, and stick that in sysfs, instead of trusting device
> reporting. But in the meantime, the imperfect rule could be do not
> defragment unless it's SCSI/SATA/SAS and it reports it's rotational.

btrfs itself has a knob declaring whether something is ssd or not ssd,
configurable via the mount option. Of course, one would bind any
higher level logic to that same thing, and thus make it btrfs' own
problem, or the admin's.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-09 Thread Phillip Susi


Chris Murphy writes:

> And I agree 8MB isn't a big deal. Does anyone complain about journal
> fragmentation on ext4 or xfs? If not, then we come full circle to my
> second email in the thread which is don't defragment when nodatacow,
> only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
> specify 8MB length. That does seem to consistently no op on nodatacow
> journals which have 8MB extents.

Ok, I agree there.

> The reason I'm dismissive is because the nodatacow fragment case is
> the same as ext4 and XFS; the datacow fragment case is both
> spectacular and non-deterministic. The workload will matter where

Your argument seems to be that it's no worse than ext4 and so if we
don't defrag there, why on btrfs?  Lennart seems to be arguing that the
only reason systemd doesn't defrag on ext4 is because the ioctl is
harder to use.  Maybe it should defrag as well, so he's asking for
actual performance data to evaluate whether the defrag is pointless or
whether maybe ext4 should also start doing a defrag.  At least I think
that's his point.  Personally I agree (and showed the calculations in a
previous post) that 8 MB/fragment is only going to have a negligible
impact on performance and so isn't worth bothering with a defrag, but he
has asked for real-world data...
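
Just to spell that calculation out again with made-up but plausible HDD
numbers (~10 ms per extra seek, ~120 MB/s sequential, both assumptions):

  128 MB file in 8 MB extents -> 16 extents, at most 15 extra seeks
  15 seeks * 10 ms            ~= 0.15 s of added latency
  128 MB / 120 MB/s           ~= 1.1 s just to transfer the data

so even a worst-case full sequential read of an archived journal only gets
a modest fraction slower.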

> And also, only defragmenting on rotation strikes me as leaving
> performance on the table, right? If there is concern about fragmented

No, because fragmentation only causes additional latency on HDD, not SSD.

> But it sounds to me like you want to learn what the performance is of
> journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think
> it's interesting because you're still better off leaving nodatacow
> journals alone, and something still has to be done in the datacow

Except that you're not.  Your definition of better off appears to apply
only on SSD, and only because it is preferable there to have fewer writes
than less fragmentation.  On HDD defragmenting is a good thing.  Lennart
seems to want real world performance data to evaluate just *how* good
and whether it's worth the bother, at least for HDD.  For SSDs, I
believe he agreed that it may as well be shut off there since it
provides no benefit, but your patch kills it on HDDs as well.

> Is there a test mode for journald to just dump a bunch of random stuff
> into the journal to age it? I don't want to wait weeks to get a dozen
> journal files.

The cause of the fragmentation is slowly appending to the file over
time, so if you dump a bunch of data in too quickly you would eliminate
the fragmentation.  To speed things up a little bit, you might try
something like:

while true ; do logger "This is a test log message to act as filler" ; sleep 1 ; done


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Chris Murphy
On Mon, Feb 8, 2021 at 8:20 AM Phillip Susi  wrote:
>
>
> Chris Murphy writes:
>
> > I showed that the archived journals have way more fragmentation than
> > active journals. And the fragments in active journals are
> > insignificant, and can even be reduced by fully allocating the journal
>
> Then clearly this is a problem with btrfs: it absolutely should not be
> making the files more fragmented when asked to defrag them.

I've asked. We'll see..


> > file to final size rather than appending - which has a good chance of
> > fragmenting the file on any file system, not just Btrfs.
>
> And yet, you just said the active journal had minimal fragmentation.

Yes, the extents are consistently 8MB in the nodatacow case, old and
new file system alike. Same as ext4 and XFS.

> That seems to mean that the 8MB fallocates that journald does are working
> well.  Sure, you could probably get fewer fragments by fallocating the
> whole 128MB at once, but there are tradeoffs to that that are not worth
> it.  One fragment per 8MB isn't a big deal.  Ideally a filesystem will
> manage to do better than that (didn't btrfs have a persistent
> reservation system for this purpose?), but it certainly should not
> commonly do worse.

I don't think any of the file systems guarantee a contiguous block
range upon fallocate; they only guarantee that writes to fallocated
space will succeed. i.e. it's a space reservation. But yeah in
practice, 8MB is small enough that chances are you'll see one 8MB
extent.

And I agree 8MB isn't a big deal. Does anyone complain about journal
fragmentation on ext4 or xfs? If not, then we come full circle to my
second email in the thread which is don't defragment when nodatacow,
only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
specify 8MB length. That does seem to consistently no op on nodatacow
journals which have 8MB extents.


> > Further, even *despite* this worse fragmentation of the archived
> > journals, bcc-tools fileslower shows no meaningful latency as a
> > result. I wrote this in the previous email. I don't understand what
> > you want me to show you.
>
> *Of course* it showed no meaningful latency because you did the test on
> an SSD, which has no meaningful latency penalty due to fragmentation.
> The question is how bad is it on HDD.

The reason I'm dismissive is because the nodatacow fragment case is
the same as ext4 and XFS; the datacow fragment case is both
spectacular and non-deterministic. The workload will matter where
these random 4KiB journal writes end up on an HDD. I've seen journals
with hundreds to thousands of extents. I'm not sure what we learn from
me doing a single isolated test on an HDD.

And also, only defragmenting on rotation strikes me as leaving
performance on the table, right? If there is concern about fragmented
archived journals, then isn't there concern about fragmented active
journals?

But it sounds to me like you want to learn what the performance is of
journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think
it's interesting because you're still better off leaving nodatacow
journals alone, and something still has to be done in the datacow
case. It's two extremes. What the performance is doesn't matter, it's
not going to tell you anything you can't already infer from the two
layouts.


> > And since journald offers no ability to disable the defragment on
> > Btrfs, I can't really do a longer term A/B comparison can I?
>
> You proposed a patch to disable it.  Test before and after the patch.

Is there a test mode for journald to just dump a bunch of random stuff
into the journal to age it? I don't want to wait weeks to get a dozen
journal files.


>
> > I did provide data. That you don't like what the data shows: archived
> > journals have more fragments than active journals, is not my fault.
> > The existing "optimization" is making things worse, in addition to
> > adding a pile of unnecessary writes upon journal rotation.
>
> If it is making things worse, that is definitely a bug in btrfs.  It
> might be nice to avoid the writes on SSD though since there is no
> benefit there.

Agreed.


-- 
Chris Murphy


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Chris Murphy
On Mon, Feb 8, 2021 at 7:56 AM Phillip Susi  wrote:
>
>
> Chris Murphy writes:
>
> >> It sounds like you are arguing that it is better to do the wrong thing
> >> on all SSDs rather than do the right thing on ones that aren't broken.
> >
> > No I'm suggesting there isn't currently a way to isolate
> > defragmentation to just HDDs.
>
> Yes, but it sounded like you were suggesting that we shouldn't even try,
> not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
> and report that they are rotational, but most aren't stupid, so it's a
> good idea to disable the defragmentation on drives that report that they
> are non rotational.

So far as I've seen, all USB devices report rotational. All USB flash
drives, and any SSD in an enclosure.

Maybe there's some way of estimating rotational based on latency
standard deviation and sticking that in sysfs, instead of trusting
device reporting. But in the meantime, the imperfect rule could be: do
not defragment unless it's SCSI/SATA/SAS and it reports it's rotational.
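
For what it's worth, both the flag and the transport are visible from
userspace, so a crude version of that rule is scriptable (sda is just an
example device):

  cat /sys/block/sda/queue/rotational   # 1 = device claims to be rotational
  lsblk -d -o NAME,ROTA,TRAN            # ROTA plus transport (sata, usb, nvme, ...)

but as noted above, USB-attached flash will still happily show ROTA=1.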

-- 
Chris Murphy


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Lennart Poettering
On Mo, 08.02.21 10:09, Phillip Susi (ph...@thesusis.net) wrote:

> That's a fair point: if btrfs isn't any worse than other filesystems,
> then why is it the only one that gets a defrag?

As answered elsewhere:

1. only btrfs has a cow mode, where fragmentation is through the roof
   for randomly written files

2. only btrfs has a somewhat nice API for this (i.e. a single
   best-effort ioctl with no params). (ext4 has a defrag API, but it's
   weird, and xfs I never checked, since I never used it)

3. no one was annoyed by journal performance on non-btrfs enough to
   determine if this is worth it.

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Lennart Poettering
On Sa, 06.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:

> The original commit description only mentions COW, it doesn't mention
> being predicated on nodatacow. In effect commit
> f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit
> 3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't
> think they were ever intended to be used together, and combining them
> seems accidental.

Nah, both commits are for a common goal: make access time behaviour
OK'ish on btrfs, where it otherwise is terrible (on rotating media
particularly).

It's optimized for access times, not for minimal iops.

I'd be totally open to revisiting this all and taking iops more into
account, but again, we'd need a bit of profiling that compares access
times, iops, and stuff with and without this, on rotating and ssd.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Lennart Poettering
On Sa, 06.02.21 19:47, Chris Murphy (li...@colorremedies.com) wrote:
> On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi  wrote:
>
> > Chris Murphy writes:
> >
> > > But it gets worse. The way systemd-journald is submitting the journals
> > > for defragmentation is making them more fragmented than just leaving
> > > them alone.
> >
> > Wait, doesn't it just create a new file, fallocate the whole thing, copy
> > the contents, and delete the original?
>
> Same inode, so no. As to the logic, I don't know. I'll ask upstream to
> document it.
>
> > How can that possibly make
> > fragmentation *worse*?
>
> I'm only seeing this pattern with journald journals, and
> BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals.
>
> Meanwhile, active journals exhibit no different pattern from ext4 and
> xfs, no worse fragmentation.

That's not surprising, these file systems don't have a defrag
ioctl with similar generic semantics.

> Is there a VFS API for handling these isues? Should there be? I really
> don't think any application, including journald, should be having to
> micromanage these kinds of things on a case by case basis. General
> problems like this need general solutions.

We don't micromanage. We call a simple, extremely generic ioctl that
takes exactly zero parameters, asking btrfs to do its best.

> > It sounds like you are arguing that it is better to do the wrong thing
> > on all SSDs rather than do the right thing on ones that aren't broken.
>
> No I'm suggesting there isn't currently a way to isolate
> defragmentation to just HDDs.

We could add one. For example, the $SYSTEMD_JOURNAL_DEFRAG env var I
proposed in that other mail could have a special value besides yes/no
of "ssd" or so, where we'd use btrfs' own understanding of whether it's
backed by ssd or rotating media, as controlled with the ssd/nossd mount
option. (though ideally we'd have a better way to query it than to
parse out the mount options string)
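
(Today that would mean something like the following, i.e. exactly the
mount-option string parsing I'd rather avoid:

  findmnt -no OPTIONS -T /var/log/journal | tr ',' '\n' | grep -x ssd

since btrfs exposes the ssd flag in its mount options once it has decided,
or been told, that the device is ssd.)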

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Lennart Poettering
On Fr, 05.02.21 17:44, Chris Murphy (li...@colorremedies.com) wrote:

> On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
>  wrote:
> >
> > On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
> >
> > > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > > it isn't a NOP in that case?
> > >
> > > So, what is the reason for defragmenting journal is BTRFS is
> > > detected? This does not happen at other filesystems. I have read
> > > this thread but has not found a clear answer to this question.
> >
> > btrfs like any file system fragments files with nocow a bit. Without
> > nocow (i.e. with cow) it fragments files horribly, given our write
> > pattern (which is: append something to the end, and update a few
> > pointers in the beginning). By upstream default we set nocow, some
> > downstreams/users undo that however. (this is done via tmpfiles,
> > i.e. journald doesn't actually set nocow ever).
>
> I don't see why it's upstream's problem to solve downstream decisions.
> If they want to (re)enable datacow, then they can also setup some kind
> of service to defragment /var/log/journal/ on a schedule, or they can
> use autodefrag.

There are good reasons to enable cow, even if we default to
nocow. RAID, checksumming, compression, all that. It's not clear that
nocow is perfect, and cow is terrible or vice versa — in reality it's
a very blurry line, and hence we should support both modes, even if we
pick a default we think is in average the better choice. But because
we support both modes and because defragmentation of an unfragmented
file should be a NOP we issue the defrag ioctl too.

Moreover, if we didn't issue the defrag ioctl, there would be no way to
get it.

I mean, to turn this into something constructive: please send a patch
that adds an env var $SYSTEMD_JOURNAL_BTRFS_DEFRAG which when set to 0
will turn off the defrag. If you want to disable this locally, then I
am happy to merge a patch that makes that configurable.

> > When we archive a journal file (i.e stop writing to it) we know it
> > will never receive any further writes. It's a good time to undo the
> > fragmentation (we make no distinction whether heavily fragmented,
> > little fragmented or not at all fragmented on this) and thus for the
> > future make access behaviour better, given that we'll still access the
> > file regularly (because archiving in journald doesn't mean we stop
> > reading it, it just means we stop writing it — journalctl always
> > operates on the full data set). defragmentation happens in the bg once
> > triggered, it's a simple ioctl you can invoke on a file. if the file
> > is not fragmented it shouldn't do anything.
>
> ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
> extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0
>
> What 'len' value does journald use?

We don't call BTRFS_IOC_DEFRAG_RANGE.

Instead, we call BTRFS_IOC_DEFRAG with a NULL parameter.
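
To make the difference concrete, here's a minimal C sketch (not
journald's actual code; the 8 MiB length and 32 MiB threshold in the
second call are arbitrary example values):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/btrfs.h>

  int main(int argc, char **argv) {
          if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

          int fd = open(argv[1], O_RDWR);  /* open writable, to be safe */
          if (fd < 0) { perror("open"); return 1; }

          /* what journald issues: whole-file, zero-parameter, best effort */
          if (ioctl(fd, BTRFS_IOC_DEFRAG, NULL) < 0)
                  perror("BTRFS_IOC_DEFRAG");

          /* the parameterized variant from the strace above */
          struct btrfs_ioctl_defrag_range_args args;
          memset(&args, 0, sizeof(args));
          args.start = 0;
          args.len = 8ULL * 1024 * 1024;          /* only the first 8 MiB */
          args.extent_thresh = 32 * 1024 * 1024;  /* skip extents already >= 32 MiB */
          if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
                  perror("BTRFS_IOC_DEFRAG_RANGE");

          close(fd);
          return 0;
  }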

> > other file systems simply have no such ioctl, and they never fragment
> > as terribly as btrfs can fragment. hence we don't call that ioctl.
>
> I did explain how to avoid the fragmentation in the first place, to
> obviate the need to defragment.
>
> 1. nodatacow. journald does this already
> 2. fallocate the intended final journal file size from the start,
> instead of growing them in 8MB increments.

Not an option, as mentioned. We maintain a bunch of journal files in
parallel, and if we allocated them 100% in advance, then we'd have
really shitty behaviour since we'd allocate a ton of space on disk we
don't actually use, but have nonetheless already taken away from
everything else.

> 3. Don't reflink copy (including snapshot) the journals. This arguably
> is not journald's responsibility but as it creates both the journal/
> directory and $MACHINEID directory, it could make one or both of them
> as subvolumes instead to ensure they're not subject to snapshotting
> from above.

That's nonsense. People do recursive snapshots. nspawn does, machined
does, and so do others. Also, even if recursive snapshots didn't
exist, I am pretty sure people might be annoyed if we just fuck with
their backup strategy, and exclude some files.

> > I'd even be fine dropping it entirely, if someone actually can
> > show the benefits of having the files unfragmented when archived
> > don't outweigh the downside of generating some iops when executing
> > the defragmentation.
>
> I showed that the archived journals have way more fragmentation than
> active journals.

Can you report this to the btrfs maintainers?

Apparently defragmentation is broken on your btrfs then?

(I don't see that here btw)

> And the fragments in active journals are insignificant, and can even
> be reduced by fully allocating the journal file to final size rather
> than appending - which has a good chance of fragmenting the file on
> any file system, not just Btrfs.

Yeah, 

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Phillip Susi


Chris Murphy writes:

> I showed that the archived journals have way more fragmentation than
> active journals. And the fragments in active journals are
> insignificant, and can even be reduced by fully allocating the journal

Then clearly this is a problem with btrfs: it absolutely should not be
making the files more fragmented when asked to defrag them.

> file to final size rather than appending - which has a good chance of
> fragmenting the file on any file system, not just Btrfs.

And yet, you just said the active journal had minimal fragmentation.
That seems to mean that the 8MB fallocates that journald does are working
well.  Sure, you could probably get fewer fragments by fallocating the
whole 128MB at once, but there are tradeoffs to that that are not worth
it.  One fragment per 8MB isn't a big deal.  Ideally a filesystem will
manage to do better than that (didn't btrfs have a persistent
reservation system for this purpose?), but it certainly should not
commonly do worse.

> Further, even *despite* this worse fragmentation of the archived
> journals, bcc-tools fileslower shows no meaningful latency as a
> result. I wrote this in the previous email. I don't understand what
> you want me to show you.

*Of course* it showed no meaningful latency because you did the test on
an SSD, which has no meaningful latency penalty due to fragmentation.
The question is how bad is it on HDD.

> And since journald offers no ability to disable the defragment on
> Btrfs, I can't really do a longer term A/B comparison can I?

You proposed a patch to disable it.  Test before and after the patch.

> I did provide data. That you don't like what the data shows: archived
> journals have more fragments than active journals, is not my fault.
> The existing "optimization" is making things worse, in addition to
> adding a pile of unnecessary writes upon journal rotation.

If it is making things worse, that is definitely a bug in btrfs.  It
might be nice to avoid the writes on SSD though since there is no
benefit there.

> Conversely, you have not provided data proving that nodatacow
> fallocated files on Btrfs are any more fragmented than fallocated
> files on ext4 or XFS.

That's a fair point: if btrfs isn't any worse than other filesystems,
then why is it the only one that gets a defrag?



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Phillip Susi


Chris Murphy writes:

>> It sounds like you are arguing that it is better to do the wrong thing
>> on all SSDs rather than do the right thing on ones that aren't broken.
>
> No I'm suggesting there isn't currently a way to isolate
> defragmentation to just HDDs.

Yes, but it sounded like you were suggesting that we shouldn't even try,
not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
and report that they are rotational, but most aren't stupid, so it's a
good idea to disable the defragmentation on drives that report that they
are non rotational.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-06 Thread Andrei Borzenkov
06.02.2021 00:33, Phillip Susi writes:
> 
> Lennart Poettering writes:
> 
>> journalctl gives you one long continues log stream, joining everything
>> available, archived or not into one big interleaved stream.
> 
> If you ask for everything, yes... but if you run journalctl -b then
> shouldn't it only read back until it finds the start of the current
> boot?


Ever tried "systemctl status" on an HDD with a large amount of archived
journal data? It can easily take minutes...


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-06 Thread Chris Murphy
On Fri, Feb 5, 2021 at 8:23 AM Phillip Susi  wrote:

> Chris Murphy writes:
>
> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Wait, doesn't it just create a new file, fallocate the whole thing, copy
> the contents, and delete the original?

Same inode, so no. As to the logic, I don't know. I'll ask upstream to
document it.

> How can that possibly make
> fragmentation *worse*?

I'm only seeing this pattern with journald journals, and
BTRFS_IOC_DEFRAG. But I'm also seeing it with all archived journals.

Meanwhile, active journals exhibit no different pattern from ext4 and
xfs, no worse fragmentation.

Consider other storage technologies where COW and snapshots come into
play. For example, anything based on device-mapper thin provisioning is
going to run into these issues. How it allocates physical extents
isn't up to the file system. Duplicate a file and delete the original,
and you might get a more fragmented file as well. The physical layout is
entirely decoupled from the file system - where the filesystem could
tell you "no fragmentation" and yet it is highly fragmented, or vice
versa. These problems are not unique to Btrfs.

Is there a VFS API for handling these issues? Should there be? I really
don't think any application, including journald, should be having to
micromanage these kinds of things on a case by case basis. General
problems like this need general solutions.


> > All of those archived files have more fragments (post defrag) than
> > they had when they were active. And here is the FIEMAP for the 96MB
> > file which has 92 fragments.
>
> How the heck did you end up with nearly 1 frag per mb?

I didn't do anything special, it's a default configuration. I'll ask
Btrfs developers about it. Maybe it's one of those artifacts of FIEMAP
I mentioned previously. Maybe it's not that badly fragmented to a
drive that's going to reorder reads anyway, to be more efficient about
it.


> > If you want an optimization that's actually useful on Btrfs,
> > /var/log/journal/ could be a nested subvolume. That would prevent any
> > snapshots above from turning the nodatacow journals into datacow
> > journals, which does significantly increase fragmentation (it would in
> > the exact same case if it were a reflink copy on XFS for that matter).
>
> Wouldn't that mean that when you take snapshots, they don't include the
> logs?

That's a snapshot/rollback regime design and policy question.

If you snapshot the subvolume that contains the journals, the journals
will be in the snapshot. The user space tools do not have an option
for recursive snapshots, so snapshotting does end at subvolume
boundaries. If you want journals snapshot, then their enclosing
subvolume would need to be snapshot.
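
As a concrete example (just a sketch, and it assumes the existing
journals can be moved aside while doing it):

  # make the journal directory its own subvolume, so snapshots of the
  # parent stop at its boundary
  btrfs subvolume create /var/log/journal

A snapshot of the parent subvolume then contains only an empty directory
at that path, so the journals are neither turned datacow by the snapshot
nor rolled back with it.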


> That seems like an anti-feature that violates the principle of
> least surprise.  If I make a snapshot of my root, I *expect* it to
> contain my logs.

You can only rollback that which you snapshot. If you snapshot a root
without excluding journals, if you rollback, you rollback the
journals. That's data loss.

(open)suse has a snapshot/rollback regime configured and enabled by
default out of the box. Logs are excluded from it, same as the
bootloader. (Although I'll also note they default to volatile systemd
journals, and use rsyslogd for persistent logs.) Fedora meanwhile does
have persistent journald journals in the root subvolume, but there's
no snapshot/rollback regime enabled out of the box. I'm inclined to
have them excluded, not so much to avoid cow of the nodatacow
journals, but to avoid discontinuity in the journals upon rollback.


>
> > I don't get the iops thing at all. What we care about in this case is
> > latency. A least noticeable latency of around 150ms seems reasonable
> > as a starting point, that's where users realize a delay between a key
> > press and a character appearing. However, if I check for 10ms latency
> > (using bcc-tools fileslower) when reading all of the above journals at
> > once:
> >
> > $ sudo journalctl -D
> > /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
> >
> > Not a single report. None. Nothing took even 10ms. And those journals
> > are more fragmented than your 20 in a 100MB file.
> >
> > I don't have any hard drives to test this on. This is what, 10% of the
> > market at this point? The best you can do there is the same as on SSD.
>
> The above sounded like great data, but not if it was done on SSD.

Right. But also I can't disable the defragmentation in order to do a
proper test on HDD.


> > You can't depend on sysfs to conditionally do defragmentation on only
> > rotational media, too many fragile media claim to be rotating.
>
> It sounds like you are arguing that it is better to do the wrong thing
> on all SSDs rather than do the right thing on ones that aren't broken.

No I'm suggesting there isn't currently a way to isolate
defragmentation to just 

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-06 Thread Vito Caputo
On Fri, Feb 05, 2021 at 05:44:03PM -0700, Chris Murphy wrote:
> On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
>  wrote:
> >
> > On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
> >
> > > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > > it isn't a NOP in that case?
> > >
> > > So, what is the reason for defragmenting journal is BTRFS is
> > > detected? This does not happen at other filesystems. I have read
> > > this thread but has not found a clear answer to this question.
> >
> > btrfs like any file system fragments files with nocow a bit. Without
> > nocow (i.e. with cow) it fragments files horribly, given our write
> > pattern (which is: append something to the end, and update a few
> > pointers in the beginning). By upstream default we set nocow, some
> > downstreams/users undo that however. (this is done via tmpfiles,
> > i.e. journald doesn't actually set nocow ever).
> 
> I don't see why it's upstream's problem to solve downstream decisions.
> If they want to (re)enable datacow, then they can also setup some kind
> of service to defragment /var/log/journal/ on a schedule, or they can
> use autodefrag.
> 

It seems cooperative to me that applications advise the filesystem on
appropriate optimization opportunities.

Taking a step back and looking at what journald is doing, and how and
when these journal files are accessed, it doesn't strike me as
illogical to tell the fs, when archiving, that it's a good time to
defragment the file.

> 
> > When we archive a journal file (i.e stop writing to it) we know it
> > will never receive any further writes. It's a good time to undo the
> > fragmentation (we make no distinction whether heavily fragmented,
> > little fragmented or not at all fragmented on this) and thus for the
> > future make access behaviour better, given that we'll still access the
> > file regularly (because archiving in journald doesn't mean we stop
> > reading it, it just means we stop writing it — journalctl always
> > operates on the full data set). defragmentation happens in the bg once
> > triggered, it's a simple ioctl you can invoke on a file. if the file
> > is not fragmented it shouldn't do anything.
> 
> ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
> extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0
> 
> What 'len' value does journald use?
> 

journald uses BTRFS_IOC_DEFRAG, there is no range argument; it's the
whole file.

I'm inclined to agree with Lennart that this looks more like a btrfs
issue than a journald issue, based on your claims.

journald is arguably Doing The Right Thing by advising btrfs of a
defrag opportunity.  If btrfs can't usefully defragment the file vs.
its layout, it should NOOP the ioctl.  If it's producing more
fragmented files post-defrag, how is that not a btrfs bug?

Some things I didn't see being considered in your comparisons are
filesystem free space, age, and concurrent use.

If your comparisons are on fresh filesystems, fragmentation tends to
be much lower as the business of finding contiguous blocks of free
space is trivial.  Once the filesystem has aged enough to churn
through the available space, fragmentation increases substantially.

When journald is the only writer on an otherwise idle filesystem, it's
less likely to have its allocations interrupted by allocations to
other writers.

To make meaningful measurements of fragmentation and the necessity of
telling the fs "hey, now's a good time to defrag this file I'm no
longer going to write to", you need to look at more worst case
scenarios, not best case.

On a different note, I feel like there's an unnecessarily combative
tone to this discussion.  Maybe it's just me, but it deterred me from
participating up until this point.  

Regards,
Vito Caputo


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-06 Thread Chris Murphy
More data points.

1.
An ext4 file system with a 112M system.journal; it has 15 extents.
From FIEMAP we can pretty much see it's really made from 14 8MB
extents, consistent with multiple appends. And it's the exact same
behavior seen on Btrfs with nodatacow journals.

https://pastebin.com/6vuufwXt

2.
A Btrfs file system with a 24MB system.journal, nodatacow, 4 extents.
The fragments are consistent with #1 as a result of nodatacow
journals.

https://pastebin.com/Y18B2m4h

3.
Continuing from #2, 'journalctl --rotate'

strace shows this results in:
ioctl(31, BTRFS_IOC_DEFRAG) = 0

filefrag shows the result, 17 extents. But this is misleading because
9 of them are in the same position as before, so it seems to be a
minimalist defragment. Btrfs did what was requested but with both
limited impact and efficacy, at least on nodatacow files having
minimal fragmentation to begin with.
https://pastebin.com/1ufErVMs

4.
Continuing from #3, 'btrfs fi defrag -l 32M' pointed to this same file
results in a single extent file.

strace shows this uses
ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=33554432, flags=0,
extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

and filefrag shows the single extent mapping:
https://pastebin.com/429fZmNB

While this is a numeric improvement (no fragmentation), again there's
no proven advantage of defragmenting nodatacow journals on Btrfs. It's
just needlessly contributing to write amplification.

--

The original commit description only mentions COW, it doesn't mention
being predicated on nodatacow. In effect commit
f27a386430cc7a27ebd06899d93310fb3bd4cee7 is obviated by commit
3a92e4ba470611ceec6693640b05eb248d62e32d four months later. I don't
think they were ever intended to be used together, and combining them
seems accidental.

Defragmenting datacow files makes some sense on rotating media. But
that's the exception, not the rule.

--
Chris Murphy


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Chris Murphy
On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
 wrote:
>
> On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting journal is BTRFS is
> > detected? This does not happen at other filesystems. I have read
> > this thread but has not found a clear answer to this question.
>
> btrfs like any file system fragments files with nocow a bit. Without
> nocow (i.e. with cow) it fragments files horribly, given our write
> pattern (which is: append something to the end, and update a few
> pointers in the beginning). By upstream default we set nocow, some
> downstreams/users undo that however. (this is done via tmpfiles,
> i.e. journald doesn't actually set nocow ever).

I don't see why it's upstream's problem to solve downstream decisions.
If they want to (re)enable datacow, then they can also setup some kind
of service to defragment /var/log/journal/ on a schedule, or they can
use autodefrag.
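
A sketch of what such a downstream unit pair could look like (the unit
names and the weekly schedule are made up):

  # journal-defrag.service
  [Unit]
  Description=Defragment archived journal files

  [Service]
  Type=oneshot
  ExecStart=/usr/bin/btrfs filesystem defragment -r /var/log/journal

  # journal-defrag.timer
  [Unit]
  Description=Run journal defragmentation weekly

  [Timer]
  OnCalendar=weekly
  Persistent=true

  [Install]
  WantedBy=timers.target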


> When we archive a journal file (i.e stop writing to it) we know it
> will never receive any further writes. It's a good time to undo the
> fragmentation (we make no distinction whether heavily fragmented,
> little fragmented or not at all fragmented on this) and thus for the
> future make access behaviour better, given that we'll still access the
> file regularly (because archiving in journald doesn't mean we stop
> reading it, it just means we stop writing it — journalctl always
> operates on the full data set). defragmentation happens in the bg once
> triggered, it's a simple ioctl you can invoke on a file. if the file
> is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?

> other file systems simply have no such ioctl, and they never fragment
> as terribly as btrfs can fragment. hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to
obviate the need to defragment.

1. nodatacow. journald does this already
2. fallocate the intended final journal file size from the start,
instead of growing them in 8MB increments (sketched after this list).
3. Don't reflink copy (including snapshot) the journals. This arguably
is not journald's responsibility but as it creates both the journal/
directory and $MACHINEID directory, it could make one or both of them
as subvolumes instead to ensure they're not subject to snapshotting
from above.
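
To illustrate point 2 with a sketch (journald actually grows the file in
8MB fallocate steps, as discussed elsewhere in the thread):

  # reserve the full target size once, up front, instead of 16 x 8MB appends
  fallocate -l 128M some.journal

A single large reservation like that has a better chance of coming out as
one contiguous extent than sixteen separate ones, on any filesystem.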


> I'd even be fine dropping it
> entirely, if someone actually can show the benefits of having the
> files unfragmented when archived don't outweigh the downside of
> generating some iops when executing the defragmentation.

I showed that the archived journals have way more fragmentation than
active journals. And the fragments in active journals are
insignificant, and can even be reduced by fully allocating the journal
file to final size rather than appending - which has a good chance of
fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived
journals, bcc-tools fileslower shows no meaningful latency as a
result. I wrote this in the previous email. I don't understand what
you want me to show you.

And since journald offers no ability to disable the defragment on
Btrfs, I can't really do a longer term A/B comparison can I?


> i.e. someone
> does some profiling, on both ssd and rotating media. Apparently no one
> who cares about this wants to do such research though, and
> hence I remain deeply unimpressed. Let's not try to do such
> optimizations without any data that actually shows it betters things.

I did provide data. That you don't like what the data shows: archived
journals have more fragments than active journals, is not my fault.
The existing "optimization" is making things worse, in addition to
adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow
fallocated files on Btrfs are any more fragmented than fallocated
files on ext4 or XFS.

2-17 fragments on ext4:
https://pastebin.com/jiPhrDzG
https://pastebin.com/UggEiH2J

That behavior is no different for nodatacow fallocated journals on
Btrfs. There's no point in defragmenting these no matter the file
system. I don't have to profile this on HDD; I know that even in the
best case you're not likely (and certainly not guaranteed) to get
fewer fragments than this. Defrag on Btrfs is for the thousands of
fragments case, which is what you get with datacow journals.



-- 
Chris Murphy


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 20:58, Maksim Fomin (ma...@fomin.one) wrote:

> > You know, we issue the btrfs ioctl, under the assumption that if the
> > file is already perfectly defragmented it's a NOP. Are you suggesting
> > it isn't a NOP in that case?
>
> So, what is the reason for defragmenting journal is BTRFS is
> detected? This does not happen at other filesystems. I have read
> this thread but has not found a clear answer to this question.

btrfs like any file system fragments files with nocow a bit. Without
nocow (i.e. with cow) it fragments files horribly, given our write
pattern (which is: append something to the end, and update a few
pointers in the beginning). By upstream default we set nocow, some
downstreams/users undo that however. (this is done via tmpfiles,
i.e. journald doesn't actually set nocow ever).

When we archive a journal file (i.e stop writing to it) we know it
will never receive any further writes. It's a good time to undo the
fragmentation (we make no distinction whether heavily fragmented,
little fragmented or not at all fragmented on this) and thus for the
future make access behaviour better, given that we'll still access the
file regularly (because archiving in journald doesn't mean we stop
reading it, it just means we stop writing it — journalctl always
operates on the full data set). defragmentation happens in the bg once
triggered, it's a simple ioctl you can invoke on a file. if the file
is not fragmented it shouldn't do anything.

other file systems simply have no such ioctl, and they never fragment
as terribly as btrfs can fragment. hence we don't call that ioctl.

I'd be fine to avoid the ioctl if we knew for sure the file is at
worst mildly fragmented, but apparently btrfs is too broken to be able
to implement something like that.  I'd even be fine dropping it
entirely, if someone actually can show the benefits of having the
files unfragmented when archived don't outweigh the downside of
generating some iops when executing the defragmentation. i.e. someone
does some profiling, on both ssd and rotating media. Apparently no one
who cares about this wants to do such research though, and
hence I remain deeply unimpressed. Let's not try to do such
optimizations without any data that actually shows it betters things.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 16:16, Phillip Susi (ph...@thesusis.net) wrote:

>
> Lennart Poettering writes:
>
> > Nope. We always interleave stuff. We currently open all journal files
> > in parallel. The system one and the per-user ones, the current ones
> > and the archived ones.
>
> Wait... every time you look at the journal at all, it has to read back
> through ALL of the archived journals, even if you are only interested in
> information since the last boot that just happened 5 minutes ago?

No, we do not iterate through them. We just read some metadata off the header.
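
(That header metadata is visible with e.g. `journalctl --header
--file=/var/log/journal/<machine-id>/system.journal`, if you want to see
what actually gets read.)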

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> journalctl gives you one long continues log stream, joining everything
> available, archived or not into one big interleaved stream.

If you ask for everything, yes... but if you run journalctl -b then
shouldn't it only read back until it finds the start of the current
boot?
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 20:43, Dave Howorth (syst...@howorth.org.uk) wrote:

> 128 MB files, and I might allocate an extra MB or two for overhead, I
> don't know. So when it first starts there'll be 128 MB allocated and
> 384 MB free. In stable state there'll be 512 MB allocated and nothing
> free. One 128 MB allocated and slowly being used. 384 MB full of
> archive files. You always have between 384 MB and 512 MB of logs
> stored. I don't understand where you're getting your numbers from.

As mentioned elsewhere: we typically have to remove two "almost 128M"
files to get "exactly 128M" of guaranteed space.

And you know, each user gets their own journal. Hence, once a single
user logs a single line another 128M is gone, and if another user then
does it, bam, another 128M is gone.

We can't eat space away like that.

> If you can't figure out which parts of an archived file are useful and
> which aren't then why are you keeping them? Why not just delete them?
> And if you can figure it out then why not do so and compact the useful
> information into the minimum storage?

We archive for multiple reasons: because the file was dirty when we
started up (in which case there apparently was an abnormal shutdown of
the system or journald), or because we rotate and start a new file (or
a time change or whatnot). In the first ("dirty") case we don't touch
the file at all, because it's likely corrupt and we don't want to
corrupt it further. We just rename it so that it gets "~" at the
end. When we archive the "clean" way we mark the file internally as
archived, but before that we sync everything to disk, so that we know
for sure it's all in a good state, and then we don't touch it anymore.
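
The ordering is the important part here: all the data is synced first, and
only then is the "archived" marker written and synced as well. A rough
sketch of that sequence, with placeholder offsets and values (this is not
the real journal file format):

    #include <stdint.h>
    #include <unistd.h>

    #define STATE_OFFSET    16             /* placeholder offset of a state byte */
    #define STATE_ARCHIVED  ((uint8_t) 2)  /* placeholder value */

    int archive_cleanly(int fd)
    {
        /* 1. Make sure every object written so far is safely on disk. */
        if (fsync(fd) < 0)
            return -1;

        /* 2. Only then flip the in-file state to "archived"... */
        uint8_t state = STATE_ARCHIVED;
        if (pwrite(fd, &state, sizeof(state), STATE_OFFSET) != sizeof(state))
            return -1;

        /* 3. ...and persist that marker too. After this, the file is
         *    never written again. */
        return fsync(fd);
    }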

"journalctl" will process all these files, regardless if "dirty"
archived or "clean" archived. It tries hard to make the best of these
files, and varirous codepaths to make sure we don't get confused by
half-written files, and can use as much as possible of the parts that
were written correctly.

Hence, that's why we don't delete corrupted files: because we use as
much of them as we can. Why? Because usually the logs from shortly
before your system died abnormally are the most interesting.

> > Because fs metadata, and because we don't always write files in
> > full. I mean, we often do not, because we start a new file *before*
> > the file would grow beyond the threshold. this typically means that
> > it's typically not enough to delete a single file to get the space we
> > need for a full new one, we usually need to delete two.
>
> Why would you start a new file before the old one is full?

Various reasons: the user asked for rotation or vacuuming; because of
an abnormal shutdown; because of a time change (we want individual
files to be monotonically ordered), …

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Maksim Fomin writes:
> I would say it depends on whether defragmentation issues are feature
> of btrfs. As Chris mentioned, if root fs is snapshotted,
> 'defragmenting' the journal can actually increase fragmentation. This
> is an example when the problem is caused by a feature (not a bug) in
> btrfs. For example, my 'system.journal' file is currently 16 MB and
> according to filefrag it has 1608 extents (consequence of snapshotted
> rootfs?). It looks too much, if I am not missing some technical

Holy smokes!  How did btrfs manage to butcher that poor file that badly?
It shouldn't be possible for it to be *that* bad.  I mean, that's only
an average of 10kb per fragment!
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Dave Howorth writes:

> PS I'm subscribed to the list. I don't need a copy.

FYI, rather than ask others to go out of their way when replying to you,
you should configure your mail client to set the Reply-To: header to
point to the mailing list address so that other people's mail clients do
what you want automatically.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> Nope. We always interleave stuff. We currently open all journal files
> in parallel. The system one and the per-user ones, the current ones
> and the archived ones.

Wait... every time you look at the journal at all, it has to read back
through ALL of the archived journals, even if you are only interested in
information since the last boot that just happened 5 minutes ago?

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Maksim Fomin
‐‐‐ Original Message ‐‐‐
On Friday, February 5, 2021 3:23 PM, Lennart Poettering 
 wrote:

> On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
> > lenn...@poettering.net wrote:
> >
> > > You want to optimize write pattersn I understand, i.e. minimize
> > > iops. Hence start with profiling iops, i.e. what defrag actually costs
> > > and then weight that agains the reduced access time when accessing the
> > > files. In particular on rotating media.
> >
> > A nodatacow journal on Btrfs is no different than a journal on ext4 or
> > xfs. So I don't understand why you think you also need to defragment
> > the file, only on Btrfs. You cannot do better than you already are
> > with a nodatacow file. That file isn't going to get anymore fragmented
> > in use than it was at creation.
>
> You know, we issue the btrfs ioctl, under the assumption that if the
> file is already perfectly defragmented it's a NOP. Are you suggesting
> it isn't a NOP in that case?

So, what is the reason for defragmenting the journal if BTRFS is
detected? This does not happen on other filesystems. I have read this
thread but have not found a clear answer to this question.

> > But it gets worse. The way systemd-journald is submitting the journals
> > for defragmentation is making them more fragmented than just leaving
> > them alone.
>
> Sounds like a bug in btrfs? systemd is not the place to hack around
> btrfs bugs?

I would say it depends on whether the defragmentation issues are a
feature of btrfs. As Chris mentioned, if the root fs is snapshotted,
'defragmenting' the journal can actually increase fragmentation. This
is an example where the problem is caused by a feature (not a bug) in
btrfs. For example, my 'system.journal' file is currently 16 MB and
according to filefrag it has 1608 extents (a consequence of the
snapshotted rootfs?). That looks like too much, unless I am missing
some technical details (perhaps a filefrag 'extent' is not a real
extent in the case of this fs?). Even if it is a bug in btrfs, it
would make sense to temporarily disable the policy of 'defragmenting
only on BTRFS' in systemd.

I am interested in this issue because for some time (probably from late
2017 until late 2019) I had strange issues with systemd-journald crashing
at boot time because of journal archiving/defragmenting. The setup was as
follows: btrfs on an external hd (not ssd) with full disk encryption.
After an accidental disconnection of the mounted disk (though not in all
such cases), systemd-journald caused a very long stall of the boot process
because of the following loop: systemd-journald tries to archive/defragment
journal files -> it crashes for some reason -> systemd restarts
systemd-journald -> it starts archiving/defragmenting journal files -> it
crashes again -> systemd restarts systemd-journald (my understanding of the
logs after boot). Eventually this loop breaks and the boot process
continues. After login I see that the journal data is fine - at least there
is no evidence of journal data corruption - so I presume the problem was
caused by the archiving/defragmentation policy on btrfs. I used this disk
with an ext4 filesystem from 2014 to 2017 and never had any problem like
that. Eventually I decided to buy a better disk and the problem has not
appeared since, but why systemd defragments the journal only on btrfs
remained a mystery to me.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Dave Howorth
On Fri, 5 Feb 2021 17:44:14 +0100
Lennart Poettering  wrote:
> On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote:
> 
> > On Fri, 5 Feb 2021 16:23:02 +0100
> > Lennart Poettering  wrote:  
> > > I don't think that makes much sense: we rotate and start new
> > > files for a multitude of reasons, such as size overrun, time
> > > jumps, abnormal shutdown and so on. If we'd always leave a fully
> > > allocated file around people would hate us...  
> >
> > I'm not sure about that. The file is eventually going to grow to
> > 128 MB so if there isn't space for it, I might as well know right
> > now as later. And it's not like the space will be available for
> > anything else, it's left free for exactly this log file.  
> 
> let's say you assign 500M space to journald. If you allocate 128M at a
> time, this means the effective unused space is anything between 1M and
> 255M, leaving just 256M of logs around. it's probably surprising that
> you only end up with 255M of logs when you asked for 500M. I'd claim
> that's really shitty behaviour.

If you assign 500 MB for something that accommodates multiples of 128
MB then you're not very bright :) 512 MB by contrast can accommodate 4
128 MB files, and I might allocate an extra MB or two for overhead, I
don't know. So when it first starts there'll be 128 MB allocated and
384 MB free. In stable state there'll be 512 MB allocated and nothing
free. One 128 MB allocated and slowly being used. 384 MB full of
archive files. You always have between 384 MB and 512 MB of logs
stored. I don't understand where you're getting your numbers from.

BTW, I expect my linux systems to stay up from when they're booted
until I tell them to stop, and that's usually quite a while.

> > Or are you talking about left over files after some exceptional
> > event that are only part full? If so, then just deallocate the
> > unwanted empty space from them after you've recovered from the
> > exceptional event.  
> 
> Nah, it doesn't work like this: if a journal file isn't marked clean,
> i.e. was left in some half-written state we won't touch it, but just
> archive it and start a new one. We don't know how much was correctly
> written and how much was not, hence we can't sensibly truncate it. The
> kernel after all is entirely free to decide in which order it syncs
> writte blocks to disk, and hence it quite often happens that stuff at
> the end got synced while stuff in the middle didn't.

If you can't figure out which parts of an archived file are useful and
which aren't then why are you keeping them? Why not just delete them?
And if you can figure it out then why not do so and compact the useful
information into the minimum storage?

> > > Also, we vacuum old journals when allocating and the size
> > > constraints are hit. i.e. if we detect that adding 8M to journal
> > > file X would mean the space used by all journals together would
> > > be above the configure disk usage limits we'll delete the oldest
> > > journal files we can, until we can allocate 8M again. And we do
> > > this each time. If we'd allocate the full file all the time this
> > > means we'll likely remove ~256M of logs whenever we start a new
> > > file. And that's just shitty behaviour.  
> >
> > No it's not; it's exactly what happens most of the time, because all
> > the old log files are exactly the same size because that's why they
> > were rolled over. So freeing just one of those gives exactly the
> > right size space for the new log file. I don't understand why you
> > would want to free two?  
> 
> Because fs metadata, and because we don't always write files in
> full. I mean, we often do not, because we start a new file *before*
> the file would grow beyond the threshold. this typically means that
> it's typically not enough to delete a single file to get the space we
> need for a full new one, we usually need to delete two.

Why would you start a new file before the old one is full? Modulo truly
exceptional events. It's a genuine question - I don't think I've ever
seen it. And sure fs metadata - that just means allocate a bit extra
beyond the round number.

> actually it's even worse: btrfs lies in "df": it only updates counters
> with uncontrolled latency, hence we might actually delete more than
> necessary.

Sorry dunno much about btrfs. I'm planning to get rid of it here soon.

> Lennart

PS I'm subscribed to the list. I don't need a copy.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 16:06, Dave Howorth (syst...@howorth.org.uk) wrote:

> On Fri, 5 Feb 2021 16:23:02 +0100
> Lennart Poettering  wrote:
> > I don't think that makes much sense: we rotate and start new files for
> > a multitude of reasons, such as size overrun, time jumps, abnormal
> > shutdown and so on. If we'd always leave a fully allocated file around
> > people would hate us...
>
> I'm not sure about that. The file is eventually going to grow to 128 MB
> so if there isn't space for it, I might as well know right now as
> later. And it's not like the space will be available for anything else,
> it's left free for exactly this log file.

let's say you assign 500M space to journald. If you allocate 128M at a
time, this means the effective unused space is anything between 1M and
255M, leaving just 256M of logs around. it's probably surprising that
you only end up with 255M of logs when you asked for 500M. I'd claim
that's really shitty behaviour.

> Or are you talking about left over files after some exceptional event
> that are only part full? If so, then just deallocate the unwanted empty
> space from them after you've recovered from the exceptional event.

Nah, it doesn't work like this: if a journal file isn't marked clean,
i.e. was left in some half-written state we won't touch it, but just
archive it and start a new one. We don't know how much was correctly
written and how much was not, hence we can't sensibly truncate it. The
kernel after all is entirely free to decide in which order it syncs
written blocks to disk, and hence it quite often happens that stuff at
the end got synced while stuff in the middle didn't.

> > Also, we vacuum old journals when allocating and the size constraints
> > are hit. i.e. if we detect that adding 8M to journal file X would mean
> > the space used by all journals together would be above the configure
> > disk usage limits we'll delete the oldest journal files we can, until
> > we can allocate 8M again. And we do this each time. If we'd allocate
> > the full file all the time this means we'll likely remove ~256M of
> > logs whenever we start a new file. And that's just shitty behaviour.
>
> No it's not; it's exactly what happens most of the time, because all
> the old log files are exactly the same size because that's why they
> were rolled over. So freeing just one of those gives exactly the right
> size space for the new log file. I don't understand why you would want
> to free two?

Because of fs metadata, and because we don't always write files in
full. I mean, we often do not, because we start a new file *before*
the file would grow beyond the threshold. This typically means that
it's not enough to delete a single file to get the space we need for a
full new one; we usually need to delete two.

actually it's even worse: btrfs lies in "df": it only updates counters
with uncontrolled latency, hence we might actually delete more than
necessary.

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Dave Howorth
On Fri, 5 Feb 2021 16:23:02 +0100
Lennart Poettering  wrote:
> I don't think that makes much sense: we rotate and start new files for
> a multitude of reasons, such as size overrun, time jumps, abnormal
> shutdown and so on. If we'd always leave a fully allocated file around
> people would hate us...

I'm not sure about that. The file is eventually going to grow to 128 MB
so if there isn't space for it, I might as well know right now as
later. And it's not like the space will be available for anything else,
it's left free for exactly this log file.

Or are you talking about left over files after some exceptional event
that are only part full? If so, then just deallocate the unwanted empty
space from them after you've recovered from the exceptional event.

> Also, we vacuum old journals when allocating and the size constraints
> are hit. i.e. if we detect that adding 8M to journal file X would mean
> the space used by all journals together would be above the configure
> disk usage limits we'll delete the oldest journal files we can, until
> we can allocate 8M again. And we do this each time. If we'd allocate
> the full file all the time this means we'll likely remove ~256M of
> logs whenever we start a new file. And that's just shitty behaviour.

No it's not; it's exactly what happens most of the time, because all
the old log files are exactly the same size because that's why they
were rolled over. So freeing just one of those gives exactly the right
size space for the new log file. I don't understand why you would want
to free two?
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Fr, 05.02.21 10:24, Phillip Susi (ph...@thesusis.net) wrote:

>
> Lennart Poettering writes:
>
> > You are focussing only on the one-time iops generated during archival,
> > and are ignoring the extra latency during access that fragmented files
> > cost. Show me that the iops reduction during the one-time operation
> > matters and the extra latency during access doesn't matter and we can
> > look into making changes. But without anything resembling any form of
> > profiling we are just blind people in the fog...
>
> I'm curious why you seem to think that latency accessing old logs is so
> important.  I would think that old logs tend to be accessed very
> rarely.  On such a rare occasion, a few extra mS doesn't seem very
> important to me.  Even if it's on a 5400 rpm drive, typical latency is
> what?  8 mS?  Even with a fragment every 8 MB, that's only going to add
> up to an extra 128 mS to read and parse a 128 MB log file.  Even with no
> fragments it's going to take over 1 second to read that file, so we're
> only talking about a ~11% slow down here, on an operation that is rare
> and you're going to be spending far more time actually looking at the
> log than it took to read off the disk.

journalctl gives you one long continuous log stream, joining everything
available, archived or not, into one big interleaved stream.

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> You are focussing only on the one-time iops generated during archival,
> and are ignoring the extra latency during access that fragmented files
> cost. Show me that the iops reduction during the one-time operation
> matters and the extra latency during access doesn't matter and we can
> look into making changes. But without anything resembling any form of
> profiling we are just blind people in the fog...

I'm curious why you seem to think that latency accessing old logs is so
important.  I would think that old logs tend to be accessed very
rarely.  On such a rare occasion, a few extra ms don't seem very
important to me.  Even if it's on a 5400 rpm drive, typical latency is
what, 8 ms?  Even with a fragment every 8 MB, that's only going to add
up to an extra 128 ms to read and parse a 128 MB log file.  Even with no
fragments it's going to take over 1 second to read that file, so we're
only talking about a ~11% slowdown here, on an operation that is rare,
and you're going to be spending far more time actually looking at the
log than it took to read it off the disk.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Chris Murphy writes:

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Wait, doesn't it just create a new file, fallocate the whole thing, copy
the contents, and delete the original?  How can that possibly make
fragmentation *worse*?

> All of those archived files have more fragments (post defrag) than
> they had when they were active. And here is the FIEMAP for the 96MB
> file which has 92 fragments.

How the heck did you end up with nearly 1 frag per mb?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that matter).

Wouldn't that mean that when you take snapshots, they don't include the
logs?  That seems like an anti-feature that violates the principle of
least surprise.  If I make a snapshot of my root, I *expect* it to
contain my logs.

> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.
>
> I don't have any hard drives to test this on. This is what, 10% of the
> market at this point? The best you can do there is the same as on SSD.

The above sounded like great data, but not if it was done on SSD.  Of
course it doesn't cause latency on an SSD.  I don't know about market
trends, but I stopped trusting my data to SSDs a few years ago when my
ext4 fs kept being corrupted: it appeared that the FTL of the drive
was randomly swapping the contents of different sectors around, since I
found things like the contents of a text file in a block of the inode
table or a directory.

> You can't depend on sysfs to conditionally do defragmentation on only
> rotational media, too many fragile media claim to be rotating.

It sounds like you are arguing that it is better to do the wrong thing
on all SSDs rather than do the right thing on ones that aren't broken.

> Looking at the two original commits, I think they were always in
> conflict with each other, happening within months of each other. They
> are independent ways of dealing with the same problem, where only one
> of them is needed. And the best of the two is fallocate+nodatacow
> which makes the journals behave the same as on ext4 where you also
> don't do defragmentation.

This makes sense.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Lennart Poettering
On Do, 04.02.21 12:51, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
>  wrote:
>
> > You want to optimize write pattersn I understand, i.e. minimize
> > iops. Hence start with profiling iops, i.e. what defrag actually costs
> > and then weight that agains the reduced access time when accessing the
> > files. In particular on rotating media.
>
> A nodatacow journal on Btrfs is no different than a journal on ext4 or
> xfs. So I don't understand why you think you *also* need to defragment
> the file, only on Btrfs. You cannot do better than you already are
> with a nodatacow file. That file isn't going to get anymore fragmented
> in use than it was at creation.

You know, we issue the btrfs ioctl, under the assumption that if the
file is already perfectly defragmented it's a NOP. Are you suggesting
it isn't a NOP in that case?

> If you want to do better, maybe stop appending in 8MB increments?
> Every time you append it's another extent. Since apparently the
> journal files can max out at 128MB before they are rotated, why aren't
> they created 128MB from the very start? That would have a decent
> chance of getting you a file that's 1-4 extents, and it's not going to
> have more extents than that.

You know, there are certainly "perfect" ways to adjust our writing
scheme to match some specific file system on some specific storage
matching some specific user pattern. Thing is though, what might be
ideal for some fs and some user might be terrible for another fs or
another user. We try to find some compromise in the middle, that might
not result in "perfect" behaviour everywhere, but at least reasonable
behaviour.

> Presumably the currently active journal not being fragmented is more
> important than archived journals, because searches will happen on
> recent events more than old events. Right?

Nope. We always interleave stuff. We currently open all journal files
in parallel. The system one and the per-user ones, the current ones
and the archived ones.

> So if you're going to say
> fragmentation matters at all, maybe stop intentionally fragmenting the
> active journal?

We are not *intentionally* fragmenting. Please don't argue on that
level. Not helpful, man.

> Just fallocate the max size it's going to be right off
> the bat? Doesn't matter what file system it is. Once that 128MB
> journal is full, leave it alone, and rotate to a new 128M file. The
> append is what's making them fragmented.

I don't think that makes much sense: we rotate and start new files for
a multitude of reasons, such as size overrun, time jumps, abnormal
shutdown and so on. If we'd always leave a fully allocated file around
people would hate us...

The 8M increase is a middle ground: we don't allocate space for each
log message, and we don't allocate space for everything at once. We
allocate medium-sized chunks at a time.

Also, we vacuum old journals when allocating and the size constraints
are hit, i.e. if we detect that adding 8M to journal file X would push
the space used by all journals together above the configured disk
usage limits, we'll delete the oldest journal files we can, until we
can allocate 8M again. And we do this each time. If we'd allocate the
full file all the time this means we'll likely remove ~256M of logs
whenever we start a new file. And that's just shitty behaviour.
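
For illustration, the allocation-time vacuuming described above amounts to
a loop roughly like the one below. This is a toy model with made-up names
and numbers, not journald code:

    /* Toy model: vacuum oldest archives until one more 8M chunk fits. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define GROW_CHUNK (8ULL * 1024 * 1024)  /* journald grows files in ~8M steps */

    static uint64_t archived[16];   /* sizes of archived journals, oldest first */
    static size_t n_archived;
    static uint64_t active_size;    /* size of the journal currently written to */

    static uint64_t total_use(void)
    {
        uint64_t sum = active_size;
        for (size_t i = 0; i < n_archived; i++)
            sum += archived[i];
        return sum;
    }

    /* Before growing the active file by one chunk, delete (vacuum) the
     * oldest archived journals until the total stays within max_use. */
    static bool make_room(uint64_t max_use)
    {
        size_t victim = 0;
        while (total_use() + GROW_CHUNK > max_use) {
            if (victim >= n_archived)
                return false;                /* nothing left to vacuum */
            printf("vacuuming archive #%zu (%llu bytes)\n",
                   victim, (unsigned long long) archived[victim]);
            archived[victim++] = 0;
        }
        return true;
    }

    int main(void)
    {
        archived[0] = 127ULL << 20;          /* two "almost 128M" archives */
        archived[1] = 127ULL << 20;
        n_archived = 2;
        active_size = 100ULL << 20;          /* the file currently appended to */

        if (make_room(250ULL << 20))         /* 250M configured limit */
            active_size += GROW_CHUNK;       /* now safe to fallocate the chunk */
        return 0;
    }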

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Sounds like a bug in btrfs? systemd is not the place to hack around
btrfs bugs?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that
> matter).

Not sure what the point of that would be... at least when systemd does
snapshots (i.e. systemd-nspawn --template= and so on) they are of
course recursive, so what'd be the point of doing a subvolume there?

> > Somehow I think you are missing what I am asking for: some data that
> > actually shows your optimization is worth it: i.e. that leaving the
> > files fragment doesn't hurt access to the journal badly, and that the
> > number of iops is substantially lowered at the same time.
>
> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those 

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-04 Thread Chris Murphy
On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering
 wrote:

> You want to optimize write pattersn I understand, i.e. minimize
> iops. Hence start with profiling iops, i.e. what defrag actually costs
> and then weight that agains the reduced access time when accessing the
> files. In particular on rotating media.

A nodatacow journal on Btrfs is no different than a journal on ext4 or
xfs. So I don't understand why you think you *also* need to defragment
the file, only on Btrfs. You cannot do better than you already are
with a nodatacow file. That file isn't going to get any more fragmented
in use than it was at creation.

If you want to do better, maybe stop appending in 8MB increments?
Every time you append it's another extent. Since apparently the
journal files can max out at 128MB before they are rotated, why aren't
they created 128MB from the very start? That would have a decent
chance of getting you a file that's 1-4 extents, and it's not going to
have more extents than that.

Presumably the currently active journal not being fragmented is more
important than archived journals, because searches will happen on
recent events more than old events. Right? So if you're going to say
fragmentation matters at all, maybe stop intentionally fragmenting the
active journal? Just fallocate the max size it's going to be right off
the bat? Doesn't matter what file system it is. Once that 128MB
journal is full, leave it alone, and rotate to a new 128M file. The
append is what's making them fragmented.

But it gets worse. The way systemd-journald is submitting the journals
for defragmentation is making them more fragmented than just leaving
them alone.

https://drive.google.com/file/d/1FhffN4WZZT9gZTnG5VWongWJgPG_nlPF/view?usp=sharing

All of those archived files have more fragments (post defrag) than
they had when they were active. And here is the FIEMAP for the 96MB
file which has 92 fragments.

https://drive.google.com/file/d/1Owsd5DykNEkwucIPbKel0qqYyS134-tB/view?usp=sharing

I don't know if it's a bug with the submitted target size by
sd-journald, or if it's a bug in Btrfs. But it doesn't really matter.
There is no benefit to defragmenting nodatacow journals that were
fallocated upon creation.

If you want an optimization that's actually useful on Btrfs,
/var/log/journal/ could be a nested subvolume. That would prevent any
snapshots above from turning the nodatacow journals into datacow
journals, which does significantly increase fragmentation (it would in
the exact same case if it were a reflink copy on XFS for that matter).

> No, but doing this once in a big linear stream when the journal is
> archived might not be so bad if then later on things are much faster
> to access for all future because the files aren't fragmented.

Ok, well, in practice it's worse than doing nothing, so I'm suggesting doing nothing.


> Somehow I think you are missing what I am asking for: some data that
> actually shows your optimization is worth it: i.e. that leaving the
> files fragment doesn't hurt access to the journal badly, and that the
> number of iops is substantially lowered at the same time.

I don't get the iops thing at all. What we care about in this case is
latency. A least noticeable latency of around 150ms seems reasonable
as a starting point, that's where users realize a delay between a key
press and a character appearing. However, if I check for 10ms latency
(using bcc-tools fileslower) when reading all of the above journals at
once:

$ sudo journalctl -D
/mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager

Not a single report. None. Nothing took even 10ms. And those journals
are more fragmented than your 20 in a 100MB file.

I don't have any hard drives to test this on. This is what, 10% of the
market at this point? The best you can do there is the same as on SSD.
You can't depend on sysfs to conditionally do defragmentation on only
rotational media, too many fragile media claim to be rotating.

And by the way, I use Btrfs on SD Card on a Raspberry Pi Zero of all
things. The cards last longer than with other file systems due to net
lower write amplification from native compression. I wouldn't be
surprised if the cards failed sooner if I weren't using compression.
But who knows, maybe Btrfs write amplification compared to ext4 and xfs
constant journaling ends up being a wash. There are a number of
embedded use cases for Btrfs as well. Is compressed F2FS better?
Probably. They have a solution for the wandering trees problem, but
also no snapshots or data checksumming. But I also don't think any of
that is super relevant to the overall topic; I just provide this as a
counter-argument to the claim that Btrfs isn't appropriate for small
cheap storage devices.


> The thing is that we tend to have few active files and many archived
> files, and since we interleave stuff our access patterns are pretty
> bad already, so we don't want to spend even more time on paying for
> extra bad access patterns becuase the archived files are 

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-04 Thread Phillip Susi


Lennart Poettering writes:

> Well, at least on my system here there are still like 20 fragments per
> file. That's not nothin?

In a 100 MB file?  It could be better, but I very much doubt you're
going to notice a difference after defragmenting that.  I may be the nut
that rescued the old ext2 defrag utility from the dustbin of history,
but even I have to admit that it isn't really important to use and there
is a reason why the linux community abandoned it.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-04 Thread Lennart Poettering
On Mi, 03.02.21 23:11, Chris Murphy (li...@colorremedies.com) wrote:

> On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering
>  wrote:
> >
> > Performance is terrible if cow is used on journal files while we write
> > them.
>
> I've done it for a year on NVMe. The latency is so low, it doesn't
> matter.

Maybe do it on rotating media...

> > It would be great if we could turn datacow back on once the files are
> > archived, and then take benefit of compression/checksumming and
> > stuff. not sure if there's any sane API for that in btrfs besides
> > rewriting the whole file, though. Anyone knows?
>
> A compressed file results in a completely different encoding and
> extent size, so it's a complete rewrite of the whole file, regardless
> of the cow/nocow status.
>
> Without compression it'd be a rewrite because in effect it's a
> different extent type that comes with checksums. i.e. a reflink copy
> of a nodatacow file can only be a nodatacow file; a reflink copy of a
> datacow file can only be a datacow file. The conversion between them
> is basically 'cp --reflink=never' and you get a complete rewrite.
>
> But you get a complete rewrite of extents by submitting for
> defragmentation too, depending on the target extent size.
>
> It is possible to do what you want by no longer setting nodatacow on
> the enclosing dir. Create a 0 length journal file, set nodatacow on
> that file, then fallocate it. That gets you a nodatacow active
> journal. And then you can just duplicate it in place with a new name,
> and the result will be datacow and automatically compressed if
> compression is enabled.
>
> But the write hit has already happened by writing journal data into
> this journal file during its lifetime. Just rename it on rotate.
> That's the least IO impact possible at this point. Defragmenting it
> means even more writes, and not much of a gain if any, unless it's
> datacow which isn't the journald default.

You are focussing only on the one-time iops generated during archival,
and are ignoring the extra latency during access that fragmented files
cost. Show me that the iops reduction during the one-time operation
matters and the extra latency during access doesn't matter and we can
look into making changes. But without anything resembling any form of
profiling we are just blind people in the fog...

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-04 Thread Lennart Poettering
On Mi, 03.02.21 22:51, Chris Murphy (li...@colorremedies.com) wrote:

> > > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > > don't really fragment much. I typically see 2-4 extents for the life
> > > of the journal, depending on how many times it's grown, in what looks
> > > like 8MiB increments. The defragment isn't really going to make any
> > > improvement on that, at least not worth submitting it for additional
> > > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > > small amount of extra writes with no meaningful impact to wear, it
> > > probably does have an impact on much more low end flash like USB
> > > sticks, eMMC, and SD Cards. So I figure, let's just drop the
> > > defragmentation step entirely.
> >
> > Quite frankly, given how iops-expensive btrfs is, one probably
> > shouldn't choose btrfs for such small devices anyway. It's really not
> > where btrfs shines, last time I looked.
>
> Btrfs aggressively delays metadata and data allocation, so I don't
> agree that it's expensive.

It's not a matter of agreeing or not. Last time people showed me
benchmarks (which admittedly was 2 or 3 years ago), the number of iops
for typical workloads is typically twice as much as on ext4. Which I
don't really want to criticize, it's just the way that it is. I mean,
maybe they managed to lower the iops since then, but it's not a matter
of "agreeing", it's a matter of showing benchmarks that indicate this
is not a problem anymore.

> But in any case, reading a journal file and rewriting it out, which is
> what defragment does, doesn't really have any benefit given the file
> doesn't fragment much anyway due to (a) nodatacow and (b) fallocate,
> which is what systemd-journald does on Btrfs.

Well, at least on my system here there are still like 20 fragments per
file. That's not nothin?

> > Did you actually check the iops this generates?
>
> I don't understand the relevance.

You want to optimize write patterns, I understand, i.e. minimize
iops. Hence start with profiling iops, i.e. what defrag actually costs,
and then weigh that against the reduced access time when accessing the
files. In particular on rotating media.

> > Not sure it's worth doing these kind of optimizations without any hard
> > data how expensive this really is. It would be premature.
>
> Submitting the journal for defragment in effect duplicates the
> journal. Read all extents, and rewrite those blocks to a new location.
> It's doubling the writes for that journal file. It's not like the
> defragment is free.

No, but doing this once in a big linear stream when the journal is
archived might not be so bad if, later on, things are much faster to
access for the entire future because the files aren't fragmented.

> Somehow I think you're missing what I've asking for, which is to stop
> the unnecessary defragment step because it's not an optimization. It
> doesn't meaningfully reduce fragmentation at all, it just adds write
> amplification.

Somehow I think you are missing what I am asking for: some data that
actually shows your optimization is worth it: i.e. that leaving the
files fragmented doesn't hurt access to the journal badly, and that the
number of iops is substantially lowered at the same time.

The thing is that we tend to have few active files and many archived
files, and since we interleave stuff our access patterns are pretty
bad already, so we don't want to spend even more time paying for
extra bad access patterns because the archived files are fragmented.

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-03 Thread Chris Murphy
On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering
 wrote:
>
> Performance is terrible if cow is used on journal files while we write
> them.

I've done it for a year on NVMe. The latency is so low, it doesn't matter.

> It would be great if we could turn datacow back on once the files are
> archived, and then take benefit of compression/checksumming and
> stuff. not sure if there's any sane API for that in btrfs besides
> rewriting the whole file, though. Anyone knows?

A compressed file results in a completely different encoding and
extent size, so it's a complete rewrite of the whole file, regardless
of the cow/nocow status.

Without compression it'd be a rewrite because in effect it's a
different extent type that comes with checksums. i.e. a reflink copy
of a nodatacow file can only be a nodatacow file; a reflink copy of a
datacow file can only be a datacow file. The conversion between them
is basically 'cp --reflink=never' and you get a complete rewrite.

But you get a complete rewrite of extents by submitting for
defragmentation too, depending on the target extent size.

It is possible to do what you want by no longer setting nodatacow on
the enclosing dir. Create a 0 length journal file, set nodatacow on
that file, then fallocate it. That gets you a nodatacow active
journal. And then you can just duplicate it in place with a new name,
and the result will be datacow and automatically compressed if
compression is enabled.
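
A sketch of that recipe in plain syscalls could look like this
(illustrative only; minimal error handling, and the file name and size
are arbitrary):

    #define _GNU_SOURCE                      /* for fallocate() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>                    /* FS_IOC_*FLAGS, FS_NOCOW_FL */

    int main(void)
    {
        int fd = open("test.journal", O_CREAT|O_EXCL|O_RDWR|O_CLOEXEC, 0640);
        if (fd < 0) { perror("open"); return 1; }

        /* NOCOW can only be applied while the file is still empty. */
        int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("GETFLAGS"); return 1; }
        flags |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("SETFLAGS"); return 1; }

        /* Reserve the space up front, e.g. 128M, so later appends don't
         * have to allocate new extents. */
        if (fallocate(fd, 0, 0, 128ULL * 1024 * 1024) < 0) { perror("fallocate"); return 1; }
        return 0;
    }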

But the write hit has already happened by writing journal data into
this journal file during its lifetime. Just rename it on rotate.
That's the least IO impact possible at this point. Defragmenting it
means even more writes, and not much of a gain if any, unless it's
datacow which isn't the journald default.


-- 
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-03 Thread Chris Murphy
On Wed, Feb 3, 2021 at 9:41 AM Lennart Poettering
 wrote:
>
> On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote:
>
> > f27a386430cc7a27ebd06899d93310fb3bd4cee7
> > journald: whenever we rotate a file, btrfs defrag it
> >
> > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > don't really fragment much. I typically see 2-4 extents for the life
> > of the journal, depending on how many times it's grown, in what looks
> > like 8MiB increments. The defragment isn't really going to make any
> > improvement on that, at least not worth submitting it for additional
> > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > small amount of extra writes with no meaningful impact to wear, it
> > probably does have an impact on much more low end flash like USB
> > sticks, eMMC, and SD Cards. So I figure, let's just drop the
> > defragmentation step entirely.
>
> Quite frankly, given how iops-expensive btrfs is, one probably
> shouldn't choose btrfs for such small devices anyway. It's really not
> where btrfs shines, last time I looked.

Btrfs aggressively delays metadata and data allocation, so I don't
agree that it's expensive. There is a wandering trees problem that can
result in write amplification, but that's a different problem. And
native compression has been shown to significantly reduce overall
writes.

But in any case, reading a journal file and rewriting it out, which is
what defragment does, doesn't really have any benefit given the file
doesn't fragment much anyway due to (a) nodatacow and (b) fallocate,
which is what systemd-journald does on Btrfs.

It'd make more sense to defragment only if the file is datacow. At
least then it also gets compressed, which isn't the case when it's
nodatacow.




>
> > Further, since they are nodatacow, they can't be submitted for
> > compression. There was a quasi-bug in Btrfs, now fixed, where
> > nodatacow files submitted for defragmentation were compressed. So we no
> > longer get that unintended benefit. This strengthens the case to just
> > drop the defragment step upon rotation, no other changes.
> >
> > What do you think?
>
> Did you actually check the iops this generates?

I don't understand the relevance.

>
> Not sure it's worth doing these kind of optimizations without any hard
> data how expensive this really is. It would be premature.

Submitting the journal for defragment in effect duplicates the
journal. Read all extents, and rewrite those blocks to a new location.
It's doubling the writes for that journal file. It's not like the
defragment is free.


> That said, if there's actual reason to optimize the iops here then we
> could make this smart: there's actually an API for querying
> fragmentation: we could defrag only if we notice the fragmentation is
> really too high.

FIEMAP isn't going to work in the case the files are compressed.
The Btrfs extent size becomes 128KiB in that case, and it looks like
massive fragmentation. So that needs to be made smarter first.

I don't have a problem submitting the journal for a one-time
defragment upon rotation if it's datacow, i.e. if an empty
journal-nocow.conf exists.

But by default, the combination of fallocate and nodatacow already
avoids all meaningful fragmentation, so long as the journals aren't
being snapshotted. If they are, well, that too is a different problem.
If the user does that and we're still defragmenting the files, it'll
explode their space consumption, because defragment is not
snapshot-aware: it results in all shared extents becoming unshared.

> But quite frankly, this sounds polishing things after the horse
> already left the stable: if you want to optimize iops, then don't use
> btrfs. If you bought into btrfs, then apparently you are OK with the
> extra iops it generates, hence also the defrag costs.

Somehow I think you're missing what I've been asking for, which is to stop
the unnecessary defragment step because it's not an optimization. It
doesn't meaningfully reduce fragmentation at all, it just adds write
amplification.


-- 
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-03 Thread Lennart Poettering
On Di, 26.01.21 21:00, Chris Murphy (li...@colorremedies.com) wrote:

> On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy  wrote:
> >
> > f27a386430cc7a27ebd06899d93310fb3bd4cee7
> > journald: whenever we rotate a file, btrfs defrag it
> >
> > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > don't really fragment much. I typically see 2-4 extents for the life
> > of the journal, depending on how many times it's grown, in what looks
> > like 8MiB increments. The defragment isn't really going to make any
> > improvement on that, at least not worth submitting it for additional
> > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > small amount of extra writes with no meaningful impact to wear, it
> > probably does have an impact on much more low end flash like USB
> > sticks, eMMC, and SD Cards. So I figure, let's just drop the
> > defragmentation step entirely.
> >
> > Further, since they are nodatacow, they can't be submitted for
> > compression. There was a quasi-bug in Btrfs, now fixed, where
> > nodatacow files submitted for decompression were compressed. So we no
> > longer get that unintended benefit. This strengthens the case to just
> > drop the defragment step upon rotation, no other changes.
> >
> > What do you think?
>
> A better idea.
>
> Default behavior: journals are nodatacow and are not defragmented.
>
> If '/etc/tmpfiles.d/journal-nocow.conf ` exists, do the reverse.
> Journals are datacow, and files are defragmented (and compressed, if
> it's enabled).

Performance is terrible if cow is used on journal files while we write
them.

It would be great if we could turn datacow back on once the files are
archived, and then take advantage of compression/checksumming and
stuff. Not sure if there's any sane API for that in btrfs besides
rewriting the whole file, though. Anyone know?

Just dropping FS_NOCOW_FL on the existing file doesn't work IIRC; it
can only be changed while a file is empty, last time I looked.

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-03 Thread Lennart Poettering
On Di, 05.01.21 10:04, Chris Murphy (li...@colorremedies.com) wrote:

> f27a386430cc7a27ebd06899d93310fb3bd4cee7
> journald: whenever we rotate a file, btrfs defrag it
>
> Since systemd-journald sets nodatacow on /var/log/journal the journals
> don't really fragment much. I typically see 2-4 extents for the life
> of the journal, depending on how many times it's grown, in what looks
> like 8MiB increments. The defragment isn't really going to make any
> improvement on that, at least not worth submitting it for additional
> writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> small amount of extra writes with no meaningful impact to wear, it
> probably does have an impact on much more low end flash like USB
> sticks, eMMC, and SD Cards. So I figure, let's just drop the
> defragmentation step entirely.

Quite frankly, given how iops-expensive btrfs is, one probably
shouldn't choose btrfs for such small devices anyway. It's really not
where btrfs shines, last time I looked.

> Further, since they are nodatacow, they can't be submitted for
> compression. There was a quasi-bug in Btrfs, now fixed, where
> nodatacow files submitted for defragmentation were compressed. So we no
> longer get that unintended benefit. This strengthens the case to just
> drop the defragment step upon rotation, no other changes.
>
> What do you think?

Did you actually check the iops this generates?

Not sure it's worth doing these kinds of optimizations without any
hard data on how expensive this really is. It would be premature.

That said, if there's actual reason to optimize the iops here then we
could make this smart: there's actually an API for querying
fragmentation: we could defrag only if we notice the fragmentation is
really too high.
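
For what it's worth, the extent count such a check would look at can be
read with the FIEMAP ioctl. A minimal sketch (with the caveat raised
elsewhere in the thread that compressed btrfs files report many small
extents):

    /* Count the extents of a file via FIEMAP. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>                    /* FS_IOC_FIEMAP */
    #include <linux/fiemap.h>                /* struct fiemap, FIEMAP_MAX_OFFSET */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY|O_CLOEXEC);
        if (fd < 0) { perror("open"); return 1; }

        /* With fm_extent_count == 0 the kernel only reports how many
         * extents are mapped, without copying any of them out. */
        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = FIEMAP_MAX_OFFSET;
        fm.fm_extent_count = 0;
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("FIEMAP"); return 1; }

        printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
        return 0;
    }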

But quite frankly, this sounds like polishing things after the horse
has already left the stable: if you want to optimize iops, then don't
use btrfs. If you bought into btrfs, then apparently you are OK with
the extra iops it generates, hence also with the defrag costs.

Lennart

--
Lennart Poettering, Berlin
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-01-26 Thread Chris Murphy
On Tue, Jan 5, 2021 at 10:04 AM Chris Murphy  wrote:
>
> f27a386430cc7a27ebd06899d93310fb3bd4cee7
> journald: whenever we rotate a file, btrfs defrag it
>
> Since systemd-journald sets nodatacow on /var/log/journal the journals
> don't really fragment much. I typically see 2-4 extents for the life
> of the journal, depending on how many times it's grown, in what looks
> like 8MiB increments. The defragment isn't really going to make any
> improvement on that, at least not worth submitting it for additional
> writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> small amount of extra writes with no meaningful impact to wear, it
> probably does have an impact on much more low end flash like USB
> sticks, eMMC, and SD Cards. So I figure, let's just drop the
> defragmentation step entirely.
>
> Further, since they are nodatacow, they can't be submitted for
> compression. There was a quasi-bug in Btrfs, now fixed, where
> nodatacow files submitted for defragmentation were compressed. So we no
> longer get that unintended benefit. This strengthens the case to just
> drop the defragment step upon rotation, no other changes.
>
> What do you think?

A better idea.

Default behavior: journals are nodatacow and are not defragmented.

If '/etc/tmpfiles.d/journal-nocow.conf' exists, do the reverse:
journals are datacow, and files are defragmented (and compressed, if
compression is enabled).
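
For context, tmpfiles.d already provides the switch this keys off: systemd
ships a stock journal-nocow.conf that sets the btrfs NOCOW attribute on the
journal directory, and a same-named file in /etc masks it. Paraphrased from
memory (see tmpfiles.d(5) for the exact stock file, including whether it
uses the recursive attribute type):

    # /usr/lib/tmpfiles.d/journal-nocow.conf (paraphrased): set the NOCOW
    # attribute on the journal directory so new journal files inherit it.
    h /var/log/journal - - - - +C

    # An empty /etc/tmpfiles.d/journal-nocow.conf masks the stock file,
    # leaving journals copy-on-write (datacow).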


-- 
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel