Re: [systemd-devel] Monotonic time went backwards, rotating log

2023-10-06 Thread Phillip Susi
Pekka Paalanen  writes:

> have you checked your boot ID, maybe it's often the same as the previous
> boot?

Good thought, but it doesn't look like it:

IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY
-20 c2a5e3af1f044d79805c4fbdd120beec Wed 2023-05-10 11:17:40 EDT Wed 2023-05-17 08:59:48 EDT
-19 3f20878d92714d09a7928b1d1260074c Wed 2023-05-17 09:30:55 EDT Wed 2023-05-17 09:33:36 EDT
-18 22944fc66fc949048c14a3e9e559824e Wed 2023-05-17 09:34:13 EDT Mon 2023-05-22 15:41:05 EDT
-17 28c59dbf8d13407c8aa89ef2d3b3024c Tue 2023-05-23 08:34:38 EDT Fri 2023-05-26 14:46:36 EDT
-16 e6804a377e984bc499fbc44dd9a14f40 Tue 2023-05-30 08:53:41 EDT Tue 2023-05-30 09:35:00 EDT
-15 d3f4946a0f4e4951961ce62ae88d390c Tue 2023-05-30 09:35:36 EDT Thu 2023-06-01 13:00:15 EDT
-14 9bd458cbf57d458e869d5405c534d549 Thu 2023-06-01 13:01:06 EDT Thu 2023-06-01 13:01:35 EDT
-13 df67fe939f4a434f9beadfe81101e10e Thu 2023-06-01 13:40:22 EDT Tue 2023-06-06 10:56:06 EDT
-12 b860691a4da841e6bd223a4035536ef6 Tue 2023-06-06 10:57:13 EDT Wed 2023-06-21 11:16:51 EDT
-11 1f72e13c3d2542e69abfd5c38d8050fe Fri 2023-06-23 11:41:45 EDT Mon 2023-06-26 15:48:31 EDT
-10 50821dde5780459bbf05d5dffc52ac37 Fri 2023-07-28 15:08:56 EDT Mon 2023-07-31 16:15:21 EDT
 -9 444f76e5a93b422583e2a8089816aafe Mon 2023-07-31 16:16:13 EDT Fri 2023-08-04 13:26:22 EDT
 -8 3ff965b69adb4c51b058cba5dcaa4c09 Tue 2023-08-08 12:50:10 EDT Tue 2023-08-08 12:50:19 EDT
 -7 448ca4a0ef024be0a0dd7ec2b58b1015 Tue 2023-08-08 12:54:43 EDT Thu 2023-08-10 10:17:08 EDT
 -6 68f66c75cf674dd48b6216cb05c9278c Thu 2023-08-24 10:00:56 EDT Thu 2023-08-24 15:09:49 EDT
 -5 184ac41dd9164e1786edc74d19e4cef9 Fri 2023-09-01 09:34:58 EDT Tue 2023-09-12 15:28:40 EDT
 -4 0f7b2c0b1d244769bff218e2933ba46d Mon 2023-09-25 12:04:21 EDT Tue 2023-09-26 10:58:30 EDT
 -3 f1369263334a4c6db183fa7fa61074c6 Tue 2023-09-26 10:59:10 EDT Thu 2023-10-05 13:02:18 EDT
 -2 dea7a07fe6d24ad49ce1841e0260b42e Thu 2023-10-05 13:03:49 EDT Thu 2023-10-05 13:11:00 EDT
 -1 b2d29b9e947942a79303ad6944d7ad31 Thu 2023-10-05 13:11:45 EDT Thu 2023-10-05 13:22:45 EDT
  0 4539fd6a1ddf471e8795345cc3965f44 Thu 2023-10-05 13:23:22 EDT Fri 2023-10-06 13:38:45 EDT
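
( For anyone who wants to reproduce a listing like this locally, it is the
kind of output you get from:

    journalctl --list-boots --no-pager

which prints the same IDX / BOOT ID / FIRST ENTRY / LAST ENTRY columns. )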
  


Re: [systemd-devel] Monotonic time went backwards, rotating log

2023-10-05 Thread Phillip Susi
Phillip Susi  writes:

> Lennart Poettering  writes:
>
>> It actually checks that first:
>>
>> https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-file.c#L2201
>
> That's what I'm saying: it should have noticed that FIRST and not gotten
> to the monotonic time check, but it didn't.

I decided to take another look at this.  It turns out it is also my
system journal that gets rotated on every boot with this message about
the monotonic clock, even though it should already have been rotated
simply because the boot ID changed.

There are some debug prints that might shed light on what is happening,
but I can't seem to get them enabled.  I tried setting
Environment=SYSTEMD_DEBUG_LEVEL=debug on systemd-journald.service, but I
still don't get them.
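
For reference, this is the sort of drop-in I mean ( file name arbitrary;
I am not sure whether the variable journald actually reads is
SYSTEMD_DEBUG_LEVEL or the generic SYSTEMD_LOG_LEVEL, so take the name
with a grain of salt ):

    mkdir -p /etc/systemd/system/systemd-journald.service.d
    cat > /etc/systemd/system/systemd-journald.service.d/debug.conf <<'EOF'
    [Service]
    Environment=SYSTEMD_LOG_LEVEL=debug
    EOF
    systemctl daemon-reload
    systemctl restart systemd-journald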



Re: [systemd-devel] Monotonic time went backwards, rotating log

2023-05-26 Thread Phillip Susi


Lennart Poettering  writes:

> It actually checks that first:
>
> https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-file.c#L2201

That's what I'm saying: it should have noticed that FIRST and not gotten
to the monotonic time check, but it didn't.




Re: [systemd-devel] Monotonic time went backwards, rotating log

2023-05-25 Thread Phillip Susi


Lennart Poettering  writes:

> We want that within each file all records are strictly ordered by all
> clocks, so that we can find specific entries via bisection.

Why *all* clocks?  Even if you want to search on the monotonic time, you
first have to specify a boot ID within which that monotonic time is
valid, don't you?  So the first step in your search would be to find the
boot records, then bisect from there.
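
In journalctl terms that two-step lookup is roughly ( the boot ID below
is just an example ):

    journalctl -b c2a5e3af1f044d79805c4fbdd120beec -o short-monotonic

i.e. pick the boot first, and only then does the monotonic offset mean
anything.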

> The message is debug level, no?

log_ratelimit_info(), which is printed by default.  I see it when I log
in, which I presume is when my session systemd instance starts.  I guess
that's the problem: it should be debug.

Also though, why doesn't it first note that the boot ID changed?


[systemd-devel] Monotonic time went backwards, rotating log

2023-05-23 Thread Phillip Susi
Every time I reboot, when I first log in, journald ( 253.3-r1 )
complains that the monotonic time went backwards, rotating log file.
This appears to happen because journal_file_append_entry_internal()
wishes to enforce strict time ordering within the log file.  I'm not
sure why it cares about the *monotonic* time being in strict order
though, since that will always go backwards when you reboot.  I'm also
not sure why the previous check that the boot ID has changed did not
trigger.
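
( If it helps anyone reproduce this, the ordering fields journald is
comparing can presumably be inspected with something like:

    journalctl --header --file /var/log/journal/<machine-id>/system.journal

which, as far as I recall, dumps the file's boot ID and head/tail
timestamps without printing any entries. )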

If it is intentional that journals be rotated after a reboot, could it
at least be done without complaining about it?



Re: [systemd-devel] how to let systemd hibernate start/stop the swap area?

2023-03-30 Thread Phillip Susi


Michael Chapman  writes:

> What specifically is the difference between:
>
> * swap does not exist at all;
> * swap is full of data that will not be swapped in for weeks or months;

That's the wrong question.  The question is, what is the difference
between having NO swap, and having some swap that you don't use much of?
The answer to that is that there will be a non-zero amount of anonymous
memory allocated to processes that hardly ever touch it, and that memory
can be pushed out to swap to provide more room for, if nothing else,
caching the files that ARE being accessed.  Now that amount may not be
much if you usually have plenty of free RAM, but it won't be zero.

I too have long gone without a swap partition, because the small benefit
of having a little more RAM to cache files did not justify the risk of
going into thrashing mode when some process went haywire.  But if that
problem has been solved, and you want a swap partition for hibernation
anyhow, then you may as well keep it enabled all the time, since
deactivating it when you aren't about to hibernate costs *something* and
gains *nothing*.



Re: [systemd-devel] how to let systemd hibernate start/stop the swap area?

2023-03-30 Thread Phillip Susi


Lennart Poettering  writes:

> oomd/PSI looks at memory allocation latencies to determine memory
> pressure. Since you disallow anonymous memory to be paged out and thus
> increase IO on file backed memory you increase the latencies
> unnecessarily, thus making oomd trigger earlier.

Did this get changed in the last few years?  Because I'm sure it used to
be based on the total commit limit, so the OOM killer wouldn't start
killing until your swap was full, which didn't happen until the system
had already been thrashing itself into uselessness for 20 minutes.

If this has been fixed then I guess it's time for me to start using swap
again.

What happens if you use zswap?  Will hibernation try to save things
there instead of to a real on-disk swap?  It might be nice to have zswap
for normal use and the on-disk swap for hibernation.
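
( Whether zswap is even active is easy to check, assuming the module is
built in:

    cat /sys/module/zswap/parameters/enabled

and the debugfs counters under /sys/kernel/debug/zswap/ show how much it
is actually holding. )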



Re: [systemd-devel] .local searches not working

2021-04-09 Thread Phillip Susi


Silvio Knizek writes:

> So in fact your network is not standard conform. You have to define
> .local as search and routing domain in the configuration of sd-
> resolved.

Interesting... so what are you supposed to name your local, private
domains?  I believe Microsoft used to ( or still does? ) recommend using
.local to name your domain if you don't have a public domain name, so
surely I'm not the first person to run into this.  Why does
systemd-resolved not fall back to DNS if it can't first resolve the name
using mDNS?  That appears to be allowed by the RFC.
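
If I understand the suggested workaround, it amounts to something like
this ( eth0 is just a stand-in for whatever link can reach the corporate
DNS server ):

    resolvectl domain eth0 '~local'
    resolvectl mdns eth0 no

i.e. declare .local a routing domain on that link so queries go to its
DNS server, and optionally turn mDNS off there.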



[systemd-devel] .local searches not working

2021-04-09 Thread Phillip Susi
What special treatment does systemd-resolved give to .local domains?
The corporate Windows network uses a .local domain, and even when I point
systemd-resolved at the domain controller, it fails the query without
bothering to ask the DC, saying:

resolve call failed: No appropriate name servers or networks for name
found



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi


Colin Guthrie writes:

> I think the defaults are more complex than just "each journal file can 
> grow to 128M" no?

Not as far as I can see.

> I mean there is SystemMaxUse= which defaults to 10% of the partition on 
> which journal files live (this is for all journal files, not just the 
> SystemMaxFileSize= which refers to just one file).

That controls when to delete old journals, not when to rotate a
journal.  It looks like you can manually request a rotation, and you can
set time-based rotation, but that defaults to off, so that leaves
rotating once the file reaches the maximum size ( 128M ).
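
For anyone following along, the journald.conf knobs involved are roughly:

    [Journal]
    SystemMaxUse=        # total space cap; drives deletion of archived journals
    SystemMaxFileSize=   # per-file size cap; drives rotation of the active file
    MaxFileSec=          # age-based rotation

and journalctl --rotate forces a rotation by hand.  ( I haven't
double-checked exactly where the 128M default comes from. )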


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi


Colin Guthrie writes:

> Are those journal files suffixed with a ~. Only ~ suffixed journals 
> represent a dirty journal file (i.e. from an unexpected shutdown).

Nope.

> Journals rotate for other reason too (e.g. user request, overall space 
> requirements etc.) which might explain this wasted space?

I've made no requests to rotate, and my config is the default, which
afaics means it only rotates when the log hits the maximum size of
128MB.  Thus I wouldn't expect to really see any holes in the log,
especially in the middle.



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-11 Thread Phillip Susi

Phillip Susi writes:

> Wait, what do you mean the inode nr changes?  I thought the whole point
> of the block donating thing was that you get a contiguous set of blocks
> in the new file, then transfer those blocks back to the old inode so
> that the inode number and timestamps of the file don't change.

I just tested this with e4defrag and the inode nr does not change.
Oddly, it refused to improve my archived journals, which had 12-15
fragments.  I finally found /var/log/btmp.1, which despite being less
than 8 MB had several hundred fragments.  e4defrag got it down to 1
fragment, but for some reason it is still described by 3 separate
entries in the extent tree.

Looking at the archived journals though, I wonder why I am seeing so
many unwritten areas.  Just the last extent of this file has nearly 4 MB
that were never written to.  This system has never had an unexpected
shutdown.  Attached is the extent map.
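
( The map is filefrag -v output; to compare against your own journals,
something like:

    filefrag -v /var/log/journal/<machine-id>/system@*.journal

should produce the same kind of listing. )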

Filesystem type is: ef53
File size of system@13a67b4b418d4869b37247eda6ebe494-00151338-0005b9ee46d7d4a9.journal is 117440512 (28672 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0:1712667..   1712667:  1:
   1:1..2047:1591168..   1593214:   2047:1712668:
   2: 2048..2132:3012608..   3012692: 85:1593215:
   3: 2133..2139:3012693..   3012699:  7: unwritten
   4: 2140..4095:3012700..   3014655:   1956:
   5: 4096..6143:3041280..   3043327:   2048:3014656:
   6: 6144..8191:3010560..   3012607:   2048:3043328:
   7: 8192..9011:3002368..   3003187:820:3012608:
   8: 9012..9013:3003188..   3003189:  2: unwritten
   9: 9014..   10239:3003190..   3004415:   1226:
  10:10240..   11255:3024896..   3025911:   1016:3004416:
  11:11256..   11268:3025912..   3025924: 13: unwritten
  12:11269..   11348:3025925..   3026004: 80:
  13:11349..   11352:3026005..   3026008:  4: unwritten
  14:11353..   11360:3026009..   3026016:  8:
  15:11361..   11364:3026017..   3026020:  4: unwritten
  16:11365..   11373:3026021..   3026029:  9:
  17:11374..   11376:3026030..   3026032:  3: unwritten
  18:11377..   11642:3026033..   3026298:266:
  19:11643..   11688:3026299..   3026344: 46: unwritten
  20:11689..   11961:3026345..   3026617:273:
  21:11962..   11962:3026618..   3026618:  1: unwritten
  22:11963..   12287:3026619..   3026943:325:
  23:12288..   12347:3033088..   3033147: 60:3026944:
  24:12348..   12381:3033148..   3033181: 34: unwritten
  25:12382..   12466:3033182..   3033266: 85:
  26:12467..   12503:3033267..   3033303: 37: unwritten
  27:12504..   13007:3033304..   3033807:504:
  28:13008..   13024:3033808..   3033824: 17: unwritten
  29:13025..   13044:3033825..   3033844: 20:
  30:13045..   13061:3033845..   3033861: 17: unwritten
  31:13062..   13081:3033862..   3033881: 20:
  32:13082..   13098:3033882..   3033898: 17: unwritten
  33:13099..   13642:3033899..   3034442:544:
  34:13643..   13648:3034443..   3034448:  6: unwritten
  35:13649..   13655:3034449..   3034455:  7:
  36:13656..   13660:3034456..   3034460:  5: unwritten
  37:13661..   13667:3034461..   3034467:  7:
  38:13668..   13673:3034468..   3034473:  6: unwritten
  39:13674..   13680:3034474..   3034480:  7:
  40:13681..   13685:3034481..   3034485:  5: unwritten
  41:13686..   13692:3034486..   3034492:  7:
  42:13693..   13698:3034493..   3034498:  6: unwritten
  43:13699..   14276:3034499..   3035076:578:
  44:14277..   14277:3035077..   3035077:  1: unwritten
  45:14278..   14458:3035078..   3035258:181:
  46:14459..   14529:3035259..   3035329: 71: unwritten
  47:14530..   14570:3035330..   3035370: 41:
  48:14571..   14641:3035371..   3035441: 71: unwritten
  49:14642..   14928:3035442..   3035728:287:
  50:14929..   15002:3035729..   3035802: 74: unwritten
  51:15003..   15837

Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-10 Thread Phillip Susi


Lennart Poettering writes:

> inode, and then donate the old blocks over. This means the inode nr
> changes, which is something I don't like. Semantically it's only
> marginally better than just creating a new file from scratch.

Wait, what do you mean the inode nr changes?  I thought the whole point
of the block donating thing was that you get a contiguous set of blocks
in the new file, then transfer those blocks back to the old inode so
that the inode number and timestamps of the file don't change.


Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs

2021-02-10 Thread Phillip Susi


Chris Murphy writes:

> It's not interleaving. It uses delayed allocation to make random
> writes into sequential writes. It's tries harder to keep file blocks

Yes, and when you do that, you are interleaving data from multiple files
into a single stream, which you really shouldn't be doing.  IIRC, XFS
has special I/O streaming modes specifically designed to *prevent* this
from happening and to record multiple video streams simultaneously to
different parts of the disk, to keep them from being fragmented to hell
like that.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-09 Thread Phillip Susi


Chris Murphy writes:

> And I agree 8MB isn't a big deal. Does anyone complain about journal
> fragmentation on ext4 or xfs? If not, then we come full circle to my
> second email in the thread which is don't defragment when nodatacow,
> only defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and
> specify 8MB length. That does seem to consistently no op on nodatacow
> journals which have 8MB extents.

Ok, I agree there.
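
( On the command line that should correspond to something like

    btrfs filesystem defragment -l 8M <journal file>

assuming -l maps onto the length field of BTRFS_IOC_DEFRAG_RANGE the way
I think it does. )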

> The reason I'm dismissive is because the nodatacow fragment case is
> the same as ext4 and XFS; the datacow fragment case is both
> spectacular and non-deterministic. The workload will matter where

Your argument seems to be that it's no worse than ext4 and so if we
don't defrag there, why on btrfs?  Lennart seems to be arguing that the
only reason systemd doesn't defrag on ext4 is because the ioctl is
harder to use.  Maybe it should defrag as well, so he's asking for
actual performance data to evaluate whether the defrag is pointless or
whether maybe ext4 should also start doing a defrag.  At least I think
that's his point.  Personally I agree ( and showed the calculations in a
previous post ) that 8 MB/fragment is only going to have a negligible
impact on performance and so isn't worth bothering with a defrag, but he
has asked for real-world data...

> And also, only defragmenting on rotation strikes me as leaving
> performance on the table, right? If there is concern about fragmented

No, because fragmentation only causes additional latency on HDD, not SSD.

> But it sounds to me like you want to learn what the performance is of
> journals defragmented with BTFS_IOC_DEFRAG specifically? I don't think
> it's interesting because you're still better off leaving nodatacow
> journals alone, and something still has to be done in the datacow

Except that you're not.  Your definition of "better off" appears to
apply only to SSDs, and only because there it is preferable to have
fewer writes rather than less fragmentation.  On HDD, defragmenting is a
good thing.  Lennart seems to want real-world performance data to
evaluate just *how* good and whether it's worth the bother, at least for
HDDs.  For SSDs, I believe he agreed that it may as well be shut off
since it provides no benefit there, but your patch kills it on HDDs as
well.

> Is there a test mode for journald to just dump a bunch of random stuff
> into the journal to age it? I don't want to wait weeks to get a dozen
> journal files.

The cause of the fragmentation is slowly appending to the file over
time, so if you dump a bunch of data in too quickly you would eliminate
the fragmentation.  You might try:

while true ; do logger "This is a test log message to act as filler" ; sleep 1 ; done

to speed things up a little bit.


Re: [systemd-devel] [EXT] Re: consider dropping defrag of journals on btrfs

2021-02-09 Thread Phillip Susi


Chris Murphy writes:

> Basically correct. It will merge random writes such that they become
> sequential writes. But it means inserts/appends/overwrites for a file
> won't be located with the original extents.

Wait, I thought that was only true for metadata, not normal file data
blocks?  Well, maybe it becomes true for normal data if you enable
compression, or for small files that get leaf-packed into the metadata
chunk.

If it's really combining streaming writes from two different files into
a single interleaved write to the disk, that would be really silly.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Phillip Susi


Chris Murphy writes:

> I showed that the archived journals have way more fragmentation than
> active journals. And the fragments in active journals are
> insignificant, and can even be reduced by fully allocating the journal

Then clearly this is a problem with btrfs: it absolutely should not be
making the files more fragmented when asked to defrag them.

> file to final size rather than appending - which has a good chance of
> fragmenting the file on any file system, not just Btrfs.

And yet, you just said the active journal had minimal fragmentation.
That seems to mean that the 8 MB fallocates that journald does are
working well.  Sure, you could probably get fewer fragments by
fallocating the whole 128 MB at once, but there are tradeoffs to that
which are not worth it.  One fragment per 8 MB isn't a big deal.
Ideally a filesystem will manage to do better than that ( didn't btrfs
have a persistent reservation system for this purpose? ), but it
certainly should not commonly do worse.

> Further, even *despite* this worse fragmentation of the archived
> journals, bcc-tools fileslower shows no meaningful latency as a
> result. I wrote this in the previous email. I don't understand what
> you want me to show you.

*Of course* it showed no meaningful latency, because you did the test on
an SSD, which has no meaningful latency penalty from fragmentation.
The question is how bad it is on an HDD.

> And since journald offers no ability to disable the defragment on
> Btrfs, I can't really do a longer term A/B comparison can I?

You proposed a patch to disable it.  Test before and after the patch.

> I did provide data. That you don't like what the data shows: archived
> journals have more fragments than active journals, is not my fault.
> The existing "optimization" is making things worse, in addition to
> adding a pile of unnecessary writes upon journal rotation.

If it is making things worse, that is definitely a bug in btrfs.  It
might be nice to avoid the writes on SSDs though, since there is no
benefit there.

> Conversely, you have not provided data proving that nodatacow
> fallocated files on Btrfs are any more fragmented than fallocated
> files on ext4 or XFS.

That's a fair point: if btrfs isn't any worse than other filesystems,
then why is it the only one that gets a defrag?



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-08 Thread Phillip Susi


Chris Murphy writes:

>> It sounds like you are arguing that it is better to do the wrong thing
>> on all SSDs rather than do the right thing on ones that aren't broken.
>
> No I'm suggesting there isn't currently a way to isolate
> defragmentation to just HDDs.

Yes, but it sounded like you were suggesting that we shouldn't even try,
not just that it isn't 100% accurate.  Sure, some SSDs will be stupid
and report that they are rotational, but most aren't stupid, so it's a
good idea to disable the defragmentation on drives that report that they
are non-rotational.
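
( The check itself is trivial, for example:

    cat /sys/block/sda/queue/rotational    # 1 = claims to rotate, 0 = claims not to
    lsblk -d -o NAME,ROTA

with sda standing in for whatever device the journal lives on; the
objection is only that some devices lie and report 1. )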


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> journalctl gives you one long continues log stream, joining everything
> available, archived or not into one big interleaved stream.

If you ask for everything, yes... but if you run journalctl -b then
shouldn't it only read back until it finds the start of the current
boot?


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Maksim Fomin writes:
> I would say it depends on whether defragmentation issues are feature
> of btrfs. As Chris mentioned, if root fs is snapshotted,
> 'defragmenting' the journal can actually increase fragmentation. This
> is an example when the problem is caused by a feature (not a bug) in
> btrfs. For example, my 'system.journal' file is currently 16 MB and
> according to filefrag it has 1608 extents (consequence of snapshotted
> rootfs?). It looks too much, if I am not missing some technical

Holy smokes!  How did btrfs manage to butcher that poor file that badly?
It shouldn't be possible for it to be *that* bad.  I mean, that's an
average of only about 10 KB per fragment!


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Dave Howorth writes:

> PS I'm subscribed to the list. I don't need a copy.

FYI, rather than ask others to go out of their way when replying to you,
you should configure your mail client to set the Reply-To: header to
point to the mailing list address so that other people's mail clients do
what you want automatically.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> Nope. We always interleave stuff. We currently open all journal files
> in parallel. The system one and the per-user ones, the current ones
> and the archived ones.

Wait... every time you look at the journal at all, it has to read back
through ALL of the archived journals, even if you are only interested in
information since the last boot that just happened 5 minutes ago?



Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Lennart Poettering writes:

> You are focussing only on the one-time iops generated during archival,
> and are ignoring the extra latency during access that fragmented files
> cost. Show me that the iops reduction during the one-time operation
> matters and the extra latency during access doesn't matter and we can
> look into making changes. But without anything resembling any form of
> profiling we are just blind people in the fog...

I'm curious why you seem to think that latency accessing old logs is so
important.  I would think that old logs tend to be accessed very
rarely.  On such a rare occasion, a few extra milliseconds don't seem
very important to me.  Even if it's on a 5400 RPM drive, typical seek
latency is what, 8 ms?  Even with a fragment every 8 MB, that only adds
up to an extra 128 ms to read and parse a 128 MB log file.  Even with no
fragments it's going to take over 1 second to read that file, so we're
only talking about a ~11% slowdown here, on an operation that is rare,
and you're going to spend far more time actually looking at the log than
it took to read it off the disk.
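
Back of the envelope, assuming ~8 ms per seek and roughly 110 MB/s of
sequential throughput ( both numbers are assumptions, not measurements ):

    fragments  = 128 MB / 8 MB per fragment = 16
    seek cost  = 16 seeks x 8 ms            = ~128 ms
    read time  = 128 MB / 110 MB/s          = ~1.2 s
    slowdown   = 0.128 s / 1.2 s            = ~11%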


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-05 Thread Phillip Susi


Chris Murphy writes:

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Wait, doesn't it just create a new file, fallocate the whole thing, copy
the contents, and delete the original?  How can that possibly make
fragmentation *worse*?

> All of those archived files have more fragments (post defrag) than
> they had when they were active. And here is the FIEMAP for the 96MB
> file which has 92 fragments.

How the heck did you end up with nearly one fragment per MB?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that matter).

Wouldn't that mean that when you take snapshots, they don't include the
logs?  That seems like an anti-feature that violates the principle of
least surprise.  If I make a snapshot of my root, I *expect* it to
contain my logs.

> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.
>
> I don't have any hard drives to test this on. This is what, 10% of the
> market at this point? The best you can do there is the same as on SSD.

The above sounded like great data, but not if it was done on an SSD.  Of
course it doesn't cause latency on an SSD.  I don't know about market
trends, but I stopped trusting my data to SSDs a few years ago, when my
ext4 fs kept being corrupted and it appeared that the FTL of the drive
was randomly swapping the contents of different sectors around: I found
things like the contents of a text file in a block of the inode table or
a directory.

> You can't depend on sysfs to conditionally do defragmentation on only
> rotational media, too many fragile media claim to be rotating.

It sounds like you are arguing that it is better to do the wrong thing
on all SSDs rather than do the right thing on ones that aren't broken.

> Looking at the two original commits, I think they were always in
> conflict with each other, happening within months of each other. They
> are independent ways of dealing with the same problem, where only one
> of them is needed. And the best of the two is fallocate+nodatacow
> which makes the journals behave the same as on ext4 where you also
> don't do defragmentation.

This makes sense.


Re: [systemd-devel] consider dropping defrag of journals on btrfs

2021-02-04 Thread Phillip Susi


Lennart Poettering writes:

> Well, at least on my system here there are still like 20 fragments per
> file. That's not nothin?

In a 100 mb file?  It could be better, but I very much doubt you're
going to notice a difference after defragmenting that.  I may be the nut
that rescued the old ext2 defrag utility from the dustbin of history,
but even I have to admit that it isn't really important to use and there
is a reasson why the linux community abandoned it.


Re: [systemd-devel] Antw: [EXT] emergency shutdown, don't wait for timeouts

2021-01-04 Thread Phillip Susi


Reindl Harald writes:

> i have seen "user manager" instances hanging for way too long and way 
> more than 3 minutes over the last 10 years

The default timeout is 3 minutes iirc, so at that point it should be
forcibly killed.


Re: [systemd-devel] Antw: [EXT] emergency shutdown, don't wait for timeouts

2021-01-04 Thread Phillip Susi


Reindl Harald writes:

> topic missed - it makes no difference if it can hold the power 3 
> minutes, 3 hours or even 3 days at the point where it decides "i need to 
> shutdown everything because the battery goes empty"

It is that point that really should be at least 3 minutes before power
fails.  As long as the battery lasts for at least 3 minutes, then the
monitoring daemon should easily be able to begin the shutdown when 3
minutes remain.

I'm not sure that forcibly killing services to quickly shut down is
really much better than the sudden power loss you are trying to avoid.


Re: [systemd-devel] ssh.service in rescue.target

2020-11-09 Thread Phillip Susi


Simon McVittie writes:

> The Debian/Ubuntu package for systemd already masks various services
> that are superseded by something in systemd, such as procps.service and
> rcS.service. It used to also mask all the services from initscripts,
> but that seems to have been dropped in version 243-5.

Ahh, that explains why it seems to be implicitly masked on 18.04.

> Perhaps the systemd Debian/Ubuntu package still needs to mask rc1 services
> like killprocs, or perhaps the initscripts package should take over

Sounds like it.

>> initramfs-tools does not depend on initscripts, but *breaks* it, which
>> should mean it is not possible for both packages to be installed at the
>> same time.
>
> initramfs-tools only Breaks initscripts (<< 2.88dsf-59.3~), which means
> it is possible for both to be installed at the same time, as long as
> initscripts is at a sufficiently new version.

Yes, but why is it listed as *depending* on initscripts when it only
breaks it ( and only an older version at that )?


Re: [systemd-devel] ssh.service in rescue.target

2020-11-09 Thread Phillip Susi


Michael Biebl writes:

> Are you sure?
> Which Ubuntu version is that?
> At least in Debian, /etc/init.d/killprocs is shipped by "initscripts"
> which is no longer installed by default.

20.04.  apt-cache rdepends shows:

Reverse Depends:
  sysv-rc
  util-linux
  hostapd
  wpasupplicant
  util-linux
  initramfs-tools
  base-files
  hostapd
  wpasupplicant
  sysvinit-utils
  initramfs-tools
  base-files
  console-setup-linux


So it looks like it's a required package.  I guess I'll try masking it.

Hrm... odd... I wondered why util-linux would depend on
initscripts... apt-cache depends util-linux says that it does not
*depend* on it but *replaces* it.  Doesn't that mean that when
util-linux is installed, initscripts should be removed?  And
initramfs-tools does not depend on initscripts, but *breaks* it, which
should mean it is not possible for both packages to be installed at the
same time.  WTF over?


Re: [systemd-devel] ssh.service in rescue.target

2020-11-06 Thread Phillip Susi


Lennart Poettering writes:

> Are you running systemd? If so, please get rid of "killproc". It will
> interfere with systemd's service management.

I see... apparently Ubuntu still has it around.  How does systemd handle
it?  For instance, if a user logged in and forked off a background
process, how does systemd make sure it gets killed when isolating to
rescue.target?  Does it decide that it is still connected to ssh.service
and so won't kill it when isolating?  I'd like to make sure anything
like that is killed and maybe restart sshd if needed.
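
( In case it helps, which unit a stray background process actually
landed in can be checked with something along the lines of:

    systemctl status <pid>
    ps -o pid,unit,cmd -p <pid>

assuming a ps built with systemd support; whatever systemd does on
isolate should follow from that unit or session scope. )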





Re: [systemd-devel] ssh.service in rescue.target

2020-11-06 Thread Phillip Susi


Lennart Poettering writes:

> What is "killprocs"?
>
> Is something killing services behind systemd's back? What's that
> about?

It's the thing that kills all remaining processes right before shutdown,
which we've had since sysvinit?  And also when isolating, I suppose.



Re: [systemd-devel] ssh.service in rescue.target

2020-11-02 Thread Phillip Susi


Lennart Poettering writes:

> Look at the logs?
>
> if they are not immeidately helpful, consider turning on debug logging
> in systemd first, and then redoing the action and then looking at the
> logs. You can use "systemd-analyze log-level debug" to turn on debug
> logging in PID 1 any time.

It appears that systemd decides that ssh.service should remain running,
removes the redundant start job since it is already running, but
killprocs sends sshd a SIGTERM, so it shuts down, and systemd decides
not to restart it.  iirc, there was a list of pids that would NOT be
killed at that stage... it appears that the pid for ssh.service isn't
getting placed in that list.  How did that work again?


[systemd-devel] ssh.service in rescue.target

2020-10-29 Thread Phillip Susi
I used to just have to add-wants ssh.service to rescue.target and I
could isolate to rescue mode for remote system maintenance without
losing remote access to the server.  After an upgrade, even though
ssh.service is wanted by rescue.target, it is still killed if I
isolate.  How can I figure out why?


Re: [systemd-devel] Hotplug auto mounting and masked mount units

2020-01-10 Thread Phillip Susi


Lennart Poettering writes:

> Can you file a bug about this? Sounds like something to fix.

Sure.


[systemd-devel] Hotplug auto mounting and masked mount units

2020-01-09 Thread Phillip Susi
Someone in #debian mentioned to me that they were getting some odd
errors in their logs when running gparted.  It seems that several years
ago there was someone with a problem caused by systemd auto mounting
filesystems in response to udev events triggered by gparted, and so as a
workaround, gparted masks all mount units.  Curtis Gedeck and I can't
figure out now why this was needed, because we can't seem to get
systemd to automatically mount a filesystem just because its device is
hotplugged.  Are there any circumstances under which systemd will mount
a filesystem when its device is hotplugged?

Also, I'm pretty sure this part is a bug in systemd: any service that
depends on -.mount ( so most of them ) will refuse to start while
-.mount is masked.  It shouldn't matter that it's masked if it is
already mounted, should it?  Only if it isn't mounted does the mask
matter, since then it can't be mounted to satisfy the dependency.
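
( For the record, the gparted-style masking can be reproduced by hand,
e.g.:

    systemctl --runtime mask -- -.mount
    systemctl start foo.service     # refuses to start while -.mount is masked
    systemctl --runtime unmask -- -.mount

where foo.service stands in for any service that pulls in -.mount. )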


Re: [systemd-devel] Inhibiting plug and play

2013-07-16 Thread Phillip Susi

On 7/16/2013 1:23 PM, Lennart Poettering wrote:
> So, Kay suggested we should use BSD file locks for this. i.e. all
> tools which want to turn off events for a device would take one on
> that specific device fd. As long as it is taken udev would not
> generate events. As soon as the BSD lock is released again it would
> recheck the device.
>
> To me this sounds like a pretty clean thing to do. Locks usually
> suck, but for this purpose they appear to do exactly what they
> should, and most of the problematic things with them don't apply in
> this specific case.
>
> Doing things way would be quite robust, as we have clean
> synchronization and the kernel will release the locks automatically
> when the owner dies.
>
> Opinions?

Sounds like it might work.
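
A sketch of what that might look like from a partitioner's point of view
( /dev/sdb and the layout file are just examples ):

    flock --exclusive /dev/sdb --command '
        # udev would hold off processing events for this device
        # until the lock is released
        sfdisk /dev/sdb < new-layout.txt
    '

with udev rechecking the device once the lock goes away, as you describe.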




[systemd-devel] Inhibiting plug and play

2013-06-18 Thread Phillip Susi

Various tools, but most notably partitioners, manipulate disks in such
a way that they need to prevent the rest of the system from racing
with them while they are in the middle of manipulating the disk.
Presently this is done with a hodgepodge of hacks that involve
running some script or executable to temporarily hold off on some
aspects ( typically only auto mounting ) of plug and play processing.
Which one you need depends on whether you are running hal, udisks,
udisks2, or systemd.

There really needs to be a proper way at a lower level, either udev,
or maybe in the kernel, to inhibit processing events until the tool
changing the device has finished completely.  The question is, should
this be in the kernel, or in udev, and what should the interface be?



Re: [systemd-devel] Inhibiting plug and play

2013-06-18 Thread Phillip Susi

On 6/18/2013 2:03 PM, David Zeuthen wrote:
> When I was younger I used to think things like this was a good
> idea and, in fact, did a lot of work to add complex interfaces for
> this in the various components you mention. These interfaces didn't
> really work well, someone would always complain that this or that
> edge-case didn't work. Or some other desktop environment ended up
> not using the interfaces. Or some kernel hacker running twm (with
> carefully selected bits of GNOME or KDE to get automounting) ran
> into problems. It was awful. Just awful.

I can't really extract any meaning from this without knowledge of what
was tried and what problems it caused.  I also don't see why it can't
be something as simple as opening the device with O_EXCL.

> What _did_ turn out to work really well - and what GNOME is using
> today and have been for the last couple of years - is that the
> should_automount flag [1] is set only if, and only if, the device
> the volume is on, has been added within the last five seconds [2].
> It's incredibly simple (and low-tech). And judging from bug
> reports, it works really well.

I don't follow.  You mean udisks delays auto mounting by 5 seconds?
That's not going to help if, for instance, you use gparted to move a
partition to the right.  It first enlarges the partition, which
generates a remove/add event, then starts moving data.  5 seconds
later udisks tries to mount the partition, which very well may succeed
with horrible consequences.

The problem also goes beyond udisks and auto mounting, which is why I
say it really needs to be done either at the udev or kernel level.

For instance, a udev script may identify the new volume as part of a
RAID ( leftover metadata ) and try to attach mdadm to it at the same
time you're running mkfs.  I'm also pretty sure that I have seen the
mdadm udev script race with mdadm itself while you are trying to
create a new RAID volume.

