Re: [systemd-devel] (solved) Re: How to chain services driven by a timer?

2024-04-18 Thread Brian Reichert
On Thu, Apr 18, 2024 at 09:23:29AM -0600, Dan Nicholson wrote:
> Since you likely don't have any units that depend on your service it
> likely doesn't make a big difference. To demonstrate, here's a stupid
> service I created:
> 
> # cat /etc/systemd/system/foo.service
> [Service]
> Type=oneshot
> ExecStart=/bin/echo foo
> 
> With Type=oneshot, the journal output looks like this:
> 
> Apr 17 15:02:50 endless systemd[1]: Starting foo.service...
> Apr 17 15:02:50 endless echo[5390]: foo
> Apr 17 15:02:50 endless systemd[1]: foo.service: Deactivated successfully.
> Apr 17 15:02:50 endless systemd[1]: Finished foo.service.
> 
> With Type=simple, the journal output looks like this:
> 
> Apr 17 14:55:23 endless systemd[1]: Started foo.service.
> Apr 17 14:55:23 endless echo[4482]: foo
> Apr 17 14:55:23 endless systemd[1]: foo.service: Deactivated successfully.
> 
> Notice that in the oneshot case it doesn't reach Finished until after
> Deactivated. In the simple case, it immediately goes into Started. If
> I had a unit with After=foo.service and foo.service had Type=simple,
> that unit would be started before foo.service actually did anything.
> 
> Of more interest to you is logrotate.service, which is Type=oneshot.

(Confirmed, it is.)

> If it was Type=simple, your unit would be started before the logrotate
> command completed, which is probably not what you want.

Thanks for all of the succinct details!  I'm incorporating your advice.
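
(For anyone finding this thread later: a minimal way to watch the ordering
difference is a second unit ordered after foo.service. This bar.service is
hypothetical, purely for illustration:)

  # cat /etc/systemd/system/bar.service
  [Unit]
  Wants=foo.service
  After=foo.service

  [Service]
  Type=oneshot
  ExecStart=/bin/echo bar

With Type=oneshot in foo.service, 'bar' should only appear in the journal
after 'Finished foo.service.'; with Type=simple, it can appear before 'foo'
does.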

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] (solved) Re: How to chain services driven by a timer?

2024-04-18 Thread Brian Reichert
On Wed, Apr 17, 2024 at 03:03:16PM -0600, Dan Nicholson wrote:
> I assume that this is just a script that does some post-processing on
> log files. In that case, I suggest that you use Type=oneshot with
> RemainAfterExit=no (the default). Then the service will actually wait
> until your script completes. Type=simple is expected to be used for a
> service that doesn't exit under normal conditions.

Thanks for the additional feedback; I don't see the harm in trying.

How, forensically, would I see the difference between 'simple' and
'oneshot', in my use case here?
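
(Dan's follow-up, above in this archive, answers that: a oneshot unit logs
Starting.../Finished around its work, while a simple unit goes straight to
Started. One way to see it, assuming the unit is already enabled:)

  systemctl start post-logrotate.service
  journalctl -o short-precise --no-pager -u post-logrotate.service | tail -4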

> --
> Dan

-- 
Brian Reichert  
BSD admin/developer at large


[systemd-devel] (solved) Re: How to chain services driven by a timer?

2024-04-17 Thread Brian Reichert
On Thu, Apr 11, 2024 at 11:14:20AM -0400, Brian Reichert wrote:
> Let me wrap up some testing, and I'll report back if all is successful.

I failed to report back; everything is working as I needed!

I appreciate everyone's help here.

For the record, my new service:

  10-153-68-34:~ # cat /etc/systemd/system/post-logrotate.service
  [Unit]
  Description=Activities after 'logrotate' completes
  
  Requires=logrotate.service
  After=logrotate.service
  
  [Service]
  Type=simple
  
  ExecStart=/usr/local/sbin/post-logrotate
  
  [Install]
  WantedBy=logrotate.service
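
As a quick sanity check that the hook-up took effect (the same checks as
earlier in the thread):

  10-153-68-34:~ # systemctl show -p Wants logrotate.service
  Wants=post-logrotate.service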



-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-11 Thread Brian Reichert
On Thu, Apr 11, 2024 at 04:58:05PM +0300, Andrei Borzenkov wrote:
> There are no ordering dependencies between your services, so they are
> started as soon as possible. if post-rotate.service must be started
> after logrotate.service, it needs
> 
> After=logrotate.service
> 
> This is also needed because otherwise the Requires directive does not
> work as intended.

That does seem to cause the correct behavior!  Thanks!

Let me wrap up some testing, and I'll report back if all is successful.

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-11 Thread Brian Reichert
On Thu, Apr 11, 2024 at 11:16:36AM +0300, Andrei Borzenkov wrote:
> Show full unit definition for both logrotate.service and your service.

Sure:

10-153-68-34:~ # cat /usr/lib/systemd/system/logrotate.service
[Unit]
Description=Rotate log files
Documentation=man:logrotate(8) man:logrotate.conf(5)
ConditionACPower=true

[Service]
Type=oneshot
#ExecStart=/usr/sbin/logrotate /etc/logrotate.conf
ExecStart=/usr/sbin/logrotate -l /var/log/logrotate.log /etc/logrotate.conf
ExecStartPost=/usr/bin/logger 'XXX log rotation completed'
Nice=19
IOSchedulingClass=best-effort
IOSchedulingPriority=7
Environment=HOME=/root

10-153-68-34:~ # cat /etc/systemd/system/post-logrotate.service
[Unit]
Description=Activities after logrotation

Requires=logrotate.service

[Service]
Type=simple

ExecStart=/usr/bin/logger 'XXX post log rotation'

[Install]
WantedBy=logrotate.service

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-10 Thread Brian Reichert
On Wed, Apr 10, 2024 at 01:47:47PM -0600, Dan Nicholson wrote:
> Restarting the timer doesn't make the service run immediately. Are you
> sure logrotate.service has run again since you made this change? Just
> simulate the timer and start logrotate.service again. All the timer
> does is activate the service. For testing you don't need to wait for
> that to happen.

Ok, that is a helpful detail.

Restarting logrotate.service does now cause my post-logrotate.service
to subsequently start.

On a lark, I augmented the stock logrotate.service with some
instrumentation to show me when 'logrotate' completes, in addition to
maintaining a log file:

  #ExecStart=/usr/sbin/logrotate /etc/logrotate.conf
  ExecStart=/usr/sbin/logrotate -l /var/log/logrotate.log /etc/logrotate.conf
  ExecStartPost=/usr/bin/logger 'XXX log rotation completed'

My service is being run (yay!), but I'm wary of the out-of-order
messaging here:

  10-153-68-34:~ # journalctl -o short-precise --no-pager -u logrotate.service -u post-logrotate.service | tail -6
  Apr 10 16:57:54.061053 10-153-68-34 systemd[1]: Started Activities after logrotation.
  Apr 10 16:57:54.061140 10-153-68-34 systemd[1]: Stopped Rotate log files.
  Apr 10 16:57:54.062219 10-153-68-34 systemd[1]: Starting Rotate log files...
  Apr 10 16:57:54.104300 10-153-68-34 root[5899]: XXX post log rotation
  Apr 10 16:57:55.367522 10-153-68-34 root[5903]: XXX log rotation completed
  Apr 10 16:57:55.368789 10-153-68-34 systemd[1]: Started Rotate log files.

And systemctl shows the new post-logrotate.service started slightly
before logrotate.service ended:

  10-153-68-34:~ # systemctl show logrotate.service --property ExecMainExitTimestamp
  ExecMainExitTimestamp=Wed 2024-04-10 16:57:55 EDT
  10-153-68-34:~ # systemctl show post-logrotate.service --property ExecMainStartTimestamp
  ExecMainStartTimestamp=Wed 2024-04-10 16:57:54 EDT

(I really wish I had higher-resolution timestamps here.)
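
(It turns out the *TimestampMonotonic variants of these properties do carry
microsecond resolution, e.g.:)

  systemctl show logrotate.service --property ExecMainExitTimestampMonotonic
  systemctl show post-logrotate.service --property ExecMainStartTimestampMonotonic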

That log file's mtime:

  10-153-68-34:~ # ls -ldtr --full-time /var/log/logrotate.log
  -rw-r--r-- 1 root root 1607975 2024-04-10 16:57:55.094420531 -0400 /var/log/logrotate.log

Hopefully someone here can assure me this is just an artifact of
bookkeeping. I'm specifically trying to avoid doing any work
while logrotate is running.

That I got even this far is really great, so I appreciate all of the
guidance!

> --
> Dan

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-10 Thread Brian Reichert
On Wed, Apr 10, 2024 at 01:29:10PM -0600, Dan Nicholson wrote:
> On Wed, Apr 10, 2024 at 1:21 PM Andrei Borzenkov  wrote:
> Just to be complete, your unit won't be triggered until you see it in
> "systemctl show -p Wants logrotate.service". With
> WantedBy=logrotate.service, you'll also find a symlink to your service
> in /etc/systemd/system/logrotate.service.wants/ once it's enabled.

Ok, double-checking:

  10-153-68-34:~ # systemctl show -p Wants logrotate.service
  Wants=post-logrotate.service
  10-153-68-34:~ # ls -ldtr --full-time /etc/systemd/system/logrotate.service.wants/post-logrotate.service
  lrwxrwxrwx 1 root root 42 2024-04-10 15:26:19.187115411 -0400 /etc/systemd/system/logrotate.service.wants/post-logrotate.service -> /etc/systemd/system/post-logrotate.service

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-10 Thread Brian Reichert
On Wed, Apr 10, 2024 at 10:21:32PM +0300, Andrei Borzenkov wrote:
> On 10.04.2024 22:04, Brian Reichert wrote:
> >   [Install]
> >   WantedBy=logrotate.service
> >
> 
> Links in [Install] section are created by "systemctl enable".

I could have sworn I did this, but did so (again) just to be sure:

  10-153-68-34:~ # systemctl enable post-logrotate.service
  Created symlink from /etc/systemd/system/logrotate.service.wants/post-logrotate.service to /etc/systemd/system/post-logrotate.service.
  
  10-153-68-34:~ # systemctl restart logrotate.timer
  
  10-153-68-34:~ # systemctl status logrotate.service
  ● logrotate.service - Rotate log files
     Loaded: loaded (/usr/lib/systemd/system/logrotate.service; static; vendor preset: disabled)
     Active: inactive (dead) since Wed 2024-04-10 14:58:31 EDT; 28min ago
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
   Main PID: 17686 (code=exited, status=0/SUCCESS)

  Apr 10 14:58:29 10-153-68-34 systemd[1]: Starting Rotate log files...
  Apr 10 14:58:31 10-153-68-34 systemd[1]: Started Rotate log files.

  10-153-68-34:~ # systemctl status post-logrotate.service
  ● post-logrotate.service - Activities after logrotation
     Loaded: loaded (/etc/systemd/system/post-logrotate.service; enabled; vendor preset: disabled)
     Active: inactive (dead)

I don't see post-logrotate.service as having been run.

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] How to chain services driven by a timer?

2024-04-10 Thread Brian Reichert
On Wed, Apr 10, 2024 at 09:06:09AM -0600, Dan Nicholson wrote:
> On Wed, Apr 10, 2024 at 8:50 AM Brian Reichert  wrote:
> >
> > My current service file:
> >
> >   [Unit]
> >   Description=Activities after logrotation
> >
> >   Requires=logrotate.service
> >   Wants=logrotate.service
> >   After=logrotate.service
> >
> >   [Service]
> >   #Type=oneshot
> >   Type=simple
> >
> >   ExecStart=/usr/bin/logger 'XXX post log rotation'
> >
> >   [Install]
> >   WantedBy=timers.target
> 
> The critical part is WantedBy=logrotate.service. In other words, when
> logrotate.service is activated, you want it to also activate your
> service. Then After=logrotate.service above will ensure your service
> starts after it completes. The Requires and Wants above are
> conflicting. You only want one or the other, but I'd probably put it
> as Requires=logrotate.service. That way your unit won't start if
> logrotate.service fails.

Thanks to you and  for your advice. I think I've
correctly incorporated your suggestions, but I still can't seem to get
things to work.

Perhaps my method of testing is flawed.

My current service:

  [Unit]
  Description=Activities after logrotation
  
  Requires=logrotate.service
  
  [Service]
  Type=simple
  
  ExecStart=/usr/bin/logger 'XXX post log rotation'
  
  [Install]
  WantedBy=logrotate.service

I tried, variously, to no apparent effect:

  systemctl restart logrotate.timer

  systemctl start logrotate.service

How should I be testing this?
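
(Answering myself for the archives, per Dan's reply above: don't wait on the
timer; start the service directly and watch the journal, e.g.:)

  systemctl start logrotate.service
  journalctl --no-pager -u post-logrotate.service | tail -2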

-- 
Brian Reichert  
BSD admin/developer at large


[systemd-devel] How to chain services driven by a timer?

2024-04-10 Thread Brian Reichert
My goal is to implement a service that runs after logrotate.service
completes.

logrotate.service is triggered by a timer logrotate.timer.

I don't want to modify either of logrotate.service or logrotate.timer,
as they are provided by the OS vendor (SLES 12 SP5, in my case).

I've tried to apply advice I've seen in misc. forums, e.g.:

  
  https://stackoverflow.com/questions/76314129/how-do-i-configure-systemd-timers-to-run-three-separate-tasks-one-after-the-next

  
  https://stackoverflow.com/questions/70645559/execute-systemd-service-just-after-another-using-a-timer

But I can't seem to get my service to fire.

The version of systemd on this distribution:

  10-153-68-34:~ # rpm -q systemd
  systemd-228-157.57.1.x86_64

I'd appreciate any guidance.

My current service file:

  [Unit]
  Description=Activities after logrotation
  
  Requires=logrotate.service
  Wants=logrotate.service
  After=logrotate.service
  
  [Service]
  #Type=oneshot
  Type=simple
  
  ExecStart=/usr/bin/logger 'XXX post log rotation'
  
  [Install]
  WantedBy=timers.target

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] Root remaining read-only after boot, no obvious reason why

2023-05-16 Thread Brian Reichert
On Tue, May 16, 2023 at 05:26:00PM +, Dave Re wrote:

This is not a response to your issue, but I wanted to reach out and say that
I also work at Imprivata, in the Spirit group. :)

  breich...@imprivata.com

And I'm stuck with systemd for the now-ancient SLES12 distribution...

And I have a separate CentOS 7 distribution that I'm keeping alive, as
well...

We could share notes, at some point...

> Dave Re
> Manager, DevOps Engineering
> www.imprivata.com
> 20 CityPoint, 6th Floor
> 480 Totten Pond Road
> Waltham, MA  02451

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] dependent services status

2022-11-17 Thread Brian Reichert
On Thu, Nov 17, 2022 at 08:52:00AM -0600, Ted Toth wrote:
> I have a set of services that depend on each other however when
> services are started and considered 'active' that does not necessarily
> mean they are in a state that a dependent service requires them to be
> in to operate properly (for example an inotify watch has been
> established). systemd services, I think,  have a substate, is there a
> way I can set that to a custom value to indicate the services idea of
> its own state?

For any given service, I sometimes introduce a ExecStartPost that would
block until some resource is consumed (network port opened, lock created,
whatever).

Perhaps not the most efficient, but that's how I've enforced 'really
up'.
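
A sketch of that pattern; the daemon path, port, and polling interval here
are hypothetical placeholders:

  [Service]
  Type=simple
  ExecStart=/usr/local/sbin/mydaemon
  # Block until the daemon's port accepts connections; units ordered
  # After= this one won't proceed until ExecStartPost completes.
  ExecStartPost=/bin/sh -c 'until nc -z 127.0.0.1 8080; do sleep 0.2; done'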

> Ted

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] Antw: [EXT] [systemd-devel] starting networking from within single user mode?

2022-11-14 Thread Brian Reichert
On Mon, Nov 14, 2022 at 07:57:21AM +0100, Ulrich Windl wrote:
> Unless you used the options to ignore dependencies, that would mean that
> either the dependencies were not correct in the RPM packages, or some uninstall
> scripts were not. Both would be bugs.

My organization is running some weird upgrade process that does
more than zypper does.  I'm certain the bugginess is in our weird
process.  We are honoring dependencies, but clearly something else
is awry.

Characterizing the effects of the bugginess is hard, which is what
spurred my original question.

> However: When you used SUSE's standard installation using BtrFS, you should
> have been able to boot a recent snapshot.

The codebase I inherited does not employ BtrFS. I have a large pile
of frustration about how we do things, but that's our mess, and not
related to systemd.

I do appreciate your feedback on the matter, nonetheless.

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] Antw: [EXT] [systemd-devel] starting networking from within single user mode?

2022-11-11 Thread Brian Reichert
On Fri, Nov 11, 2022 at 08:02:00AM +0100, Ulrich Windl wrote:
> >>> Brian Reichert  wrote on 10.11.2022 at 23:04 in
> message <20221110220426.ga17...@numachi.com>:
> > I've managed to hose a SLES12 SP5 host; it starts to boot, then hangs.
> 
> And what did you do to mess it up? And what do the boot messages say?

A good question, and not specific to systemd, so I don't want to
pollute the list archives too much on this matter.

'All' I did was remove many RPMs that I arbitrarily deemed
unnecessary.

I came up with a heavily trimmed-down list of SLES RPMs for my SLES12
SP5 environment.

I successfully installed a server using just that trimmed-down list;
yay me!

I then explored 'upgrading' a running (slight older) SP5 box, using
this trimmed-down list.  A purposeful side effect was to uninstall
RPMs not in that trimmed-down list.

This latter box begins to boot, and gets at least as far as loading
the initrd image, before hanging.

I'm pretty certain there's something mismanaged with replacing the
kernel, but not properly managing all of the related boot files
(kdump? device probing? etc.)

Anyway, that's my mess.  Not at all related to systemd, near as I
can tell.  I just have to methodically narrow down on where my
process jumps the tracks.

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] starting networking from within single user mode?

2022-11-11 Thread Brian Reichert
On Fri, Nov 11, 2022 at 08:08:58AM +0200, Mantas Mikulėnas wrote:
> Boot with either "s" (aka "single" aka "rescue") or "-b" (aka "emergency")
> for two variants of single-user mode with init. The former starts some
> basic stuff (it's the real single-user mode) including udev so that modules
> for your network interfaces still get loaded automatically, while the
> latter doesn't start anything except init and a shell (emergency mode is
> *almost* like init=/bin/sh but in theory might at least let you `systemctl
> start` something).

I was able to get into the emergency target, using these notes:

  https://suay.site/?p=1681=noscript

The speed bump this article helped me with was to overcome systemd's
misconception that the root account was locked.

- it was not locked; verified with 'passwd -S root'
- root did have a password (known to me)

Anyway, I now am at a more functional command line, and I appreciate
everyone's patience.

> If udev is not running, try to `modprobe` whichever drivers you need for
> the Ethernet interface. (The name can be found by PCI ID, e.g. for
> 10ec:8136 "grep -i 10EC.*8136 /lib/modules/`uname -r`/modules.alias") Then
> manually bring eth0 up, add the IP address, add a default route (dhclient
> or dhcpcd will also work without udev, while systemd-networkd probably
> won't).
> 
> ip link set eth0 up
> ip addr add 192.168.1.55/24 dev eth0
> ip route add default via 192.168.1.1

These, in isolation, are useful notes.  It's been way too many years since
I had to rescue a failing-to-boot Linux server...

> -- 
> Mantas Mikulėnas

-- 
Brian Reichert  
BSD admin/developer at large


[systemd-devel] starting networking from within single user mode?

2022-11-10 Thread Brian Reichert
I've managed to hose a SLES12 SP5 host; it starts to boot, then hangs.

If I get it into single-user mode (getting into the grub menu, and adding
init=/bin/bash) I can at least review the file system.

What I want to do is get networking running, so that I can at least gather
logs, etc.

When I try to start networking with 'systemctl', I see this error:

systemd "failed to connect to bus; No such file or directory"

What can I do to minimally bring up the networking service? I don't even
have any network devices at this point...

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] [EXT] Finding network interface name in different distro

2022-10-18 Thread Brian Reichert
> > :
> > > Hi All,
> > >
> > > When changing distro or distro major versions, network interfaces'
> > > names sometimes change.
> > > For example on some Dell server running CentOS 7 the interface is
> > > named em1 and running Alma 8 it's eno1.

This doesn't answer the OP's question, but my trick for enumerating
network devices was to use something like:

  egrep -v -e "lo:" /proc/net/dev | grep ':' | cut -d: -f1

to get a list of non-loopback interfaces.
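
For what it's worth, sysfs offers the same list without parsing
/proc/net/dev (assuming sysfs is mounted):

  ls /sys/class/net | grep -v '^lo$'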

In my case, I went on to bury everything under a single bond0
interface, so a) no software had to guess a NIC name, and b) in the
case of physical cabling, they would all Just Work.

This was work done in my kickstart file, and worked through many
releases of Red Hat and CentOS.

I adopted this tactic as Dell kept switching up how they would
probe/name devices...

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] systemd service causing bash to miss signals?

2022-09-19 Thread Brian Reichert
On Mon, Sep 19, 2022 at 08:25:32PM +0300, Mantas Mikulėnas wrote:
> Pipelines somewhat rely on the kernel delivering SIGPIPE to the writer as
> soon as the read end is closed. So if you have `foo | head -1`, then as
> soon as head reads enough and exits, foo gets killed via SIGPIPE.

In my case:

  cat /dev/urandom|tr -dc "a-zA-Z0-9"|fold -w 64|head -1

'fold' _is_ getting the SIGPIPE.  It won't get killed if it has a handler.

> But as
> most systemd-managed services aren't shell interpreters, systemd marks
> SIGPIPE as "ignored" when starting the service process, so that if the
> service is somehow tricked into opening a pipe that a user has mkfifo'd, at
> least the kernel can't be tricked into killing the service. You can opt out
> of this using IgnoreSIGPIPE=.

Ah, based on your explanation, I see this, which almost exactly
matches my situation.

  https://stackoverflow.com/a/44376786

For me, the key takeaway:

  However, when the pipeline is run under systemd, systemd sets the
  default action for SIGPIPE to SIG_IGN, which makes the processes
  in the pipeline ignore the signal.

For the archives, I can confirm that putting IgnoreSIGPIPE=false under
[Service] indeed allows my example to work correctly.
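
In unit-file terms, that's just (a sketch; only the IgnoreSIGPIPE= line is
the change):

  [Service]
  Type=oneshot
  IgnoreSIGPIPE=false
  ExecStart=/root/random_str.pl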

-- 
Brian Reichert  
BSD admin/developer at large


[systemd-devel] systemd service causing bash to miss signals?

2022-09-19 Thread Brian Reichert
I apologize for the vague subject.

The background: I've inherited some legacy software to manage.

This is on SLES12 SP5, running:

systemd-228-157.40.1.x86_64

One element is a systemd-managed service, written in Perl, that in
turn, is using bash to generate random numbers (don't ask me why
this tactic was adopted).

Here's an isolation of that logic:

  pheonix:~ # cat /root/random_str.pl
  #!/usr/bin/perl
  print "$0 start ".time."\n";
  my $randStr = `cat /dev/urandom|tr -dc "a-zA-Z0-9"|fold -w 64|head -1`;
  print "$0 end ".time."\n";

You can run this from the command-line, to see how quickly it
nominally operates.

What I can reproduce in my environment, very reliably, is that when
this is invoked as a service:

- the 'head' command exits very quickly (to be expected)
- the shell does not exit (maybe it missed a SIGCHLD?)
- 'fold' chews a CPU core
- A kernel trace shows that 'fold' is spinning on SIGPIPEs, as its
  STDOUT is no longer connected to another process.

My service unit:

  pheonix:~ # cat /etc/systemd/system/random_str.service
  [Unit]
  Description=generate random number
  After=network.target local-fs.target
  
  [Service]
  Type=oneshot
  RemainAfterExit=yes
  ExecStart=/root/random_str.pl
  ExecStop=/usr/bin/true
  #TimeoutSec=infinity
  TimeoutSec=900
  
  [Install]
  WantedBy=multi-user.target

Easy to repro; this hangs forever, instead of exiting quickly.

  pheonix:~ # systemctl daemon-reload
  pheonix:~ # systemctl start random_str

Let me know if there are any other details of my environment that
would be helpful here.

-- 
Brian Reichert  
BSD admin/developer at large


Re: [systemd-devel] cannot unsubscribe from this list

2019-10-16 Thread Brian Reichert
On Wed, Oct 16, 2019 at 07:43:10AM +, Zbigniew Jędrzejewski-Szmek wrote:
> On Tue, Oct 15, 2019 at 04:08:24PM -0400, Brian Reichert wrote:
> > I initiated an unsubscribe from this web page:
> > 
> >   https://lists.freedesktop.org/mailman/options/systemd-devel
> > 
> > That created a confirmation email, that I replied to.
> 
> Yeah, that doesn't work. Use the web interface:
> > https://lists.freedesktop.org/mailman/listinfo/systemd-devel

I just tried that second web interface.

The resulting confirmation email also bounces:

  :
   131.252.210.177 does not like recipient.
  Remote host said: 550 5.1.1 :
  Recipient address rejected: User unknown in local recipient table
  Giving up on 131.252.210.177.

I can provide email headers, SMTP logs, etc., for either failing
use case, if anyone thinks that would help.

> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

[systemd-devel] cannot unsubscribe from this list

2019-10-15 Thread Brian Reichert
I initiated an unsubscribe from this web page:

  https://lists.freedesktop.org/mailman/options/systemd-devel

That created a confirmation email, that I replied to.

That yielded this bounce message:

  :
   131.252.210.177 does not like recipient.
  Remote host said: 550 5.1.1 :
  Recipient address rejected: User unknown in local recipient table
  Giving up on 131.252.210.177.

What steps should I be taking?

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-08-14 Thread Brian Reichert
On Wed, Aug 14, 2019 at 04:19:46PM +0100, Simon McVittie wrote:
> On Wed, 14 Aug 2019 at 10:26:53 -0400, Brian Reichert wrote:
> Doesn't daemonize(1) make stdin, stdout and stderr point to /dev/null,
> instead of closing them?

Looking at the source, yes, it does.

> Expecting arbitrary subprocesses to cope gracefully with being invoked
> without the three standard fds seems likely to be a losing battle.
> I've implemented this myself, in dbus; it isn't a whole lot of code,
> but it also isn't something that I would expect the authors of all CLI
> tools to get right.

I concede that reopening FD 0,1,2 is a good practice to insulate
against the issues you cite.

I agree with your points; I code aggressively, and sometimes forget
others don't.

> smcv
> 
> [1] I'm sure there are lots of other executables named daemon or daemonize
> in other OSs, and perhaps some of them get this wrong?

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-08-14 Thread Brian Reichert
On Wed, Aug 14, 2019 at 11:34:21AM +0200, Lennart Poettering wrote:
> Hence: your code that closes fd1 like this is simply buggy. Don't do
> that, you are shooting yourself in the foot.

Buggy or no, this is fifteen-year-old code, and prior cron/service
mgmt framework implementations had no issue.

And, if I were to ever use daemonize(1), or any other other canonical
mechanism for daemonizing code, STDOUT would normally be closed
under those circumstances, as well.

I'm wading into hypotheticals here, but daemonized code should, in
turn, be able to invoke whatever code it subsequently wants.

> Lennart
> 
> --
> Lennart Poettering, Berlin

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-08-13 Thread Brian Reichert
On Thu, Aug 01, 2019 at 07:18:20PM +, Zbigniew Jędrzejewski-Szmek wrote:
> Yes. (With the caveat that there *are* legitimate reasons to have new
> long-lived fds created, so not every long-lived fd is "wrong".)

I finally was able to track down what's happening on my system.

This is sufficient to reproduce the effect of increasing the number
of file descriptors open to /run/systemd/private; at least, on my
box, in its current state:

  sh -c 'exec 1>&-; /usr/bin/systemctl status ntpd.service'

We have a cronjob that closes STDOUT, remaps STDERR to a log file,
and runs this systemctl command.  In my environment, this one-liner
will cause that FD count to go up, 100% reproducibly.

Somehow, closing STDOUT is necessary to see this.

FWIW, the strace effort didn't yield anything; instead, I configured
auditd to reveal when systemctl was invoked, and found a pattern
of invocations I was able to backtrack to the cronjob.
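
(A sketch of the cronjob-side workaround, following the /dev/null observation
elsewhere in this thread; redirect STDOUT rather than closing it. Untested
here:)

  sh -c 'exec 1>/dev/null; /usr/bin/systemctl status ntpd.service'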

> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-08-01 Thread Brian Reichert
On Thu, Aug 01, 2019 at 08:17:01AM +, Zbigniew Jędrzejewski-Szmek wrote:
> The kernel will use the lower-numbered available fd, so there's lot of
> "reuse" of the same numbers happening. This strace means that between
> each of those close()s here, some other function call returned fd 19.
> Until we know what those calls are, we cannot say why fd19 remains
> open. (In fact, the only thing we can say for sure, is that the
> accept4() call shown above is not relevant.)

So, what I propose at this step:

- Restart my strace, this time using '-e trace=desc' (Trace all
  file descriptor related system calls.)

- Choose to focus on a single descriptor; when I passively notice
  that '19' has been reused a couple of times, stop the trace.

That should give me a smaller trace to analyze.
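
Concretely, something like this (a sketch; the same flags as the earlier
run, with the syscall filter widened):

  strace -p1 -t -s999 -e trace=desc -o /home/systemd.strace.desc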

> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-31 Thread Brian Reichert
On Wed, Jul 31, 2019 at 12:36:41AM +0300, Uoti Urpala wrote:
> On Tue, 2019-07-30 at 14:56 -0400, Brian Reichert wrote:
> > Between 13:49:30 and 13:50:01, I see 25 'successful' calls to
> > close(), e.g.:
> > 
> > 13:50:01 close(19)  = 0
> > 
> > Followed by getsockopt(), and a received message on the supposedly-closed
> > file descriptor:
> > 
> >   13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, 
> > gid=0}, [12]) = 0
> 
> Are you sure it's the same file descriptor? You don't explicitly say
> anything about there not being any relevant lines between those. Does
> systemd really just call getsockopt() on fd 19 after closing it, with
> nothing to trigger that? Obvious candidates to check in the strace
> would be an accept call returning a new fd 19, or epoll indicating
> activity on the fd (though I'd expect systemd to remove the fd from the
> epoll set after closing it).

My analysis is naive.

There was an earlier suggestion to use strace, restricting it to a
small set of system calls.

I then used a simple RE to look for the string '(19', to see calls where
'19' was used as an initial argument to system calls.  That's way too
simplistic.

To address some of your questions/points.

- No, I don't know if it's the same file descriptor.  I could not
  start strace early enough to catch the creation of several dozen
  file descriptors.

- I didn't say anything about lines between those that I cited, as
  I could not ascertain relevance.

- And I completely missed the case of the accept4() calls returning
  the value of 19, among other cases where '19' shows up as a value.

A regex-based search is certainly inconclusive, but now I'm using this:

  egrep -e '[^0-9:]19(\)|\}|\]|,)?' /home/systemd.strace.trimmed | less

The rhythm now seems to be more like this:

  13:50:01 accept4(13, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 19
  13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_RCVBUF, [4194304], [4]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
  13:50:01 getsockname(19, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0
  13:50:01 recvmsg(19, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45
  13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(3)=[{"OK 9fcf621ece0a4fe897586e28058cd2fb\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52
  13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1P\0\0\0\1\0\0\0p\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\7\0\0\0UnitNew\0\10\1g\0\2so\0", 128}, {"\20\0\0\0session-11.scope\0\0\0\0003\0\0\0/org/freedesktop/systemd1/unit/session_2d11_2escope\0", 80}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  13:50:01 close(19)  = 0
  ...

Mind you, I see a _lot_ more close() calls than accepts():

  localhost:~ # egrep -e '[^0-9:]19(\)|\}|\]|,)?' /home/systemd.strace.trimmed > /home/systemd.strace.trimmed.19
  localhost:~ # grep accept4\( /home/systemd.strace.trimmed.19 | cut -d' ' -f 2- | sort -u
  accept4(13, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 19
  localhost:~ # grep accept4\( /home/systemd.strace.trimmed.19 | cut -d' ' -f 2- | wc -l
  55
  localhost:~ # grep close\( /home/systemd.strace.trimmed.19 | cut -d' ' -f 2- | sort -u
  close(19)  = 0
  localhost:~ # grep close\( /home/systemd.strace.trimmed.19 | cut -d' ' -f 2- | wc -l
  1051

I'm not asserting the frequencies are indicative of anything wrong;
I'm just more used to a 1:1 correlation.

And, again, the age of the file in /proc/1/fd/19 never seems to
change:

  localhost:~ # ls -ld --full-time /proc/1/fd/19
  lrwx-- 1 root root 64 2019-07-30 15:45:25.531468318 -0400 /proc/1/fd/19 -> socket:[27085]
  localhost:~ # date
  Wed Jul 31 11:31:37 EDT 2019

That may be a red herring. I have been assuming that if an FD was
closed, then reopened/recreated by a process, that file would have
a new age.






Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-30 Thread Brian Reichert
On Thu, Jul 11, 2019 at 08:35:38PM +, Zbigniew Jędrzejewski-Szmek wrote:
> On Thu, Jul 11, 2019 at 10:08:43AM -0400, Brian Reichert wrote:
> > Does that sound like expected behavior?
> 
> No, this shouldn't happen.
> 
> What I was trying to say, is that if you have the strace log, you
> can figure out what created the stale connection and what the dbus
> call was, and from all that info it should be fairly simply to figure
> out what the calling command was. Once you have that, it'll be much
> easier to reproduce the issue in controlled setting and look for the
> fix.

I'm finally revisiting this. I haven't found a way to get a trace
to start early enough to catch the initial open() on all of the
targeted file descriptors, but I'm trying to make do with what I
have.

To sum up, in my naive analysis, I see close() called many times
on a file descriptor. I then see more messages come in on that same
descriptor.  But the timestamp of the descriptor in /proc never
changes.

I created a service to launch strace as early as I can figure:

  localhost:~ # cat /usr/lib/systemd/system/systemd_strace.service
  [Unit]
  Description=strace systemd
  DefaultDependencies=no
  After=local-fs.target
  Before=sysinit.target
  ConditionPathExists=!/etc/initrd-release
  
  [Service]
  ExecStart=/usr/bin/strace -p1 -t -o /home/systemd.strace -e recvmsg,close,accept4,getsockname,getsockopt,sendmsg -s999
  ExecStop=/bin/echo systemd_strace.service will soon exit
  Type=simple
  
  [Install]
  WantedBy=multi-user.target
  
I introduced the '-t' flag, so I'd get timestamps on the recorded
entries.

I rebooted the server, and after ~20 minutes, I found stale
descriptors that seem to date to when the host first booted.

Note their age relative to the boot time, and that they have no
connected peers.

  localhost:~ # uptime
   14:10pm  up   0:21,  3 users,  load average: 0.81, 0.24, 0.15
  localhost:~ # date
  Tue Jul 30 14:10:09 EDT 2019
  localhost:~ # lsof -nP /run/systemd/private | awk '/systemd/ { sub(/u/, "", $4); print $4}' | ( cd /proc/1/fd; xargs ls -t --full-time ) | tail -5
  lrwx-- 1 root root 64 2019-07-30 13:49:25.458694632 -0400 14 -> socket:[28742]
  lrwx-- 1 root root 64 2019-07-30 13:49:25.458694632 -0400 16 -> socket:[35430]
  lrwx-- 1 root root 64 2019-07-30 13:49:25.458694632 -0400 17 -> socket:[37758]
  lrwx-- 1 root root 64 2019-07-30 13:49:25.458694632 -0400 18 -> socket:[41044]
  lrwx-- 1 root root 64 2019-07-30 13:49:25.458694632 -0400 19 -> socket:[43411]
  localhost:~ # ss -x | grep /run/systemd/private | grep -v -e '* 0' | wc -l
  0

This is an XFS filesystem, so I can't directly get the creation
time of my trace file, but I can see the first entry is timestamped
'13:49:07'.

I copied the trace file aside, and edited that copy to trim everything
off after 14:10:09, when I ran that 'date' command above.

As early as I tried to start this trace, dozens of file descriptors
had already been created.

Trying to focus on FD 19 (the oldest connection to /run/systemd/private):

Between 13:49:30 and 13:50:01, I see 25 'successful' calls to
close(), e.g.:

13:50:01 close(19)  = 0

Followed by getsockopt(), and a received message on the supposedly-closed
file descriptor:

  13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_RCVBUF, [4194304], [4]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_PEERCRED, {pid=3323, uid=0, gid=0}, [12]) = 0
  13:50:01 getsockopt(19, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
  13:50:01 getsockname(19, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, [23]) = 0
  13:50:01 recvmsg(19, {msg_name(0)=NULL, msg_iov(1)=[{"\0AUTH EXTERNAL 30\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 45
  13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(3)=[{"OK 9fcf621ece0a4fe897586e28058cd2fb\r\nAGREE_UNIX_FD\r\n", 52}, {NULL, 0}, {NULL, 0}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 52
  13:50:01 sendmsg(19, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1P\0\0\0\1\0\0\0p\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0\0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\7\0\0\0UnitNew\0\10\1g\0\2so\0", 128}, {"\20\0\0\0session-11.scope\0\0\0\0003\0\0\0/org/freedesktop/systemd1/unit/session_2d11_2escope\0", 80}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)

I see a continuous stream of messages coming in on FD 19 through
the end of the trace, but the age of the file descriptor in /proc
never seems to change.

Am I misinterpreting something?

> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-11 Thread Brian Reichert
On Wed, Jul 10, 2019 at 10:44:14PM +, Zbigniew Jędrzejewski-Szmek wrote:
> That's ancient... 228 was released almost four years ago.

That's the joy of using a commercial Linux distribution; they tend
to be conservative about updates.  SLES may very well have backported
fixes to the packaged version they maintain.

They may also have a newer version of a systemd RPM for us to take.

I'm looking for an efficient way to repro the symptoms, so as to
confirm whether a newer RPM solves this for us.

> > > > When we first spin up a new SLES12 host with our custom services,
> > > > the number of connections to /run/systemd/private numbers in the
> > > > mere hundreds. 
> > 
> > > That sounds wrong already. Please figure out what those connections
> > > are. I'm afraid that you might have to do some debugging on your
> > > own, since this issue doesn't seem easily reproducible.

Above, I cite a desire for reproducing the symptoms.  If you're
confident that a newly-spun-up idle host should not hover at hundreds
of connections, then hypothetically I could update the vendor-provided
systemd RPM (if there is one), reboot, and see if the connection
count is reduced.

> strace -p1 -e recvmsg,close,accept4,getsockname,getsockopt,sendmsg -s999
>
> yields the relevant info. In particular, the pid, uid, and guid of the
> remote is shown. My approach would be to log this to some file, and
> then see which fds remain, and then look up this fd in the log.
> The recvmsg calls contain the serialized dbus calls, a bit messy but
> understandable. E.g. 'systemctl show systemd-udevd' gives something
> like this:

Thanks for such succinct feedback; I'll see what I can get from this.

In my prior email, I showed how some of the connections were
hours/days old, even with no connecting peer.

Does that sound like expected behavior?

> HTH,
> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-10 Thread Brian Reichert
On Wed, Jul 10, 2019 at 07:37:19AM +, Zbigniew Jędrzejewski-Szmek wrote:

> It's a bug report as any other. Writing a meaningful reply takes time
> and effort. Lack of time is a much better explanation than resentment.

I wasn't expressing resentment; I apologize if it came off that way.

> Please always specify the systemd version in use. We're not all SLES
> users, and even if we were, I assume that there might be different
> package versions over time.

Quite reasonable:

  localhost:/var/tmp # cat /etc/os-release
  NAME="SLES"
  VERSION="12-SP3"
  VERSION_ID="12.3"
  PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
  ID="sles"
  ANSI_COLOR="0;32"
  CPE_NAME="cpe:/o:suse:sles:12:sp3"

  localhost:/var/tmp # rpm -q systemd
  systemd-228-142.1.x86_64

> > When we first spin up a new SLES12 host with our custom services,
> > the number of connections to /run/systemd/private numbers in the
> > mere hundreds. 

> That sounds wrong already. Please figure out what those connections
> are. I'm afraid that you might have to do some debugging on your
> own, since this issue doesn't seem easily reproducible.

What tactics should I employ?  All of those file handles to
/run/systemd/private are owned by PID 1, and 'ss' implies there are
no peers.

'strace' on PID 1 shows messages are flowing, but that doesn't reveal
the logic of how the connections get created or culled, nor who
initiated them.

On a box with ~500 of these file handles, I can see that many of
them are hours or days old:

  localhost:/var/tmp # date
  Wed Jul 10 09:45:01 EDT 2019

  # new ones
  localhost:/var/tmp # lsof -nP /run/systemd/private | awk '/systemd/ { sub(/u/, "", $4); print $4}' | ( cd /proc/1/fd; xargs ls -t --full-time ) | head -5
  lrwx-- 1 root root 64 2019-07-10 09:45:05.211722809 -0400 561 -> socket:[1183838]
  lrwx-- 1 root root 64 2019-07-10 09:40:02.611726025 -0400 559 -> socket:[1173429]
  lrwx-- 1 root root 64 2019-07-10 09:40:02.611726025 -0400 560 -> socket:[1176265]
  lrwx-- 1 root root 64 2019-07-10 09:33:10.687730403 -0400 100 -> socket:[113992]
  lrwx-- 1 root root 64 2019-07-10 09:33:10.687730403 -0400 101 -> socket:[115163]
  xargs: ls: terminated by signal 13

  # old ones
  localhost:/var/tmp # lsof -nP /run/systemd/private | awk '/systemd/ { sub(/u/, "", $4); print $4}' | ( cd /proc/1/fd; xargs ls -t --full-time ) | tail -5
  lrwx-- 1 root root 64 2019-07-08 15:12:04.725350882 -0400 59 -> socket:[43097]
  lrwx-- 1 root root 64 2019-07-08 15:12:04.725350882 -0400 60 -> socket:[44029]
  lrwx-- 1 root root 64 2019-07-08 15:12:04.725350882 -0400 63 -> socket:[46234]
  lrwx-- 1 root root 64 2019-07-08 15:12:04.725350882 -0400 65 -> socket:[49252]
  lrwx-- 1 root root 64 2019-07-08 15:12:04.725350882 -0400 71 -> socket:[54064]
  
> > Is my guess about CONNECTIONS_MAX's relationship to /run/systemd/private
> > correct?
> 
> Yes. The number is hardcoded because it's expected to be "large
> enough". The connection count shouldn't be more than "a few" or maybe
> a dozen at any time.

Thanks for confirming that.

> Zbyszek

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] OFFLIST Re: systemd's connections to /run/systemd/private ?

2019-07-09 Thread Brian Reichert
On Tue, Jul 09, 2019 at 06:20:02PM +0100, systemd-devel@lists.freedesktop.org wrote:
> 
> Posting private messages to a public list is generally considered very
> RUDE.

I agree, and I apologize.

The message I received, and replied to, did not come from a private
email address; it apparently came from the mailing list software,
and I did not realize that until I hit 'reply':

  Date: Tue, 9 Jul 2019 11:21:13 +0100
  From: systemd-devel@lists.freedesktop.org
  To: Brian Reichert 
  Subject: OFFLIST Re: [systemd-devel] systemd's connections to
   /run/systemd/private ?

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] OFFLIST Re: systemd's connections to /run/systemd/private ?

2019-07-09 Thread Brian Reichert
On Tue, Jul 09, 2019 at 11:21:13AM +0100, systemd-devel@lists.freedesktop.org wrote:
> Hi Brian
> 
> I feel embarrassed at having recommended you to join the systemd-devel
> list :( I don't understand why nobody is responding to you, and I'm not
> qualified to help!

I appreciate the private feedback.  I recognize this is an all-volunteer
ecosystem, but I'm not used to radio silence. :/

> There is a bit of anti-SUSE feeling for some reason
> that I don't really understand, but Lennart in particular normally
> seems to be very helpful, as does Zbigniew.

I'm new to this list, so haven't seen any anti-SLES sentiments as
of yet.  But, based on the original symptoms I reported, this occurs
on many distributions.

> Perhaps it would be worth restating your problem. I would suggest
> sticking to the facts of the problem as you have experienced them and
> post the full logs somewhere so that people can see the problem. What is
> logged when a server fails to reboot, for example.

I'd love to restate the problem in a way that's tractable, and
distinct from other people's reports of these symptoms.  If you
search the Internet for forum messages:

  systemd "Too many concurrent connections, refusing"

You'll see a lot of hits.  The only solution I've seen to date is
the systemd maintainers bumping up a hard-coded constant, a few
times over the last few years.

(The fact that they've adjusted it at least twice, but never made
it a tunable in a config file somewhere, is worrisome.)

> Just report a bug for people to
> solve.

I wanted to avoid calling it a 'bug' report, as I wanted to establish
what expected behavior is.

But, your advice isn't bad.  I'll try to come up with something more
succinct.

Thanks again...

> HTH, Dave
> 

-- 
Brian Reichert  
BSD admin/developer at large

Re: [systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-08 Thread Brian Reichert
On Tue, Jul 02, 2019 at 09:57:44AM -0400, Brian Reichert wrote:
> At $JOB, on some of our SLES12 boxes, our logs are getting swamped
> with messages saying:
> 
>   "Too many concurrent connections, refusing"
> 
> It's hampering our ability to manage services, e.g.:
> 
>   # systemctl status ntpd
>   Failed to get properties: Connection reset by peer

Can anyone at least confirm that the CONNECTIONS_MAX limit in dbus.c
does relate to the number of connections systemd has to
/run/systemd/private?

We can't even reliably reboot the server in question under these
circumstances...


-- 
Brian Reichert  
BSD admin/developer at large

[systemd-devel] systemd's connections to /run/systemd/private ?

2019-07-02 Thread Brian Reichert
At $JOB, on some of our SLES12 boxes, our logs are getting swamped
with messages saying:

  "Too many concurrent connections, refusing"

It's hampering our ability to manage services, e.g.:

  # systemctl status ntpd
  Failed to get properties: Connection reset by peer

Near as I can tell from a quick read of the source of dbus.c, we're
hitting a hard-coded limit of CONNECTIONS_MAX (set to 4096).  I
think this is related to the number of connections systemd (pid 1)
has to /run/systemd/private, but I'm guessing here:

  # ss -x | grep /run/systemd/private | wc -l
  4015

But, despite the almost 4k connections, 'ss' shows that there are
no connected peers:

  # ss -x | grep /run/systemd/private | grep -v -e '* 0' | wc -l
  0

The symptom here is that, depending on system activity, systemd stops
being able to process new requests. systemd allows requests to come
in (e.g. via an invocation of 'systemctl'), but if I understand the
source of dbus.c, when there are too many connections to its
outgoing stream, systemd rejects the effort, apparently with no
retry.

When we first spin up a new SLES12 host with our custom services,
the number of connections to /run/systemd/private numbers in the
mere hundreds.  As workloads increase, the number of connections
rises into the thousands.  Some hosts are plagued with the 'Too many
concurrent connections' messages, some are not. Empirically, all I've been
able to see is that the number of systemd's connections to
/run/systemd/private tips over 4k.

Is my guess about CONNECTIONS_MAX's relationship to /run/systemd/private
correct?

- I can't demonstrate that there are any consumers of this stream.
- I can't explain why the connection count increases over time.
- The CONNECTIONS_MAX constant is hard-coded, and it gets increased
  every few months/years, but never seems to be exposed as something
  you can set in a config file.
- I don't know what tunables affect the lifetime/culling of those
  connections.

I have a hypothesis that this may be some resource leak in systemd,
but I've not found a way to test that.
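
One crude way I can think of to test it: periodically sample the connection
count and see whether it only ever grows. A sketch, with the interval and
log path arbitrary:

  while sleep 300; do
    echo "$(date '+%F %T') $(ss -x | grep -c /run/systemd/private)"
  done >> /var/tmp/private-conn-count.log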

-- 
Brian Reichert  
BSD admin/developer at large