On 2020-01-12T23:24+0100, Michael Biebl wrote:
Am 12.01.20 um 23:08 schrieb Vincent Blut:
On 2020-01-12T20:41+0100, Michael Biebl wrote:
Am 12.01.20 um 20:15 schrieb Santiago Vila:
My theory is that this is some kind of race condition.

I initially cloned the machine from another one where this happened.

Then I discovered that the problem also happens (randomly) in a brand
new machine if I copy the journal from the "bad" machine.

However, there is nothing special about the journal (or at least
"journalctl --verify" says it's ok), except maybe that it's several
megabytes long.

Could it be that systemd spends some time processing the journal at
boot time and this is what triggers the race condition?

On my buster system:


an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
at job systemd-timesyncd.service/stop conflicted_by=yes
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
at job systemd-timesyncd.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing
conflicting jobs
systemd-timesyncd.service/stop,systemd-timesyncd.service/start by
deleting job systemd-timesyncd.service/start
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
chrony.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
chrony.service/stop conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting
jobs chrony.service/start,chrony.service/stop by deleting job
chrony.service/stop
Jan 12 20:04:46 debian systemd[1]: Found redundant job
systemd-timesyncd.service/stop, dropping from transaction.


Those lines are missing on GCE system. Instead I see
Jan 12 17:02:01 d1 systemd[1]: Keeping job
systemd-timesyncd.service/start because of sysinit.target/start
Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop
because of systemd-timesyncd.service/start
Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed
new job systemd-timesyncd.service/start as 119
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82
chrony.service/start finished, result=canceled
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job
chrony.service/stop as 121
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121
chrony.service/stop finished, result=done


The problem is, that the transaction is computed *before*
ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
are evaluation just before the binary is executed) and it might indeed
depend on the ordering, in which the jobs are scheduled.

Makes sense. Thanks for debugging this.

So, the simplest fix would be to drop the line
Conflicts=systemd-timesyncd.service openntpd.service ntp.service
ntpsec.service
from chrony.service.
This way, systemd will schedule the start of both services.
chrony.service will be started properly and for
systemd-timesyncd.service the ConditionFileIsExecutable will kick in.

If there is no risk of regression, then I’m all for making this change.

I don't see a risk for a regression, given that systemd in buster ships
disable-with-time-daemon.conf.
So if the drop the Conflicts again, we'd basically have the situation
again as in stretch.

Not for chrony and openntpd. Both were already in conflict with systemd-timesyncd.service in Stretch.

That said, we should probably first upload this change to sid to give it
some wider exposure first.

Sure.

We probably should not keep this hack forever, but instead let timedated
read known NTP implementation unit names from
usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
systemd 243. I added the necessary bits in chrony 3.5-5.

Hm, I don't see how this change in timedate would actually help in this
situation.
Support for ntp-units.d in timedated/timedatectl just enables that if
you use "timedatectl set-ntp true|false" it will prefer alternatives if
installed.

Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when starting? I thought that was the case and that it remained inactive in case an NTP implementation with a higher priority was found there.

We don't actually use "timedatectl set-ntp true|false" though in our
maintainer scripts though (and I don't think we should).

Agreed.

Attachment: signature.asc
Description: PGP signature

Reply via email to