On 2020-01-12T20:41+0100, Michael Biebl wrote:
Am 12.01.20 um 20:15 schrieb Santiago Vila:
My theory is that this is some kind of race condition.

I initially cloned the machine from another one where this happened.

Then I discovered that the problem also happens (randomly) in a brand
new machine if I copy the journal from the "bad" machine.

However, there is nothing special about the journal (or at least
"journalctl --verify" says it's ok), except maybe that it's several
megabytes long.

Could it be that systemd spends some time processing the journal at
boot time and this is what triggers the race condition?

On my buster system:


an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
systemd-timesyncd.service/stop conflicted_by=yes
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
systemd-timesyncd.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing 
conflicting jobs systemd-timesyncd.service/stop,systemd-timesyncd.service/start 
by deleting job systemd-timesyncd.service/start
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
chrony.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
chrony.service/stop conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs 
chrony.service/start,chrony.service/stop by deleting job chrony.service/stop
Jan 12 20:04:46 debian systemd[1]: Found redundant job 
systemd-timesyncd.service/stop, dropping from transaction.


Those lines are missing on GCE system. Instead I see
Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start 
because of sysinit.target/start
Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of 
systemd-timesyncd.service/start
Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job 
systemd-timesyncd.service/start as 119
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start 
finished, result=canceled
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job 
chrony.service/stop as 121
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop 
finished, result=done


The problem is, that the transaction is computed *before*
ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
are evaluation just before the binary is executed) and it might indeed
depend on the ordering, in which the jobs are scheduled.

Makes sense. Thanks for debugging this.

So, the simplest fix would be to drop the line
Conflicts=systemd-timesyncd.service openntpd.service ntp.service ntpsec.service
from chrony.service.
This way, systemd will schedule the start of both services.
chrony.service will be started properly and for
systemd-timesyncd.service the ConditionFileIsExecutable will kick in.

If there is no risk of regression, then I’m all for making this change.

I'm sorry, that I first asked chrony (and other NTP implementations) to
add a Conflicts=systemd-timesyncd.service, hoping I could get rid of the
/lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
hack.

Don’t feel sorry Michael, that was even requested by systemd upstream.
From systemd changelog:
“… Alternative NTP implementations should add a Conflicts=systemd-timesyncd.service to their unit files to take over and replace systemd's NTP default functionality.”

But it appears, this hack is actually much more reliable then using
Conflicts=
See
https://github.com/systemd/systemd/issues/7104#issuecomment-456755980
which explains the problem. We'd need some kind of asymmetrical conflicts.

We probably should not keep this hack forever, but instead let timedated read known NTP implementation unit names from usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in systemd 243. I added the necessary bits in chrony 3.5-5.

By the way, it would be great to know if this issue can be triggered by upgrading chrony to the 3.5-5 version (and with systemd >= 243-1).

I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
and virtualbox [1] to drop the Conflicts= line again.
Maybe we should even do that for buster via a stable upload?

Thoughts?

Seems good.

Michael

Cheers,
Vincent

Attachment: signature.asc
Description: PGP signature

Reply via email to