Bug#947936: chrony: Does (still) not start properly on boot on buster
Control: tags -1 pending Santiago, Michael, does that changelog entry about this issue feels clear to you? https://salsa.debian.org/debian/chrony/-/blob/master/debian/changelog On 2020-01-02T13:38+0100, Santiago Vila wrote: Package: chrony Version: 3.4-4 Severity: important Dear maintainer: Apparently, installing chrony does not ensure at all that it will work. Google has moved from ntp in Debian 9 to chrony in Debian 10 for their default Debian GCE images, and I discovered this on a lot of GCE instances having a clock several minutes off. The problem I found is very similar to the one described here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370 I believe the best summary of the problem was made by Michael Biebl here: https://github.com/systemd/systemd/issues/7104#issuecomment-471329392 Quoting Michael: As it stands, the current practice of having systemd-timesyncd.service enabled by default (in Debian) and alternative implementations like chrony or ntpd declare Conflicts=systemd-timesyncd.service in their service file does not work reliably. AFAIK, this has been fixed on the systemd side in version 241-3 by dropping the "Conflicts" systemd had on chrony or ntpd. Unfortunately, AFAIK, conflicts are bi-directional, so apparently the problem will persist in buster as far as chrony still has conflicts in the systemd unit file. I've noticed this problem happens randomly (it happens in some instances, it does not happens in some others), so I don't have a "recipe" as such to reproduce it. However, I have a particular instance at GCE showing this behaviour which I could try to clone to give you ssh access if required (please contact me privately for details). The behaviour is the following: Both systemd-timesyncd and chrony are enabled (which is the default on GCE instances). Just after a reboot, "systemctl status chrony" shows this: ● chrony.service - chrony, an NTP client/server Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled) Active: inactive (dead) Docs: man:chronyd(8) man:chronyc(1) man:chrony.conf(5) and I see this in the boot log: systemd[1]: Condition check resulted in Network Time Synchronization being skipped. If I do "systemctl disable systemd-timesyncd" and reboot, chrony is properly loaded and it runs. If I do "systemctl enable systemd-timesyncd" again and reboot, chrony is shown as "inactive (dead)" again. At this point, if I edit /etc/systemd/system/multi-user.target.wants/chrony.service to remove the Conflicts line and reboot, chrony is properly loaded and it runs. [ I'm reporting this as "important" because I believe it to be the kind of bug that should be fixed in a point release of Debian 10 ]. [ Cc to Michael Biebl in case he would like to comment on the issue ]. Thanks. signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
On Sat, Feb 01, 2020 at 10:46:24PM +0100, Michael Biebl wrote: > > Would it make sense to ship systemd-timesyncd disabled by default in > > buster and add a README to enable it only if the user really decides > > to enable it? (Maybe also documenting this in the release notes). > > I think this would break more setups then it fixes. > The default behaviour of systemd-timesyncd has been since two releases > to be enabled by default. We can't easily change that. > > > That would be the most simple solution for stable that I can think, > > as it would reduce the number of packages to change to just one. > > Unfortunately I think that disabling systemd-timesyncd by default is one > of the most intrusive changes. After all, systemd is installed by > default (and thus systemd-timesyncd enabled by default). I fear this is > a no-go. Ok, in such case, the only other solution which comes to mind is what you proposed in the message I was replying to, i.e. this: > I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec > and virtualbox [1] to drop the Conflicts= line again. If I'm not mistaken, this is how it was done in stretch, so the fix would be as "conservative" as it can be. I would not worry about the number of packages that need to be changed being "high". If you as systemd maintainer believe that this is the best solution for buster, I would hope that the other maintainers agree. Thanks.
Bug#947936: chrony: Does (still) not start properly on boot on buster
On 2020-02-01T22:46+0100, Michael Biebl wrote: Am 01.02.20 um 22:37 schrieb Santiago Vila: On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote: I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec and virtualbox [1] to drop the Conflicts= line again. Maybe we should even do that for buster via a stable upload? Thoughts? Hi. I definitely think this should be fixed in stable, in whatever way it's considered best for stable. The last thing a system admin would expect from Debian stable is that the clock is off by several minutes in a system where a time-keeping package like ntp or chrony is present. This was completely unexpected for me. (In fact, I would have reported this as serious but I prefer to concentrate on finding a good fix). Regarding the proper fix: Anything which makes chrony and ntp work again (without surprises) would do. I agree that the less intrusive the change, the better. In the Debian 10 instances at GCE where I found this I just did this: systemctl disable systemd-timesyncd Would it make sense to ship systemd-timesyncd disabled by default in buster and add a README to enable it only if the user really decides to enable it? (Maybe also documenting this in the release notes). I think this would break more setups then it fixes. The default behaviour of systemd-timesyncd has been since two releases to be enabled by default. We can't easily change that. That would be the most simple solution for stable that I can think, as it would reduce the number of packages to change to just one. Unfortunately I think that disabling systemd-timesyncd by default is one of the most intrusive changes. After all, systemd is installed by default (and thus systemd-timesyncd enabled by default). I fear this is a no-go. I wholeheartedly agree with you, Michael. To me having an {S}NTP implementation by default is a must. Disabling systemd-timesyncd at this point would probably seen as a serious regression. Despite what Balint thinks, I think that we, maintainers of alternative NTP implementations, should just stop conflicting with systemd-timesyncd.service on stable and oldstable. Regarding Bulleye, that should not be an issue if Balint and Michael’s work is pushed in the archive. signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 01.02.20 um 22:37 schrieb Santiago Vila: > On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote: > >> I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec >> and virtualbox [1] to drop the Conflicts= line again. >> Maybe we should even do that for buster via a stable upload? >> >> Thoughts? > > Hi. I definitely think this should be fixed in stable, in whatever way > it's considered best for stable. > > The last thing a system admin would expect from Debian stable is that > the clock is off by several minutes in a system where a time-keeping > package like ntp or chrony is present. This was completely unexpected > for me. (In fact, I would have reported this as serious but I prefer > to concentrate on finding a good fix). > > Regarding the proper fix: Anything which makes chrony and ntp work > again (without surprises) would do. I agree that the less intrusive > the change, the better. > > In the Debian 10 instances at GCE where I found this I just did this: > > systemctl disable systemd-timesyncd > > Would it make sense to ship systemd-timesyncd disabled by default in > buster and add a README to enable it only if the user really decides > to enable it? (Maybe also documenting this in the release notes). I think this would break more setups then it fixes. The default behaviour of systemd-timesyncd has been since two releases to be enabled by default. We can't easily change that. > That would be the most simple solution for stable that I can think, > as it would reduce the number of packages to change to just one. Unfortunately I think that disabling systemd-timesyncd by default is one of the most intrusive changes. After all, systemd is installed by default (and thus systemd-timesyncd enabled by default). I fear this is a no-go. signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote: > I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec > and virtualbox [1] to drop the Conflicts= line again. > Maybe we should even do that for buster via a stable upload? > > Thoughts? Hi. I definitely think this should be fixed in stable, in whatever way it's considered best for stable. The last thing a system admin would expect from Debian stable is that the clock is off by several minutes in a system where a time-keeping package like ntp or chrony is present. This was completely unexpected for me. (In fact, I would have reported this as serious but I prefer to concentrate on finding a good fix). Regarding the proper fix: Anything which makes chrony and ntp work again (without surprises) would do. I agree that the less intrusive the change, the better. In the Debian 10 instances at GCE where I found this I just did this: systemctl disable systemd-timesyncd Would it make sense to ship systemd-timesyncd disabled by default in buster and add a README to enable it only if the user really decides to enable it? (Maybe also documenting this in the release notes). That would be the most simple solution for stable that I can think, as it would reduce the number of packages to change to just one. Thanks.
Bug#947936: chrony: Does (still) not start properly on boot on buster
Hi Feri, On 2020-01-09T19:31+0100, wf...@niif.hu wrote: Hi, Don't you think #873057 is actually the same issue discovered with ntp? Please consider merging it in if you agree. Indeed, this is the same issue. signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
On 2020-01-13T00:48+0100, Michael Biebl wrote: Am 13.01.20 um 00:43 schrieb Michael Biebl: Am 13.01.20 um 00:05 schrieb Vincent Blut: On 2020-01-12T23:24+0100, Michael Biebl wrote: Am 12.01.20 um 23:08 schrieb Vincent Blut: We probably should not keep this hack forever, but instead let timedated read known NTP implementation unit names from usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in systemd 243. I added the necessary bits in chrony 3.5-5. Hm, I don't see how this change in timedate would actually help in this situation. Support for ntp-units.d in timedated/timedatectl just enables that if you use "timedatectl set-ntp true|false" it will prefer alternatives if installed. Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when starting? I thought that was the case and that it remained inactive in case an NTP implementation with a higher priority was found there. Reading through the diff at https://github.com/systemd/systemd/pull/13136 that doesn't seem to be the case. ntp-units.d is only used by timedated, not timesyncd. Fwiw, there is this related MR https://salsa.debian.org/systemd-team/systemd/merge_requests/61 I mentioned another bug report about a similar issue with ntp. Ideally, only one NTP client should be enabled at a time. Note that we restore timesyncd (if it is enabled) on chrony removal to not wait for the next reboot to get the clock synchronized again, so if having only one NTP client enabled at a time is where we are heading for, then I’ll have to drop that code snippet. signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 13.01.20 um 00:43 schrieb Michael Biebl: > Am 13.01.20 um 00:05 schrieb Vincent Blut: >> On 2020-01-12T23:24+0100, Michael Biebl wrote: >>> Am 12.01.20 um 23:08 schrieb Vincent Blut: > We probably should not keep this hack forever, but instead let timedated read known NTP implementation unit names from usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in systemd 243. I added the necessary bits in chrony 3.5-5. >>> >>> Hm, I don't see how this change in timedate would actually help in this >>> situation. >>> Support for ntp-units.d in timedated/timedatectl just enables that if >>> you use "timedatectl set-ntp true|false" it will prefer alternatives if >>> installed. >> >> Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when >> starting? >> I thought that was the case and that it remained inactive in case an >> NTP implementation with a higher priority was found there. > > Reading through the diff at > https://github.com/systemd/systemd/pull/13136 that doesn't seem to be > the case. ntp-units.d is only used by timedated, not timesyncd. > > Fwiw, there is this related MR https://salsa.debian.org/systemd-team/systemd/merge_requests/61 Ideally, only one NTP client should be enabled at a time. signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 13.01.20 um 00:05 schrieb Vincent Blut: > On 2020-01-12T23:24+0100, Michael Biebl wrote: >> Am 12.01.20 um 23:08 schrieb Vincent Blut: >>> We probably should not keep this hack forever, but instead let timedated >>> read known NTP implementation unit names from >>> usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in >>> systemd 243. I added the necessary bits in chrony 3.5-5. >> >> Hm, I don't see how this change in timedate would actually help in this >> situation. >> Support for ntp-units.d in timedated/timedatectl just enables that if >> you use "timedatectl set-ntp true|false" it will prefer alternatives if >> installed. > > Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when > starting? > I thought that was the case and that it remained inactive in case an > NTP implementation with a higher priority was found there. Reading through the diff at https://github.com/systemd/systemd/pull/13136 that doesn't seem to be the case. ntp-units.d is only used by timedated, not timesyncd. signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
On 2020-01-12T23:24+0100, Michael Biebl wrote: Am 12.01.20 um 23:08 schrieb Vincent Blut: On 2020-01-12T20:41+0100, Michael Biebl wrote: Am 12.01.20 um 20:15 schrieb Santiago Vila: My theory is that this is some kind of race condition. I initially cloned the machine from another one where this happened. Then I discovered that the problem also happens (randomly) in a brand new machine if I copy the journal from the "bad" machine. However, there is nothing special about the journal (or at least "journalctl --verify" says it's ok), except maybe that it's several megabytes long. Could it be that systemd spends some time processing the journal at boot time and this is what triggers the race condition? On my buster system: an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job systemd-timesyncd.service/stop conflicted_by=yes Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job systemd-timesyncd.service/start conflicted_by=no Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing conflicting jobs systemd-timesyncd.service/stop,systemd-timesyncd.service/start by deleting job systemd-timesyncd.service/start Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job chrony.service/start conflicted_by=no Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job chrony.service/stop conflicted_by=no Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs chrony.service/start,chrony.service/stop by deleting job chrony.service/stop Jan 12 20:04:46 debian systemd[1]: Found redundant job systemd-timesyncd.service/stop, dropping from transaction. Those lines are missing on GCE system. Instead I see Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start because of sysinit.target/start Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of systemd-timesyncd.service/start Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job systemd-timesyncd.service/start as 119 Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start finished, result=canceled Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job chrony.service/stop as 121 Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop finished, result=done The problem is, that the transaction is computed *before* ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions are evaluation just before the binary is executed) and it might indeed depend on the ordering, in which the jobs are scheduled. Makes sense. Thanks for debugging this. So, the simplest fix would be to drop the line Conflicts=systemd-timesyncd.service openntpd.service ntp.service ntpsec.service from chrony.service. This way, systemd will schedule the start of both services. chrony.service will be started properly and for systemd-timesyncd.service the ConditionFileIsExecutable will kick in. If there is no risk of regression, then I’m all for making this change. I don't see a risk for a regression, given that systemd in buster ships disable-with-time-daemon.conf. So if the drop the Conflicts again, we'd basically have the situation again as in stretch. Not for chrony and openntpd. Both were already in conflict with systemd-timesyncd.service in Stretch. That said, we should probably first upload this change to sid to give it some wider exposure first. Sure. We probably should not keep this hack forever, but instead let timedated read known NTP implementation unit names from usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in systemd 243. I added the necessary bits in chrony 3.5-5. Hm, I don't see how this change in timedate would actually help in this situation. Support for ntp-units.d in timedated/timedatectl just enables that if you use "timedatectl set-ntp true|false" it will prefer alternatives if installed. Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when starting? I thought that was the case and that it remained inactive in case an NTP implementation with a higher priority was found there. We don't actually use "timedatectl set-ntp true|false" though in our maintainer scripts though (and I don't think we should). Agreed. signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 12.01.20 um 23:08 schrieb Vincent Blut: > On 2020-01-12T20:41+0100, Michael Biebl wrote: >> Am 12.01.20 um 20:15 schrieb Santiago Vila: >>> My theory is that this is some kind of race condition. >>> >>> I initially cloned the machine from another one where this happened. >>> >>> Then I discovered that the problem also happens (randomly) in a brand >>> new machine if I copy the journal from the "bad" machine. >>> >>> However, there is nothing special about the journal (or at least >>> "journalctl --verify" says it's ok), except maybe that it's several >>> megabytes long. >>> >>> Could it be that systemd spends some time processing the journal at >>> boot time and this is what triggers the race condition? >> >> On my buster system: >> >> >>> an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking >>> at job systemd-timesyncd.service/stop conflicted_by=yes >>> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking >>> at job systemd-timesyncd.service/start conflicted_by=no >>> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing >>> conflicting jobs >>> systemd-timesyncd.service/stop,systemd-timesyncd.service/start by >>> deleting job systemd-timesyncd.service/start >>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job >>> chrony.service/start conflicted_by=no >>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job >>> chrony.service/stop conflicted_by=no >>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting >>> jobs chrony.service/start,chrony.service/stop by deleting job >>> chrony.service/stop >>> Jan 12 20:04:46 debian systemd[1]: Found redundant job >>> systemd-timesyncd.service/stop, dropping from transaction. >> >> >> Those lines are missing on GCE system. Instead I see >>> Jan 12 17:02:01 d1 systemd[1]: Keeping job >>> systemd-timesyncd.service/start because of sysinit.target/start >>> Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop >>> because of systemd-timesyncd.service/start >>> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed >>> new job systemd-timesyncd.service/start as 119 >>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 >>> chrony.service/start finished, result=canceled >>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job >>> chrony.service/stop as 121 >>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 >>> chrony.service/stop finished, result=done >> >> >> The problem is, that the transaction is computed *before* >> ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions >> are evaluation just before the binary is executed) and it might indeed >> depend on the ordering, in which the jobs are scheduled. > > Makes sense. Thanks for debugging this. > >> So, the simplest fix would be to drop the line >>> Conflicts=systemd-timesyncd.service openntpd.service ntp.service >>> ntpsec.service >> from chrony.service. >> This way, systemd will schedule the start of both services. >> chrony.service will be started properly and for >> systemd-timesyncd.service the ConditionFileIsExecutable will kick in. > > If there is no risk of regression, then I’m all for making this change. I don't see a risk for a regression, given that systemd in buster ships disable-with-time-daemon.conf. So if the drop the Conflicts again, we'd basically have the situation again as in stretch. That said, we should probably first upload this change to sid to give it some wider exposure first. > We probably should not keep this hack forever, but instead let timedated > read known NTP implementation unit names from > usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in > systemd 243. I added the necessary bits in chrony 3.5-5. Hm, I don't see how this change in timedate would actually help in this situation. Support for ntp-units.d in timedated/timedatectl just enables that if you use "timedatectl set-ntp true|false" it will prefer alternatives if installed. We don't actually use "timedatectl set-ntp true|false" though in our maintainer scripts though (and I don't think we should). signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
On 2020-01-12T20:41+0100, Michael Biebl wrote: Am 12.01.20 um 20:15 schrieb Santiago Vila: My theory is that this is some kind of race condition. I initially cloned the machine from another one where this happened. Then I discovered that the problem also happens (randomly) in a brand new machine if I copy the journal from the "bad" machine. However, there is nothing special about the journal (or at least "journalctl --verify" says it's ok), except maybe that it's several megabytes long. Could it be that systemd spends some time processing the journal at boot time and this is what triggers the race condition? On my buster system: an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job systemd-timesyncd.service/stop conflicted_by=yes Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job systemd-timesyncd.service/start conflicted_by=no Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing conflicting jobs systemd-timesyncd.service/stop,systemd-timesyncd.service/start by deleting job systemd-timesyncd.service/start Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job chrony.service/start conflicted_by=no Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job chrony.service/stop conflicted_by=no Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs chrony.service/start,chrony.service/stop by deleting job chrony.service/stop Jan 12 20:04:46 debian systemd[1]: Found redundant job systemd-timesyncd.service/stop, dropping from transaction. Those lines are missing on GCE system. Instead I see Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start because of sysinit.target/start Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of systemd-timesyncd.service/start Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job systemd-timesyncd.service/start as 119 Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start finished, result=canceled Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job chrony.service/stop as 121 Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop finished, result=done The problem is, that the transaction is computed *before* ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions are evaluation just before the binary is executed) and it might indeed depend on the ordering, in which the jobs are scheduled. Makes sense. Thanks for debugging this. So, the simplest fix would be to drop the line Conflicts=systemd-timesyncd.service openntpd.service ntp.service ntpsec.service from chrony.service. This way, systemd will schedule the start of both services. chrony.service will be started properly and for systemd-timesyncd.service the ConditionFileIsExecutable will kick in. If there is no risk of regression, then I’m all for making this change. I'm sorry, that I first asked chrony (and other NTP implementations) to add a Conflicts=systemd-timesyncd.service, hoping I could get rid of the /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf hack. Don’t feel sorry Michael, that was even requested by systemd upstream. From systemd changelog: “… Alternative NTP implementations should add a Conflicts=systemd-timesyncd.service to their unit files to take over and replace systemd's NTP default functionality.” But it appears, this hack is actually much more reliable then using Conflicts= See https://github.com/systemd/systemd/issues/7104#issuecomment-456755980 which explains the problem. We'd need some kind of asymmetrical conflicts. We probably should not keep this hack forever, but instead let timedated read known NTP implementation unit names from usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in systemd 243. I added the necessary bits in chrony 3.5-5. By the way, it would be great to know if this issue can be triggered by upgrading chrony to the 3.5-5 version (and with systemd >= 243-1). I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec and virtualbox [1] to drop the Conflicts= line again. Maybe we should even do that for buster via a stable upload? Thoughts? Seems good. Michael Cheers, Vincent signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 12.01.20 um 20:15 schrieb Santiago Vila: > My theory is that this is some kind of race condition. > > I initially cloned the machine from another one where this happened. > > Then I discovered that the problem also happens (randomly) in a brand > new machine if I copy the journal from the "bad" machine. > > However, there is nothing special about the journal (or at least > "journalctl --verify" says it's ok), except maybe that it's several > megabytes long. > > Could it be that systemd spends some time processing the journal at > boot time and this is what triggers the race condition? On my buster system: > an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job > systemd-timesyncd.service/stop conflicted_by=yes > Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job > systemd-timesyncd.service/start conflicted_by=no > Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing > conflicting jobs > systemd-timesyncd.service/stop,systemd-timesyncd.service/start by deleting > job systemd-timesyncd.service/start > Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job > chrony.service/start conflicted_by=no > Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job > chrony.service/stop conflicted_by=no > Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs > chrony.service/start,chrony.service/stop by deleting job chrony.service/stop > Jan 12 20:04:46 debian systemd[1]: Found redundant job > systemd-timesyncd.service/stop, dropping from transaction. Those lines are missing on GCE system. Instead I see > Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start > because of sysinit.target/start > Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job > systemd-timesyncd.service/start as 119 > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start > finished, result=canceled > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job > chrony.service/stop as 121 > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop > finished, result=done The problem is, that the transaction is computed *before* ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions are evaluation just before the binary is executed) and it might indeed depend on the ordering, in which the jobs are scheduled. So, the simplest fix would be to drop the line > Conflicts=systemd-timesyncd.service openntpd.service ntp.service > ntpsec.service from chrony.service. This way, systemd will schedule the start of both services. chrony.service will be started properly and for systemd-timesyncd.service the ConditionFileIsExecutable will kick in. I'm sorry, that I first asked chrony (and other NTP implementations) to add a Conflicts=systemd-timesyncd.service, hoping I could get rid of the /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf hack. But it appears, this hack is actually much more reliable then using Conflicts= See https://github.com/systemd/systemd/issues/7104#issuecomment-456755980 which explains the problem. We'd need some kind of asymmetrical conflicts. I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec and virtualbox [1] to drop the Conflicts= line again. Maybe we should even do that for buster via a stable upload? Thoughts? Michael [1] https://codesearch.debian.net/search?q=Conflicts%3Dsystemd-timesyncd.service=1 signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
My theory is that this is some kind of race condition. I initially cloned the machine from another one where this happened. Then I discovered that the problem also happens (randomly) in a brand new machine if I copy the journal from the "bad" machine. However, there is nothing special about the journal (or at least "journalctl --verify" says it's ok), except maybe that it's several megabytes long. Could it be that systemd spends some time processing the journal at boot time and this is what triggers the race condition? Thanks.
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 12.01.20 um 18:18 schrieb Michael Biebl: > Am 04.01.20 um 20:22 schrieb Santiago Vila: >> Hi. I managed to put the instances in a state where a reboot does not >> make the problem to go away. Please give it a try. > > I could reproduce the issue on the system provided by Santiago. > > I'm dumping the relevant parts of the log here: From a buster system of mine which has chrony and systemd-timesyncd installed and enabled: > ● chrony.service - chrony, an NTP client/server >Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor > preset: enabled) >Active: active (running) since Sun 2020-01-12 19:48:09 CET; 1min 15s ago > Docs: man:chronyd(8) >man:chronyc(1) >man:chrony.conf(5) > Process: 395 ExecStart=/usr/sbin/chronyd $DAEMON_OPTS (code=exited, > status=0/SUCCESS) > Process: 404 ExecStartPost=/usr/lib/chrony/chrony-helper update-daemon > (code=exited, status=0/SUCCESS) > Main PID: 401 (chronyd) > Tasks: 2 (limit: 1149) >Memory: 3.3M >CGroup: /system.slice/chrony.service >├─401 /usr/sbin/chronyd -F -1 >└─403 /usr/sbin/chronyd -F -1 > > Jan 12 19:48:09 debian systemd[1]: chrony.service: Control process exited, > code=exited, status=0/SUCCESS > Jan 12 19:48:09 debian systemd[1]: chrony.service: Got final SIGCHLD for > state start-post. > Jan 12 19:48:09 debian systemd[1]: chrony.service: Changed start-post -> > running > Jan 12 19:48:09 debian systemd[1]: chrony.service: Job 75 > chrony.service/start finished, result=done > Jan 12 19:48:09 debian systemd[1]: Started chrony, an NTP client/server. > Jan 12 19:48:14 debian systemd[1]: chrony.service: chrony.service/stop would > stop a running service. > Jan 12 19:48:14 debian systemd[1]: chrony.service: Deleting > chrony.service/stop to minimize impact. > Jan 12 19:48:17 debian chronyd[401]: Selected source 162.159.200.1 > Jan 12 19:48:18 debian chronyd[401]: Selected source 173.249.58.145 > Jan 12 19:48:19 debian chronyd[401]: Source 129.70.132.37 replaced with > 2a01:4f8:110:12c8::123 > > ● systemd-timesyncd.service - Network Time Synchronization >Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; > vendor preset: enabled) > Drop-In: /usr/lib/systemd/system/systemd-timesyncd.service.d >└─disable-with-time-daemon.conf >Active: inactive (dead) > Docs: man:systemd-timesyncd.service(8) > Jan 12 19:48:08 debian systemd-sysv-generator[232]: Native unit for > chrony.service already exists, skipping. > Jan 12 19:48:09 debian audit[296]: AVC apparmor="STATUS" > operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" > pid=296 comm="apparmor_parser" > Jan 12 19:48:09 debian kernel: audit: type=1400 audit(1578854889.112:7): > apparmor="STATUS" operation="profile_load" profile="unconfined" > name="/usr/sbin/chronyd" pid=296 comm="apparmor_parser" > Jan 12 19:48:09 debian chronyd[401]: chronyd version 3.4 starting (+CMDMON > +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 > -DEBUG) > Jan 12 19:48:09 debian chronyd[401]: Frequency 0.000 +/- 100.000 ppm read > from /var/lib/chrony/chrony.drift > Jan 12 19:48:09 debian chronyd[401]: Loaded seccomp filter > Jan 12 19:48:09 debian systemd[1]: chrony.service: > ConditionCapability=CAP_SYS_TIME succeeded. > Jan 12 19:48:09 debian systemd[1]: chrony.service: Passing 0 fds to service > Jan 12 19:48:09 debian systemd[1]: chrony.service: About to execute: > /usr/sbin/chronyd $DAEMON_OPTS > Jan 12 19:48:09 debian systemd[1]: chrony.service: Forked /usr/sbin/chronyd > as 395 > Jan 12 19:48:09 debian systemd[1]: chrony.service: Changed dead -> start > Jan 12 19:48:09 debian systemd[1]: Starting chrony, an NTP client/server... > Jan 12 19:48:09 debian systemd[395]: Successfully mounted > /tmp/systemd-private-c32de5dec7564bdd85daa27163fecfdb-chrony.service-2LqUhJ/tmp > to /run/systemd/unit-root/tmp > Jan 12 19:48:09 debian systemd[395]: Successfully mounted > /var/tmp/systemd-private-c32de5dec7564bdd85daa27163fecfdb-chrony.service-aYqukO/tmp > to /run/systemd/unit-root/var/tmp > Jan 12 19:48:09 debian systemd[395]: chrony.service: Executing: > /usr/sbin/chronyd -F -1 > Jan 12 19:48:09 debian systemd[1]: Received SIGCHLD from PID 400 (chronyd). > Jan 12 19:48:09 debian systemd[1]: Child 395 (chronyd) died (code=exited, > status=0/SUCCESS) > Jan 12 19:48:09 debian systemd[1]: chrony.service: Child 395 belongs to > chrony.service. > Jan 12 19:48:09 debian systemd[1]: chrony.service: Control process exited, > code=exited, status=0/SUCCESS > Jan 12 19:48:09 debian systemd[1]: chrony.service: Got final SIGCHLD for > state start. > Jan 12 19:48:09 debian systemd[1]: chrony.service: New main PID 401 belongs > to service, we are happy. > Jan 12 19:48:09 debian systemd[1]: chrony.service: Main PID loaded: 401 > Jan 12 19:48:09 debian systemd[1]: chrony.service: About to execute: > /usr/lib/chrony/chrony-helper update-daemon >
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 04.01.20 um 20:22 schrieb Santiago Vila: > Hi. I managed to put the instances in a state where a reboot does not > make the problem to go away. Please give it a try. I could reproduce the issue on the system provided by Santiago. I'm dumping the relevant parts of the log here: > root@d1:~# journalctl -b | egrep "chrony.service|timesyncd.service" > Jan 12 17:01:51 d1 systemd-sysv-generator[202]: Native unit for > chrony.service already exists, skipping. > Jan 12 17:02:01 d1 systemd[1]: Pulling in systemd-timesyncd.service/start > from sysinit.target/start > Jan 12 17:02:01 d1 systemd[1]: Added job systemd-timesyncd.service/start to > transaction. > Jan 12 17:02:01 d1 systemd[1]: Pulling in system.slice/start from > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: Pulling in -.mount/start from > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: Pulling in time-sync.target/start from > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: Pulling in shutdown.target/stop from > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: Pulling in chrony.service/stop from > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: Added job chrony.service/stop to transaction. > Jan 12 17:02:02 d1 ifup[263]: Executing dbus call > org.freedesktop.systemd1.Manager TryRestartUnit(systemd-timesyncd.service, > replace) > Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start > because of sysinit.target/start > Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of > systemd-timesyncd.service/start > Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job > systemd-timesyncd.service/start as 119 > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start > finished, result=canceled > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job > chrony.service/stop as 121 > Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop > finished, result=done > Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: > ConditionFileIsExecutable=!/usr/sbin/chronyd failed. > Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Starting requested > but condition failed. Not starting unit. > Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Job 119 > systemd-timesyncd.service/start finished, result=done > Jan 12 17:02:02 d1 ifup[263]: Got result done/Success for job > systemd-timesyncd.service > Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Trying to enqueue > job systemd-timesyncd.service/try-restart/replace > Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/nop to > transaction. > Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Installed new job > systemd-timesyncd.service/nop as 155 > Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Enqueued job > systemd-timesyncd.service/nop as 155 > Jan 12 17:02:02 d1 systemd[1]: Pulling in systemd-timesyncd.service/start > from sysinit.target/start > Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/start to > transaction. > Jan 12 17:02:02 d1 systemd[1]: Pulling in system.slice/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in -.mount/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in time-sync.target/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in shutdown.target/stop from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in chrony.service/stop from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Added job chrony.service/stop to transaction. > Jan 12 17:02:02 d1 systemd[1]: Found redundant job chrony.service/stop, > dropping from transaction. > Jan 12 17:02:02 d1 systemd[1]: Garbage collecting job > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Job 155 > systemd-timesyncd.service/nop finished, result=done > Jan 12 17:02:02 d1 systemd[1]: Pulling in systemd-timesyncd.service/start > from sysinit.target/start > Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/start to > transaction. > Jan 12 17:02:02 d1 systemd[1]: Pulling in system.slice/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in -.mount/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in time-sync.target/start from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in shutdown.target/stop from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Pulling in chrony.service/stop from > systemd-timesyncd.service/start > Jan 12 17:02:02 d1 systemd[1]: Added job chrony.service/stop to transaction. > Jan 12 17:02:02 d1 systemd[1]: Found redundant job chrony.service/stop, > dropping from transaction. > Jan 12 17:02:02 d1
Bug#947936: chrony: Does (still) not start properly on boot on buster
Hi, Don't you think #873057 is actually the same issue discovered with ntp? Please consider merging it in if you agree. -- Thanks, Feri
Bug#947936: chrony: Does (still) not start properly on boot on buster
On Fri, Jan 03, 2020 at 11:21:42PM +0100, Vincent Blut wrote: > > Unfortunately, AFAIK, conflicts are bi-directional, so apparently the > > problem will persist in buster as far as chrony still has conflicts > > in the systemd unit file. > > What do you mean by “conflicts are bi-directional”? I meant that it is enough that service "A" declares a conflict with service "B" so that both of them conflict with the other, i.e. they have the "symmetric property" even if the conflicts is only declared once. At least that's how I interpret the words "vice versa" here: Conflicts= A space-separated list of unit names. Configures negative requirement dependencies. If a unit has a Conflicts= setting on another unit, starting the former will stop the latter and vice versa. So: If we tried to solve this in buster by dropping the use of conflicts, and chrony still uses conflicts, maybe that would explain that the problem is still there. > Also, conflicting with systemd-timesyncd doesn’t seem to cause any issue on > most systems (well, I hope ;-), so we should be cautious about incriminating > “Conflicts= systemd-timesyncd.service” use as the root cause. You are right. We should be cautious. I really don't know which is the root cause, but I can reproduce this failure in as many GCE machines as desired via cloning. So I have just created two different machines for you and Michael to see this by yourself (please check your private email for access details). My suspicious is that this is some sort of race condition, but I am not expert enough on systemd stuff to debug it. Thanks.
Bug#947936: chrony: Does (still) not start properly on boot on buster
Hi Santiago, On 2020-01-02T13:38+0100, Santiago Vila wrote: Package: chrony Version: 3.4-4 Severity: important Dear maintainer: Apparently, installing chrony does not ensure at all that it will work. Google has moved from ntp in Debian 9 to chrony in Debian 10 for their default Debian GCE images, and I discovered this on a lot of GCE instances having a clock several minutes off. The problem I found is very similar to the one described here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370 Indeed, it sounds pretty similar. I believe the best summary of the problem was made by Michael Biebl here: https://github.com/systemd/systemd/issues/7104#issuecomment-471329392 Quoting Michael: As it stands, the current practice of having systemd-timesyncd.service enabled by default (in Debian) and alternative implementations like chrony or ntpd declare Conflicts=systemd-timesyncd.service in their service file does not work reliably. AFAIK, this has been fixed on the systemd side in version 241-3 by dropping the "Conflicts" systemd had on chrony or ntpd. Exact, the Debian systemd maintainers reintroduced the following drop-in file in version 241-3: $ cat /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf [Unit] ConditionFileIsExecutable=!/usr/sbin/ntpd ConditionFileIsExecutable=!/usr/sbin/openntpd ConditionFileIsExecutable=!/usr/sbin/chronyd ConditionFileIsExecutable=!/usr/sbin/VBoxService It prevents systemd-timesyncd from starting if one of the above executables is present on the system. Unfortunately, AFAIK, conflicts are bi-directional, so apparently the problem will persist in buster as far as chrony still has conflicts in the systemd unit file. What do you mean by “conflicts are bi-directional”? Also, conflicting with systemd-timesyncd doesn’t seem to cause any issue on most systems (well, I hope ;-), so we should be cautious about incriminating “Conflicts= systemd-timesyncd.service” use as the root cause. Would you please tell me how things go when removing “ConditionFileIsExecutable=!/usr/sbin/chronyd” from the systemd-timesyncd service unit? Does that make chrony happy? Thanks. Cheers, Vincent signature.asc Description: PGP signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
Am 03.01.20 um 18:06 schrieb Santiago Vila: > reassign 947936 chrony,systemd > thanks > > I am unsure at this point where is exactly the problem. > > I have several test instances at GCE (running buster) where chrony > does not start even if it's enabled. > > By trial and error I managed to make chrony to start again > properly if I do this: > > journalctl --rotate > init 6 > > So the natural question would be: What kind of bug in systemd can make > it not to start a service at all when there is a "bad" journal? Can you provide instructions how this issue can be reproduced? systemd in buster (and testing/sid) ships /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf which will prevent systemd-timesyncd from being started if chrony is installed. I'm running chrony here (with systemd-timesyncd being enabled) and it works as expected. signature.asc Description: OpenPGP digital signature
Bug#947936: chrony: Does (still) not start properly on boot on buster
reassign 947936 chrony,systemd thanks I am unsure at this point where is exactly the problem. I have several test instances at GCE (running buster) where chrony does not start even if it's enabled. By trial and error I managed to make chrony to start again properly if I do this: journalctl --rotate init 6 So the natural question would be: What kind of bug in systemd can make it not to start a service at all when there is a "bad" journal? Thanks.
Bug#947936: chrony: Does (still) not start properly on boot on buster
Package: chrony Version: 3.4-4 Severity: important Dear maintainer: Apparently, installing chrony does not ensure at all that it will work. Google has moved from ntp in Debian 9 to chrony in Debian 10 for their default Debian GCE images, and I discovered this on a lot of GCE instances having a clock several minutes off. The problem I found is very similar to the one described here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370 I believe the best summary of the problem was made by Michael Biebl here: https://github.com/systemd/systemd/issues/7104#issuecomment-471329392 Quoting Michael: > As it stands, the current practice of having systemd-timesyncd.service > enabled by default (in Debian) and alternative implementations like > chrony or ntpd declare Conflicts=systemd-timesyncd.service in their > service file does not work reliably. AFAIK, this has been fixed on the systemd side in version 241-3 by dropping the "Conflicts" systemd had on chrony or ntpd. Unfortunately, AFAIK, conflicts are bi-directional, so apparently the problem will persist in buster as far as chrony still has conflicts in the systemd unit file. I've noticed this problem happens randomly (it happens in some instances, it does not happens in some others), so I don't have a "recipe" as such to reproduce it. However, I have a particular instance at GCE showing this behaviour which I could try to clone to give you ssh access if required (please contact me privately for details). The behaviour is the following: Both systemd-timesyncd and chrony are enabled (which is the default on GCE instances). Just after a reboot, "systemctl status chrony" shows this: ● chrony.service - chrony, an NTP client/server Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled) Active: inactive (dead) Docs: man:chronyd(8) man:chronyc(1) man:chrony.conf(5) and I see this in the boot log: systemd[1]: Condition check resulted in Network Time Synchronization being skipped. If I do "systemctl disable systemd-timesyncd" and reboot, chrony is properly loaded and it runs. If I do "systemctl enable systemd-timesyncd" again and reboot, chrony is shown as "inactive (dead)" again. At this point, if I edit /etc/systemd/system/multi-user.target.wants/chrony.service to remove the Conflicts line and reboot, chrony is properly loaded and it runs. [ I'm reporting this as "important" because I believe it to be the kind of bug that should be fixed in a point release of Debian 10 ]. [ Cc to Michael Biebl in case he would like to comment on the issue ]. Thanks.