Bug#933139: bind9 fails on startup: can't find /var/cache/bind
I think I fixed the problem. I added the RequiresMountsFor=/var to resolvconf, deleted everything below /var on the root partition, and changed /etc/resolvconf/update.d/bind9 to use /run/named instead of /var/run/named (the actual reference in named.conf.options already used /run without the /var prefix). Combined with the previous changes (establishing RequiresMountsFor=/var for bind9 and bind9-resolvconf) seems to do the trick. I started the system twice, and both times bind9 started successfully on boot. What combination of changes are essential I don't know, though clearly at least one of the changes in the last round was. The change from /var/run to /run was inspired by internet discussion of somewhat similar problems, in which the fact that /var/run was a symlink made dependencies hard to work out. It has the added bonus of removing the dependency on /var being mounted for named.resolvers (the file created by the bind9 script for resolvconf). Of course, /run has to be mounted, but I guess that's already taken care of. The root partition's /var no longer has stuff being written to it on startup. I'm sure there are some general lessons to be learned, though I'm not sure what they are :) It's not clear to me if the presence to stuff below the root partitions /var, or the writing of files there, fooled or disabled RequiresMountsFor as you suspected, but it seems possible. I leave it to you to decide if any changes to the bind package are in order. The problem seems to require the interaction of a separately mounted /var (which I would guess is typical) and the effort to integrate resolvconf and bind9 (I think I enabled bind9-resolvconf manually). My recollection is the current package provides only general guidance about working with resolvconf, but I think specifics would be helpful. Ross
Bug#933139: Fwd: Bug#933139: bind9 fails on startup: can't find /var/cache/bind
And for the bug report: -- Forwarded message - From: Ross Boylan Date: Sun, Jul 28, 2019 at 5:11 PM Subject: Re: Bug#933139: bind9 fails on startup: can't find /var/cache/bind To: Bernhard Schmidt On Sun, Jul 28, 2019 at 12:36 PM Ross Boylan wrote: > On Sun, Jul 28, 2019 at 11:42 AM Bernhard Schmidt wrote: . > > > The root partition includes /var/cache, but no /var/cache/bind. > > > When the partition with /var is mounted it has a /var/cache/bind > > > directory. > > > > Is there a reason for this? I think having a non-empty mount target is > > somewhat unusual, maybe this is causing the systemd mount dependency > > logic to fail. > > I think it's a historical accident of how I setup the system; I don't > have any principled reason for the setup. > > I can try making /var empty on the root partition and test the next > time I have a chance to reboot. Emptying /var seemed to work once in conjunction with adding a dependency on /var for bind9-resolvconf, but then it failed. It has also proven difficult to keep the root partitions /var clear of content. I'm suspecting an interaction between bind9, bind9-resolvconf, and resolvconf itself with a bind9 extension. In order: 1. Deleted everything under root's var and restarted. Same failure to start bind9 as before. root's var had an empty tmp directory, timestamped well before restart, and a run subdirectory timestamped at the system start. It had only named/named.resolvers, a stub file with only // named.conf fragment automatically generated by /etc/resolvconf/update.d/bind9 // DO NOT EDIT THIS FILE. Instead edit /etc/bind/named.conf.options . This directed my attention to named-resolvconf, to which I added an override to tell it to wait for /var. I may have forgotten to clear out root's var. 2. Success! However, root's var still has named.resolvers, timestamped with the time from step 1, and the directories above it. The named directory is timestamped at this steps startup; run (above it) has step 1's timestamp. named.resolvers has the same trivial contents as before. rm all files and dirs below root's var. Reboot. 3. Failure! root's /var has the same files as in 2, now all timestamped with step 3 startup time. Delete them all and reboot. 4. Still failure. I notice that scripts in resolvconf (see below) may be involved and added a RequiresMountsFor=var to resolvconf.service. I have not had a chance to test that. Comments on the relations between the services: Comment 1 The startup order appears to be resolvconf first, then bind9, and finally bind9-resolvconf, where each must complete before the next starts. resolvconf is Before network.target while bind9 is After it; bind9-resolvconf is After bind9. However, bind9.service.wants has a link to bind9-resolvconf (I may have added that), which contradicts bind9 resolvconf's After directive. I think the Wants will be ignored, but maybe it's part of the problem? Comment 2 The resolvconf service runs the resolvconf program, which in turn triggers the update.d/bind9 script shown below. This script writes to /var/run/named/named.resolvers, exactly the file mentioned above as appearing on the root partition's var. The script is based on those formerly distributed with bind, though I don't think it's in the current release (I recall it was carved out of a more elaborate script). bind9-resolvconf also calls the resolvconf script, presumably triggering the same update.d/bind9 script. The rest of this message lists the contents of different files. # systemctl cat bind9 --- # /lib/systemd/system/bind9.service [Unit] Description=BIND Domain Name Server Documentation=man:named(8) After=network.target Wants=nss-lookup.target Before=nss-lookup.target [Service] Type=forking EnvironmentFile=-/etc/default/bind9 ExecStart=/usr/sbin/named $OPTIONS ExecReload=/usr/sbin/rndc reload ExecStop=/usr/sbin/rndc stop [Install] WantedBy=multi-user.target # /etc/systemd/system/bind9.service.d/override.conf [Unit] RequiresMountsFor=/var # systemctl cat bind9-resolvconf -- # /lib/systemd/system/bind9-resolvconf.service [Unit] Description=local BIND via resolvconf Documentation=man:named(8) man:resolvconf(8) PartOf=bind9.service After=bind9.service ConditionFileIsExecutable=/sbin/resolvconf [Service] Type=oneshot RemainAfterExit=yes ExecStart=/bin/sh -c 'echo nameserver 127.0.0.1 | /sbin/resolvconf -a lo.named' ExecStop=/sbin/resolvconf -d lo.named [Install] WantedBy=bind9.service # /etc/systemd/system/bind9-resolvconf.service.d/override.conf [Unit] RequiresMountsFor=/var # systemctl cat resolvconf --- # /lib/systemd/system/resolvconf.service [Unit] Description=Nameserver information manager Documentation=man:resolvconf(8) DefaultDependencies=no Before=networking.service
Bug#933139: Fwd: Bug#933139: bind9 fails on startup: can't find /var/cache/bind
Oops, forgot to cc the bug. -- Forwarded message - From: Ross Boylan Date: Sun, Jul 28, 2019 at 12:36 PM Subject: Re: Bug#933139: bind9 fails on startup: can't find /var/cache/bind To: Bernhard Schmidt Cc: Ross Boylan Thank you for your response. Answers interpolated below. On Sun, Jul 28, 2019 at 11:42 AM Bernhard Schmidt wrote: > > Control: tags -1 moreinfo > > Am 26.07.19 um 22:50 schrieb Ross Boylan: > > Hi Ross, > > > * What led up to the situation? > > > >Start or restart the system. bind9 attempts to start, but fails > >with the error that it is unable to cd to /var/cache/bind. This > >happens every time. > > > >New installation of buster with existing customizations from an old > >system ported over, with adoptions. > > > >resolvconf in use. > > > >/var is a separate partition. See details below for my theory the > >failure arises from a race condition in which bind starts before > >/var is mounted. > > I think this is more of a systemd bug or local misconfiguration than a > bind9 bug. That's my main suspicion, but I thought it best to start with the package reporting the error. Also, it would be nice if bind9 worked with mountable /var out of the box (if that is the problem). BTW I have attempted to sign up for the general (but called "developer") systemd list at freedesktop, but have not received the confirming message to which I must reply. > > > The root partition includes /var/cache, but no /var/cache/bind. > > When the partition with /var is mounted it has a /var/cache/bind > > directory. > > Is there a reason for this? I think having a non-empty mount target is > somewhat unusual, maybe this is causing the systemd mount dependency > logic to fail. I think it's a historical accident of how I setup the system; I don't have any principled reason for the setup. I can try making /var empty on the root partition and test the next time I have a chance to reboot. > > > /var is encrypted on top of lvm, so it takes a bunch of steps to mount > > it. However, it does not require me to enter any new passwords to > > decrypt it. The root partition does require me to enter a password, > > which happens early in startup. > > Are there manual steps involved mounting /var? Given that / is mounted, no. > > How does your /etc/fstab look like? UUIDs obscured: --fstab- /dev/mapper/vgbarley-root_crypt / ext4 errors=remount-ro 0 1 UUID=xx /boot ext2defaults0 2 /dev/mapper/vgbarley-home_crypt /home ext4defaults 0 2 /dev/mapper/vgbarley-usr /usrext4defaults0 2 /dev/mapper/vgbarley-var_crypt /varext4defaults0 2 /dev/sr0/media/cdrom0 udf,iso9660 user,noauto 0 0 /dev/vgbarley/media10 /srv/media10 xfs noatime,nodiratime,allocsize=512m 0 0 /dev/vgbarley/cache /var/local/cacheext4 defaults,noatime 0 0 --crypttab vgbarley-home_crypt UUID=yy none luks,discard vgbarley-root_crypt UUID=zz none luks,discard vgbarley-var_crypt UUID=ww none luks,discard --- The UUIDs in crypttab point at the correspondingly named logical volumes on vgbarley, e.g., vgbarley/var for var. So on startup lvm must activate vgbarley, and attempts to activate some other volume groups. This is a bit tricky, since some of the other volume groups are missing some physical disks, introducing delay and partial activation for those groups. I also have encrypted physical partitions and RAID; again neither is involved with vgbarley. Once vgbarley is active some of its logical volumes, including root and var, must be decrypted. This requires entering a password on startup; since the same one works for root and var, there is no separate prompt for var. Underlying disks are GPT. Ross
Bug#933139: bind9 fails on startup: can't find /var/cache/bind
Control: tags -1 moreinfo Am 26.07.19 um 22:50 schrieb Ross Boylan: Hi Ross, > * What led up to the situation? > >Start or restart the system. bind9 attempts to start, but fails >with the error that it is unable to cd to /var/cache/bind. This >happens every time. > >New installation of buster with existing customizations from an old >system ported over, with adoptions. > >resolvconf in use. > >/var is a separate partition. See details below for my theory the >failure arises from a race condition in which bind starts before >/var is mounted. I think this is more of a systemd bug or local misconfiguration than a bind9 bug. > The root partition includes /var/cache, but no /var/cache/bind. > When the partition with /var is mounted it has a /var/cache/bind > directory. Is there a reason for this? I think having a non-empty mount target is somewhat unusual, maybe this is causing the systemd mount dependency logic to fail. > /var is encrypted on top of lvm, so it takes a bunch of steps to mount > it. However, it does not require me to enter any new passwords to > decrypt it. The root partition does require me to enter a password, > which happens early in startup. Are there manual steps involved mounting /var? How does your /etc/fstab look like? Bernhard
Bug#933139: bind9 fails on startup: can't find /var/cache/bind
Package: bind9 Version: 1:9.11.5.P4+dfsg-5.1 Severity: normal Dear Maintainer, Severity is import for me, but obviously this isn't happening to everyone. Every time I start bind9 fails, and all the other startup services that require working DNS end up in various semi-broken states. I can fix this after the system is up. * What led up to the situation? Start or restart the system. bind9 attempts to start, but fails with the error that it is unable to cd to /var/cache/bind. This happens every time. New installation of buster with existing customizations from an old system ported over, with adoptions. resolvconf in use. /var is a separate partition. See details below for my theory the failure arises from a race condition in which bind starts before /var is mounted. * What exactly did you do (or not do) that was effective (or ineffective)? 1. ; I have added /etc/systemd/system/bind9.service.d/override.conf with [Unit] RequiresMountsFor=/var but it didn't help. 2. *After* the system has started I can start bind successfully. Then I need to restart other services that were messed up by lack of DNS. 3. Reviewed and maybe added some entries to the apparmor profile. This helped with some of my customizations, but a simple apparmor problem would not block initial access to a directory and then permit later access. 4. Discuss on debian-user. * What outcome did you expect instead? I expected bind would start the first time. * Details The root partition includes /var/cache, but no /var/cache/bind. When the partition with /var is mounted it has a /var/cache/bind directory. So it seems most likely the initial startup is happening before /var is mounted, and fails when it can't find /var/cache/bind, while the later startups happen after /var is mounted and so succeed. The default systemd is controlling startup. I'm not sure what I need to do to express this dependency properly; either my RequireMountsFor does not have the effects my reading of the documentation suggests it should, or my override is not actually taking effect. /var is encrypted on top of lvm, so it takes a bunch of steps to mount it. However, it does not require me to enter any new passwords to decrypt it. The root partition does require me to enter a password, which happens early in startup. /etc/systemd/system/bind9.service.wants/bind9-resolvconf.service is a symlink to /lib/systemd/system/bind9-resolvconf.service. I believe it is present because I deliberately activated it. This might be inconsistent with the debconf run-resolvconf setting (of false). I'm master for my local domain with files in /var/lib/bind; I have inside and outside views. I'm using /run/named/named.resolvers as suggested by this package (at least in earlier versions) and scripts based on those previously distributed with this package. Logs: -- Logs begin at Sat 2019-05-11 16:11:40 PDT, end at Sat 2019-05-11 17:08:02 PDT. -- May 11 16:11:42 barley named[609]: linked to libjson-c version: 0.12.1 May 11 16:11:42 barley named[609]: threads support is enabled May 11 16:11:42 barley named[609]: May 11 16:11:42 barley named[609]: BIND 9 is maintained by Internet Systems Consortium, May 11 16:11:42 barley named[609]: Inc. (ISC), a non-profit 501(c)(3) public-benefit May 11 16:11:42 barley named[609]: corporation. Support and training for BIND 9 are May 11 16:11:42 barley named[609]: available at https://www.isc.org/support May 11 16:11:42 barley named[609]: May 11 16:11:42 barley named[609]: adjusted limit on open files from 524288 to 1048576 May 11 16:11:42 barley named[609]: found 8 CPUs, using 8 worker threads May 11 16:11:42 barley named[609]: using 7 UDP listeners per interface May 11 16:11:42 barley named[609]: using up to 4096 sockets May 11 16:11:44 barley named[609]: loading configuration from '/etc/bind/named.conf' May 11 16:11:44 barley named[609]: /etc/bind/named.conf.options:2: change directory to '/var/cache/bind' failed: file not found May 11 16:11:44 barley named[609]: /etc/bind/named.conf.options:2: parsing failed: file not found May 11 16:11:44 barley named[609]: loading configuration: file not found May 11 16:11:44 barley named[609]: exiting (due to fatal error) -- System Information: Debian Release: 10.0 APT prefers stable APT policy: (500, 'stable') Architecture: amd64 (x86_64) Kernel: Linux 4.19.0-5-amd64 (SMP w/8 CPU cores) Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE= (charmap=UTF-8) Shell: /bin/sh linked to /usr/bin/dash Init: systemd (via /run/systemd/system) LSM: AppArmor: enabled Versions of packages bind9 depends on: ii adduser3.118 ii bind9utils 1:9.11.5.P4+dfsg-5.1 ii debconf