Bug#933139: bind9 fails on startup: can't find /var/cache/bind

2019-07-29 Thread Ross Boylan
I think I fixed the problem.  I added the RequiresMountsFor=/var to
resolvconf, deleted everything below /var on the root partition, and
changed /etc/resolvconf/update.d/bind9 to use /run/named instead of
/var/run/named (the actual reference in named.conf.options already
used /run without the /var prefix).  Combined with the previous
changes (establishing RequiresMountsFor=/var for bind9 and
bind9-resolvconf) seems to do the trick.  I started the system twice,
and both times bind9 started successfully on boot.

What combination of changes are essential I don't know, though clearly
at least one of the changes in the last round was.

The change from /var/run to /run was inspired by internet discussion
of somewhat similar problems, in which the fact that /var/run was a
symlink made dependencies hard to work out.  It has the added bonus of
removing the dependency on /var being mounted for named.resolvers (the
file created by the bind9 script for resolvconf).  Of course, /run has
to be mounted, but I guess that's already taken care of.

The root partition's /var no longer has stuff being written to it on startup.

I'm sure there are some general lessons to be learned, though I'm not
sure what they are :)  It's not clear to me if the presence to stuff
below the root partitions /var, or the writing of files there, fooled
or disabled RequiresMountsFor as you suspected, but it seems possible.

I leave it to you to decide if any changes to the bind package are in
order.  The problem seems to require the interaction of a separately
mounted /var (which I would guess is typical) and the effort to
integrate resolvconf and bind9 (I think I enabled bind9-resolvconf
manually).  My recollection is the current package provides only
general guidance about working with resolvconf, but I think specifics
would be helpful.

Ross



Bug#933139: Fwd: Bug#933139: bind9 fails on startup: can't find /var/cache/bind

2019-07-28 Thread Ross Boylan
And for the bug report:
-- Forwarded message -
From: Ross Boylan 
Date: Sun, Jul 28, 2019 at 5:11 PM
Subject: Re: Bug#933139: bind9 fails on startup: can't find /var/cache/bind
To: Bernhard Schmidt 


On Sun, Jul 28, 2019 at 12:36 PM Ross Boylan
 wrote:

> On Sun, Jul 28, 2019 at 11:42 AM Bernhard Schmidt  wrote:
.
> > > The root partition includes /var/cache, but no /var/cache/bind.
> > > When the partition with /var is mounted it has a /var/cache/bind
> > > directory.
> >
> > Is there a reason for this? I think having a non-empty mount target is
> > somewhat unusual, maybe this is causing the systemd mount dependency
> > logic to fail.
>
> I think it's a historical accident of how I setup the system; I don't
> have any principled reason for the setup.
>
> I can try making /var empty on the root partition and test the next
> time I have a chance to reboot.
Emptying /var seemed to work once in conjunction with adding a
dependency on /var for bind9-resolvconf, but then it failed.
It has also proven difficult to keep the root partitions /var clear of
content.  I'm suspecting an interaction between bind9,
bind9-resolvconf, and resolvconf itself with a bind9 extension.

In order:

1. Deleted everything under root's var and restarted.  Same failure to
start bind9 as before.
root's var had an empty tmp directory, timestamped well before
restart, and a run subdirectory timestamped at the system start.  It
had only named/named.resolvers, a stub file with only
// named.conf fragment automatically generated by /etc/resolvconf/update.d/bind9
// DO NOT EDIT THIS FILE.  Instead edit /etc/bind/named.conf.options .
This directed my attention to named-resolvconf, to which I added an
override to tell it to wait for /var.
I may have forgotten to clear out root's var.

2. Success!  However, root's var still has named.resolvers,
timestamped with the time from step 1, and the directories above it.
The named directory is timestamped at this steps startup; run (above
it) has step 1's timestamp.  named.resolvers has the same trivial
contents as before.
rm all files and dirs below root's var. Reboot.

3. Failure!  root's /var has the same files as in 2, now all
timestamped with step 3 startup time.  Delete them all and  reboot.

4. Still failure.  I notice that scripts in resolvconf (see below) may
be involved and added a RequiresMountsFor=var to resolvconf.service.
I have not had a chance to test that.

Comments on the relations between the services:
Comment 1
The startup order appears to be resolvconf first, then bind9, and
finally bind9-resolvconf, where each must complete before the next
starts.  resolvconf is Before network.target while bind9 is After it;
bind9-resolvconf is After bind9. However, bind9.service.wants has a
link to bind9-resolvconf (I may have added that), which contradicts
bind9 resolvconf's After directive.  I think the Wants will be
ignored, but maybe it's part of the problem?

Comment 2
The resolvconf service runs the resolvconf program, which in turn
triggers the update.d/bind9 script shown  below.  This script writes
to /var/run/named/named.resolvers, exactly the file mentioned above as
appearing on the root partition's var.  The script is based on those
formerly distributed with bind, though I don't think it's in the
current release (I recall it was carved out of a more elaborate
script).

bind9-resolvconf also calls the resolvconf script, presumably
triggering the same update.d/bind9 script.

The rest of this message lists the contents of different files.

# systemctl cat bind9 ---
# /lib/systemd/system/bind9.service
[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network.target
Wants=nss-lookup.target
Before=nss-lookup.target

[Service]
Type=forking
EnvironmentFile=-/etc/default/bind9
ExecStart=/usr/sbin/named $OPTIONS
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/bind9.service.d/override.conf
[Unit]
RequiresMountsFor=/var

# systemctl cat bind9-resolvconf --
# /lib/systemd/system/bind9-resolvconf.service
[Unit]
Description=local BIND via resolvconf
Documentation=man:named(8) man:resolvconf(8)
PartOf=bind9.service
After=bind9.service
ConditionFileIsExecutable=/sbin/resolvconf

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo nameserver 127.0.0.1 | /sbin/resolvconf -a lo.named'
ExecStop=/sbin/resolvconf -d lo.named

[Install]
WantedBy=bind9.service

# /etc/systemd/system/bind9-resolvconf.service.d/override.conf
[Unit]
RequiresMountsFor=/var

# systemctl cat resolvconf
---
# /lib/systemd/system/resolvconf.service
[Unit]
Description=Nameserver information manager
Documentation=man:resolvconf(8)
DefaultDependencies=no
Before=networking.service


Bug#933139: Fwd: Bug#933139: bind9 fails on startup: can't find /var/cache/bind

2019-07-28 Thread Ross Boylan
Oops, forgot to cc the bug.

-- Forwarded message -
From: Ross Boylan 
Date: Sun, Jul 28, 2019 at 12:36 PM
Subject: Re: Bug#933139: bind9 fails on startup: can't find /var/cache/bind
To: Bernhard Schmidt 
Cc: Ross Boylan 


Thank you for your response.  Answers interpolated below.

On Sun, Jul 28, 2019 at 11:42 AM Bernhard Schmidt  wrote:
>
> Control: tags -1 moreinfo
>
> Am 26.07.19 um 22:50 schrieb Ross Boylan:
>
> Hi Ross,
>
> > * What led up to the situation?
> >
> >Start or restart the system.  bind9 attempts to start, but fails
> >with the error that it is unable to cd to /var/cache/bind.  This
> >happens every time.
> >
> >New installation of buster with existing customizations from an old
> >system ported over, with adoptions.
> >
> >resolvconf in use.
> >
> >/var is a separate partition.  See details below for my theory the
> >failure arises from a race condition in which bind starts before
> >/var is mounted.
>
> I think this is more of a systemd bug or local misconfiguration than a
> bind9 bug.

That's my main suspicion, but I thought it best to start with the
package reporting the error.
Also, it would be nice if bind9 worked with mountable /var out of the
box (if that is the problem).

BTW I have attempted to sign up for the general (but called
"developer") systemd list at freedesktop, but have not received the
confirming message to which I must reply.

>
> > The root partition includes /var/cache, but no /var/cache/bind.
> > When the partition with /var is mounted it has a /var/cache/bind
> > directory.
>
> Is there a reason for this? I think having a non-empty mount target is
> somewhat unusual, maybe this is causing the systemd mount dependency
> logic to fail.

I think it's a historical accident of how I setup the system; I don't
have any principled reason for the setup.

I can try making /var empty on the root partition and test the next
time I have a chance to reboot.
>
> > /var is encrypted on top of lvm, so it takes a bunch of steps to mount
> > it.  However, it does not require me to enter any new passwords to
> > decrypt it.  The root partition does require me to enter a password,
> > which happens early in startup.
>
> Are there manual steps involved mounting /var?
Given that / is mounted, no.
>
> How does your /etc/fstab look like?
UUIDs obscured:
--fstab-
/dev/mapper/vgbarley-root_crypt /   ext4
errors=remount-ro 0   1
UUID=xx /boot   ext2defaults0   2
/dev/mapper/vgbarley-home_crypt /home   ext4defaults
 0   2
/dev/mapper/vgbarley-usr /usrext4defaults0   2
/dev/mapper/vgbarley-var_crypt /varext4defaults0   2
/dev/sr0/media/cdrom0   udf,iso9660 user,noauto 0   0
/dev/vgbarley/media10 /srv/media10  xfs noatime,nodiratime,allocsize=512m 0 0
/dev/vgbarley/cache   /var/local/cacheext4 defaults,noatime  0 0

--crypttab
vgbarley-home_crypt UUID=yy none luks,discard
vgbarley-root_crypt UUID=zz none luks,discard
vgbarley-var_crypt UUID=ww none luks,discard
---
The UUIDs in crypttab point at the correspondingly named logical
volumes on vgbarley, e.g., vgbarley/var for var.

So on startup lvm must activate vgbarley, and attempts to activate
some other volume groups.  This is a bit tricky, since some of the
other volume groups are missing some physical disks, introducing delay
and partial activation for those groups.
I also have encrypted physical partitions and RAID; again neither is
involved with vgbarley.
Once vgbarley is active some of its logical volumes, including root
and var, must be decrypted.
This requires entering a password on startup; since the same one works
for root and var, there is no separate prompt for var.
Underlying disks are GPT.

Ross



Bug#933139: bind9 fails on startup: can't find /var/cache/bind

2019-07-28 Thread Bernhard Schmidt
Control: tags -1 moreinfo

Am 26.07.19 um 22:50 schrieb Ross Boylan:

Hi Ross,

> * What led up to the situation?
> 
>Start or restart the system.  bind9 attempts to start, but fails
>with the error that it is unable to cd to /var/cache/bind.  This
>happens every time.
> 
>New installation of buster with existing customizations from an old
>system ported over, with adoptions.
> 
>resolvconf in use.
> 
>/var is a separate partition.  See details below for my theory the
>failure arises from a race condition in which bind starts before
>/var is mounted.

I think this is more of a systemd bug or local misconfiguration than a
bind9 bug.

> The root partition includes /var/cache, but no /var/cache/bind.
> When the partition with /var is mounted it has a /var/cache/bind
> directory.

Is there a reason for this? I think having a non-empty mount target is
somewhat unusual, maybe this is causing the systemd mount dependency
logic to fail.

> /var is encrypted on top of lvm, so it takes a bunch of steps to mount
> it.  However, it does not require me to enter any new passwords to
> decrypt it.  The root partition does require me to enter a password,
> which happens early in startup.

Are there manual steps involved mounting /var?

How does your /etc/fstab look like?

Bernhard



Bug#933139: bind9 fails on startup: can't find /var/cache/bind

2019-07-26 Thread Ross Boylan
Package: bind9
Version: 1:9.11.5.P4+dfsg-5.1
Severity: normal

Dear Maintainer,

Severity is import for me, but obviously this isn't happening to
everyone.  Every time I start bind9 fails, and all the other startup
services that require working DNS end up in various semi-broken
states. I can fix this after the system is up.

* What led up to the situation?

   Start or restart the system.  bind9 attempts to start, but fails
   with the error that it is unable to cd to /var/cache/bind.  This
   happens every time.

   New installation of buster with existing customizations from an old
   system ported over, with adoptions.

   resolvconf in use.

   /var is a separate partition.  See details below for my theory the
   failure arises from a race condition in which bind starts before
   /var is mounted.

* What exactly did you do (or not do) that was effective (or
 ineffective)?

   1. ; I have added /etc/systemd/system/bind9.service.d/override.conf with
   [Unit]
   RequiresMountsFor=/var
   
   but it didn't help.

   2. *After* the system has started I can start bind successfully.
   Then I need to restart other services that were messed up by lack
   of DNS.

   3. Reviewed and maybe added some entries to the apparmor profile.
   This helped with some of my customizations, but a simple apparmor
   problem would not block initial access to a directory and then
   permit later access.

   4. Discuss on debian-user.

* What outcome did you expect instead?
   I expected bind would start the first time.

* Details

The root partition includes /var/cache, but no /var/cache/bind.
When the partition with /var is mounted it has a /var/cache/bind
directory.

So it seems most likely the initial startup is happening before /var
is mounted, and fails when it can't find /var/cache/bind, while the
later startups happen after /var is mounted and so succeed.

The default systemd is controlling startup.

I'm not sure what I need to do to express this dependency properly;
either my RequireMountsFor does not have the effects my reading of the
documentation suggests it should, or my override is not actually
taking effect.

/var is encrypted on top of lvm, so it takes a bunch of steps to mount
it.  However, it does not require me to enter any new passwords to
decrypt it.  The root partition does require me to enter a password,
which happens early in startup.

/etc/systemd/system/bind9.service.wants/bind9-resolvconf.service is a
symlink to /lib/systemd/system/bind9-resolvconf.service.  I believe it
is present because I deliberately activated it.  This might be
inconsistent with the debconf run-resolvconf setting (of false).

I'm master for my local domain with files in /var/lib/bind; I have
inside and outside views.

I'm using /run/named/named.resolvers as suggested by this package (at
least in earlier versions) and scripts based on those previously
distributed with this package.

Logs:
-- Logs begin at Sat 2019-05-11 16:11:40 PDT, end at Sat 2019-05-11 17:08:02 
PDT. --
May 11 16:11:42 barley named[609]: linked to libjson-c version: 0.12.1
May 11 16:11:42 barley named[609]: threads support is enabled
May 11 16:11:42 barley named[609]: 

May 11 16:11:42 barley named[609]: BIND 9 is maintained by Internet Systems 
Consortium,
May 11 16:11:42 barley named[609]: Inc. (ISC), a non-profit 501(c)(3) 
public-benefit
May 11 16:11:42 barley named[609]: corporation.  Support and training for BIND 
9 are
May 11 16:11:42 barley named[609]: available at https://www.isc.org/support
May 11 16:11:42 barley named[609]: 

May 11 16:11:42 barley named[609]: adjusted limit on open files from 524288 to 
1048576
May 11 16:11:42 barley named[609]: found 8 CPUs, using 8 worker threads
May 11 16:11:42 barley named[609]: using 7 UDP listeners per interface
May 11 16:11:42 barley named[609]: using up to 4096 sockets
May 11 16:11:44 barley named[609]: loading configuration from 
'/etc/bind/named.conf'
May 11 16:11:44 barley named[609]: /etc/bind/named.conf.options:2: change 
directory to '/var/cache/bind' failed: file not found
May 11 16:11:44 barley named[609]: /etc/bind/named.conf.options:2: parsing 
failed: file not found
May 11 16:11:44 barley named[609]: loading configuration: file not found
May 11 16:11:44 barley named[609]: exiting (due to fatal error)


-- System Information:
Debian Release: 10.0
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-5-amd64 (SMP w/8 CPU cores)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, 
TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE= 
(charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages bind9 depends on:
ii  adduser3.118
ii  bind9utils 1:9.11.5.P4+dfsg-5.1
ii  debconf