Re: Bug#904558: What should happen when maintscripts fail to restart a service

2019-04-17 Thread Margarita Manterola

Apologies for the long delay.

We discussed this issue in several TC meetings without being able to make
real progress.

After several rounds of discussions we came to the conclusion that the
reason why we can't make progress is that we always end up hitting the 
of "The Technical Committee does not engage in design of new proposals and
policies". While we recognize that this is a problem worth fixing, this is
not something that we can fix as a body and need the help of the 

to do it.

On the one hand, maintainers want to be able to notify sysadmins when
things don't go as expected. On the other hand, sysadmins don't want their
systems to be left in weird/broken states because one single thing didn't
go as expected.

A failing maintscript is a horrible way of notifying sysadmins, but it's
the only one available up to now and so package maintainers use it when
they think the failure is critical enough.

So, the TC declines to rule on what maintscripts should do when failing to
(re)start a service (or otherwise encountering a similarly serious

Instead, we recommend that a work group of developers is formed, to 
a better mechanism of notification that can be used to let sysadmins know
when things don't go as expected on their systems, without leaving the
machines in weird/broken states. Given that this is a problem faced by
Linux distributions, it would be nice if this mechanism was developed and
published in a non-Debian-specific way that made it also available for other
distributions to use.

Once that mechanism exists, we would strongly recommend that almost all
failures use this mechanism, instead of failing maintscripts.

Marga, on behalf of the Technical Committee

Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-18 Thread Wouter Verhelst

On Wed, Oct 17, 2018 at 09:47:57PM +0100, Simon McVittie wrote:
> However, it leaves the default as "fail hard", which I'm not convinced
> is the most appropriate thing for systems that lack an experienced
> sysadmin (which are the systems where defaults matter most, because an
> inexperienced user is the least able to make an informed decision about
> where they should deviate from defaults).

I think that's where we disagree, so allow me to focus on that.

I think everyone would agree that when a service fails to (re)start upon
package installation or upgrade, that there is a problem and that this
problem needs to be reported in whatever way is most appropriate (if
not, we have a bigger disagreement than I thought and we need to take a
step back ;-)

The question that remains is "how". Currently, Debian has four ways of
informing a system administrator of such failures:

- Log a message to stdout and/or stderr. This is liable to scroll by
  unnoticed, and therefore is not a reliable way to inform the system
  administrator. For that reason, I don't think it's a good idea.
- Log a message to syslog and/or the systemd journal. This will not
  scroll by, but relies on the system administrator to actively hunt for
  problems in system logs, which they probably won't do unless and until
  they notice that the daemon isn't running anymore (and by that time it
  may be too late).
- Produce a debconf error note. This is mildly better than the above
  two, since debconf error notes are shown at highest priority, and
  therefore will only be hidden if debconf is configured to be
  noninteractive; in that case, debconf will send an email to root. On
  systems without a configured MTA, this will not help; and for daemons
  where failure to restart is a catastrophe that needs to be resolved
  ASAP, such as sshd, this might not be desirable.
- Exit from postinst with nonzero exit state. This is unlikely to be
  missed by system administrators; however, it has several disadvantages
  that were pointed out by other people during this discussion.
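
To make the trade-offs between these options concrete, here is a minimal
sketch of a helper a postinst could call after a failed restart. The
function name report_restart_failure is hypothetical and not part of any
Debian tool; the debconf route is left out because it needs a template.

```sh
#!/bin/sh
# Sketch only: report_restart_failure is a hypothetical helper.

report_restart_failure() {
    service=$1
    msg="postinst: failed to restart $service"

    # Option 1: stderr. Visible right now, but may scroll by unnoticed.
    echo "$msg" >&2

    # Option 2: syslog/journal. Persistent, but somebody has to go looking.
    if command -v logger >/dev/null 2>&1; then
        logger -t postinst -p daemon.err "$msg" 2>/dev/null || true
    fi

    # Option 3 (debconf error note) would go here; omitted in this sketch.

    # Option 4: nonzero status, which the caller may turn into a
    # postinst failure or deliberately ignore.
    return 1
}
```

A caller that wants the "fail hard" behaviour could write something like
`invoke-rc.d foo restart || report_restart_failure foo || exit 1`, while a
caller that only wants the message could append `|| true` instead.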

I think it is perfectly fine to have the TC say that "failures to
restart a service must be reported, either by exiting nonzero, or by
another appropriate action", without going in detail what those other
actions could be.

> policy-rc.d also has some practical integration issues. It normally relies
> on putting an unpackaged file in /usr/sbin (unless you have installed
> policyrcd-script-zg2), and it's common for tools like debootstrap and
> debian-installer to create and delete policy-rc.d to suppress service
> startup while carrying out bootstrap operations. One Debian derivative
> that I'm involved in (SteamOS) is *meant* to have a policy-rc.d, but we
> recently discovered that it has always been deleted at the end of the
> debian-installer run, and so doesn't exist in practice.

I think that problem is not something that should be resolved by this

I'll readily admit that I did not actually test any of the suggestions I
made wrt policy-rc.d. There are other issues with it too; I'm thinking
of filing a wishlist bug to have it replaced by something better.

On top of that, policy-rc.d has always irked me as a bit of an awkward
interface; it is the only type of Debian-specific configuration that
does not go into /etc, but for which you need to write a script in
/usr/sbin. This is confusing, as shown by debian-installer removing it
unconditionally. In an ideal world, the policies currently implementable
through policy-rc.d should be configuration snippets in a run-parts
style directory. The "just drop a script somewhere" idea is a
poorly-defined interface which is inflexible and inappropriate for the
purpose of a distribution, but "policy-rc.d should be replaced by
something better" is not an appropriate response to the question "what
should happen when a service fails to restart in postinst".

Also related to this problem is what happens with postinst failing for
other reasons than "the daemon doesn't restart". While that is probably
the most likely reason for postinst failures today, it is by no means
the only one; so if you say "postinst failing because of daemon restart
failing" is something that should not ever happen, I think you should
then also make guidelines as to when, exactly, a postinst should be
allowed to fail (and muck up the whole system).

To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard

Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-17 Thread Simon McVittie
On Tue, 09 Oct 2018 at 20:35:33 +0200, Wouter Verhelst wrote:
> According to "man invoke-rc.d", policy-rc.d can exit with exit state 106
> and provide a number of actions on stdout. These are then actions that
> invoke-rc.d must try in order "until one of them succeeds". As such, a
> policy-rc.d implementation written like so:
> #!/bin/sh
> if [ "$1" = ssh ] # logic error fixed as per subsequent mail
> then
>   exit 0
> fi
> echo "$2 stop"
> exit 106
> would result in the system attempting whatever init script action was
> being asked for, followed by a "stop" action (except in the case of the
> "ssh" service, which must not fail before we close a shell, ever). This
> assumes that a "stop" action when the daemon fails to start will be
> successful

If I'm reading invoke-rc.d correctly, this is implemented (in a cross-init
way), but probably doesn't interact well with the logic that avoids
(re)starting services that are disabled, because that doesn't consider
"restart stop" to match "restart".

Obviously, if I'm right about that limitation, then that's a bug, and
bugs can be fixed. However, it makes me concerned that the exit status
106 thing is not well-understood or well-tested, even by invoke-rc.d

Packages that have systemd units with no corresponding LSB init
script (not necessarily services - timer, socket, path and (auto)mount
units are also units) use deb-systemd-invoke instead of
invoke-rc.d. deb-systemd-invoke doesn't implement the full generality
of the policy-rc.d interface, but only 0, 101 and 104 (in particular
not 106). That would be a reasonable feature request, particularly if
we want to encourage this route, but it isn't currently implemented.

While discussing this on IRC we wondered whether maintainer scripts
that restart services should normally be using an interface that is
analogous to "systemctl try-restart", namely: check whether the service
is running, then restart it if it was. (This can't work for maintainer
scripts that stop the service in prerm and start it in postinst, but
that is no longer the default behaviour in recent debhelper compat
levels.) However, both dh_installinit and dh_installsystemd currently
use plain "restart", so if the service is not running (possibly because
it's already broken), it will usually be started.
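
The try-restart idea described above can be sketched as follows. This is
only an illustration: service_is_running and restart_service are
hypothetical hooks, and their systemctl-based defaults are an assumption,
not what dh_installinit or dh_installsystemd generate.

```sh
#!/bin/sh
# Sketch only. The two hooks below are hypothetical; replace them to match
# the init system in use.

service_is_running() {
    systemctl is-active --quiet "$1"
}

restart_service() {
    systemctl restart "$1"
}

# Restart the service only if it was running beforehand, so that a service
# which was stopped (or already broken) is not started by the upgrade.
try_restart() {
    if service_is_running "$1"; then
        restart_service "$1"
    else
        return 0
    fi
}
```

With plain "restart", the else branch would not exist and the service
would be started regardless of its previous state.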

> With that background, IMHO the proper reply to this question before the
> committee is that yes, postinst scripts should fail when an init script
> fails, but we should also better document the policy-rc.d interface to
> point out that the above is possible and can be done where it makes
> sense.

This would solve Marga's use case with a very large fleet of machines
maintained by a small number of sysadmins: they can install a policy-rc.d
on all those machines that does the right thing.

However, it leaves the default as "fail hard", which I'm not convinced
is the most appropriate thing for systems that lack an experienced
sysadmin (which are the systems where defaults matter most, because an
inexperienced user is the least able to make an informed decision about
where they should deviate from defaults).

policy-rc.d also has some practical integration issues. It normally relies
on putting an unpackaged file in /usr/sbin (unless you have installed
policyrcd-script-zg2), and it's common for tools like debootstrap and
debian-installer to create and delete policy-rc.d to suppress service
startup while carrying out bootstrap operations. One Debian derivative
that I'm involved in (SteamOS) is *meant* to have a policy-rc.d, but we
recently discovered that it has always been deleted at the end of the
debian-installer run, and so doesn't exist in practice.


Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-10 Thread Wouter Verhelst
I must stop writing emails when tired...

On Tue, Oct 09, 2018 at 08:35:33PM +0200, Wouter Verhelst wrote:
> On Tue, Oct 09, 2018 at 10:52:15AM +0200, Wouter Verhelst wrote:
> > - The policy-rc.d interface could be extended to allow it to signal a
> >   "restart, but do not fail on error" kind of policy. This would work
> >   for the "we have thousands of desktops and don't care about a service
> >   failing to restart" kind of environment.
> Wanting to investigate this a bit further, I find that, actually, such a
> possibility already exists.
> According to "man invoke-rc.d", policy-rc.d can exit with exit state 106
> and provide a number of actions on stdout. These are then actions that
> invoke-rc.d must try in order "until one of them succeeds". As such, a
> policy-rc.d implementation written like so:
> #!/bin/sh
> if [ "$1" != ssh ]

That is, of course, a logic inversion. Whoops.

> then
>   exit 0

For clarity, this means that whatever action was requested would be
allowed; and so if things fail they will cause the init script to fail,

> fi
> echo "$2 stop"
> exit 106
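
Since none of this was tested against invoke-rc.d, a first sanity check
is to run the corrected script on its own. The sketch below writes it to
a temporary file and calls it with an initscript ID and an action as
arguments, which is the calling convention the script itself assumes. The
run_policy helper is hypothetical, and nothing here installs
/usr/sbin/policy-rc.d or proves how invoke-rc.d reacts to the result.

```sh
#!/bin/sh
# Sketch only: exercise the corrected policy script in isolation.

tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT

cat > "$tmp" <<'EOF'
#!/bin/sh
if [ "$1" = ssh ]
then
  exit 0
fi
echo "$2 stop"
exit 106
EOF

# Run the script and print "<exit status>:<stdout>".
run_policy() {
    status=0
    out=$(sh "$tmp" "$1" "$2") || status=$?
    echo "$status:$out"
}

run_policy ssh restart      # prints "0:"
run_policy apache2 restart  # prints "106:restart stop"
```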
Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008

Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-09 Thread Wouter Verhelst
On Tue, Oct 09, 2018 at 10:52:15AM +0200, Wouter Verhelst wrote:
> - The policy-rc.d interface could be extended to allow it to signal a
>   "restart, but do not fail on error" kind of policy. This would work
>   for the "we have thousands of desktops and don't care about a service
>   failing to restart" kind of environment.

Wanting to investigate this a bit further, I find that, actually, such a
possibility already exists.

According to "man invoke-rc.d", policy-rc.d can exit with exit state 106
and provide a number of actions on stdout. These are then actions that
invoke-rc.d must try in order "until one of them succeeds". As such, a
policy-rc.d implementation written like so:


#!/bin/sh
if [ "$1" != ssh ]
then
  exit 0
fi
echo "$2 stop"
exit 106

would result in the system attempting whatever init script action was
being asked for, followed by a "stop" action (except in the case of the
"ssh" service, which must not fail before we close a shell, ever). This
assumes that a "stop" action when the daemon fails to start will be
successful; I don't know whether all init scripts in Debian act that
way, but I do think that they should. If they do, then this will
mean that init scripts which fail will not cause general packaging

With that background, IMHO the proper reply to this question before the
committee is that yes, postinst scripts should fail when an init script
fails, but we should also better document the policy-rc.d interface to
point out that the above is possible and can be done where it makes
sense. If long-time Debian Developers (not just me, but also the members
of the committee) do not know well how it works, then clearly it is

(Having said that, I haven't tested any of this, so it is certainly
possible that the implementation does not match the documentation...)


Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-09 Thread Sam Hartman
> "Ian" == Ian Jackson  writes:

Ian>  * If the maintainer has no particular reason to diverge the
Ian> right answer is usually to fail the postinst with init systems
Ian> that do not provide service supervision; but to not fail the
Ian> postinst with ones that do.  (I think from earlier messages
Ian> that this is how the default implementations already work.)

So, it's not really the case that this is the default for init systems
today, and that actually has some important historical significance and
implications for perceived user-facing changes.

It's absolutely been the case that if an init script (init.d lsb script)
fails, the default behavior was to fail the postinst.

However, start-stop-daemon did not detect a lot of failures, especially
after fork.
So, there are all sorts of things that caused daemons to fail to start
that used to not cause postinst failures.

I don't know what the default is today, but certainly for Jessie and for
a lot of the stretch cycle, dh_installinit would fail the postinst
whenever systemctl failed to start or restart a service.

Now, depending on how you wrote your service units, you might get the
same behavior as with sysvinit.  But you probably didn't do that.
So, suddenly, a whole bunch more conditions started showing up as
things that caused postinst to fail.

If somewhere in stretch and with the migration from dh_installinit for
service units to dh_systemd_*, we managed to change the default, then
we're probably reasonably close to what happened in the pre-systemd
days.  And that was reasonably OK.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-09 Thread Ian Jackson
Wouter Verhelst writes ("Re: Bug#904558: What should happen when maintscripts 
fail to restart a service"):
> Perhaps the error handler should also be configurable by policy-rc.d, as
> I hinted to before.

I think this is a key point.  We do not have to make a single decision
which everyone has to be happy with.  We can instead continue to be
all things to all people.

I think the best answer would be:

 * Individual maintainers decide for themselves whether to treat
   service (re)start failure as postinst failure, based on their own
   perception; maintainers may make different decisions for different
   init systems.

 * If the maintainer has no particular reason to diverge the right
   answer is usually to fail the postinst with init systems that do
   not provide service supervision; but to not fail the postinst with
   ones that do.  (I think from earlier messages that this is how the
   default implementations already work.)

 * The administrator should be able to override this policy question
   globally for the whole system, or on a per-package basis.

This is probably a manageable amount of actual work: the prescription
for individual packages is roughly what they do right now.

The support for configuration in something like policy-rc.d has a few
design decisions to be made but doesn't seem really difficult.  Also
nothing blocks on it.  The TC would simply be saying "this would be a
good thing to have".
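
One possible shape for such an override, purely as an illustration: an
error handler that starts from the maintainer's default and lets the
administrator override it globally or per package. The function name
restart_failed and the RESTART_FAILURE_POLICY variables are hypothetical;
nothing like them exists in policy-rc.d or debhelper today.

```sh
#!/bin/sh
# Sketch only: all names here are hypothetical.

# Decide what a failed (re)start means for the postinst.
#   $1: package name
#   $2: maintainer's default, "fail" or "ignore"
# Overrides, most specific first:
#   RESTART_FAILURE_POLICY_<package>  (".", "+" and "-" mapped to "_")
#   RESTART_FAILURE_POLICY            (whole system)
restart_failed() {
    pkg=$1
    default=$2
    var="RESTART_FAILURE_POLICY_$(echo "$pkg" | tr '.+-' '___')"
    eval "per_pkg=\${$var:-}"
    policy=${per_pkg:-${RESTART_FAILURE_POLICY:-$default}}

    case $policy in
        ignore)
            echo "warning: $pkg failed to restart (ignored)" >&2
            return 0
            ;;
        *)
            echo "error: $pkg failed to restart" >&2
            return 1
            ;;
    esac
}
```

A postinst could then end its restart line with
`|| restart_failed mypackage fail || exit 1`, which keeps today's behaviour
unless the administrator has said otherwise.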


Ian Jackson   These opinions are my own.


Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-09 Thread Wouter Verhelst
Hi Simon,

Thanks for your summary.

On Sun, Oct 07, 2018 at 11:49:09AM +0100, Simon McVittie wrote:
> Attempting to summarize what was said on this topic in the thread so
> far, and at the last technical committee meeting:
> It's perhaps important to note that we are not discussing ideal situations
> here: any time this conversation becomes relevant, something is already
> wrong. We're aiming to recommend the lesser evil, rather than something
> actually desirable.
> One of the points of view here is Ian and Wouter's assertion that
> whenever a service fails to restart in a maintainer script, the most
> important thing is to make sure the sysadmin pays attention and fixes
> it before proceeding.
> Julien Cristau made another point in support of "failure to restart
> implies failure to configure" on IRC, namely that the only straightforward
> thing for an automated upgrade to do is to look at the successful or
> failed exit status of the package manager (whether that means dpkg,
> apt, unattended-upgrades or whatever), and assume that exiting 0 means
> everything is fine and exiting nonzero means attention is required.

I think this is the core of the issue: it is incorrect to state that
when a service restart was unsuccessful, everything was fine.
There was a problem. We currently don't have a way to distinguish
between "there was a terrible problem and the sky is going to fall" and
"there was a problem but you might want to ignore it", so technically the
only correct thing to do is to exit with a nonzero exit state,
signalling a problem. Put otherwise, I think that if the following
preconditions are true:

1. The service was running before the package upgrade
2. The package's postinst wants to restart the daemon
3. After the package upgrade, the service fails to start again

Then that means the package upgrade broke something, and the system
administrator should be informed of that fact. We currently have only
one *certain* avenue to inform the system administrator, and that is
through producing a nonzero exit state from apt. A debconf error or
message to stdout or stderr would work too in some cases, but the first
is not always shown and the second might scroll by too fast to be
noticeable, so it is not a certain way to tell the system administrator.
As such, exiting nonzero is the only avenue open to maintainers to do
the right thing.
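
The three preconditions above can be written down as a small check. This
is a sketch under assumptions: service_is_running and restart_service are
hypothetical hooks with systemctl-based defaults, and the function only
reports a failure when the service was running before and is not running
afterwards.

```sh
#!/bin/sh
# Sketch only. The hooks are hypothetical; the systemctl defaults are an
# assumption, not generated maintainer script code.

service_is_running() { systemctl is-active --quiet "$1"; }
restart_service()    { systemctl restart "$1"; }

# Returns nonzero only when all three preconditions hold:
# running before, restart attempted, not running afterwards.
restart_for_upgrade() {
    svc=$1
    was_running=no
    if service_is_running "$svc"; then
        was_running=yes
    fi

    restart_service "$svc" || true   # judged by the end state below

    if [ "$was_running" = yes ] && ! service_is_running "$svc"; then
        echo "$svc was running before the upgrade and is not now" >&2
        return 1
    fi
    return 0
}
```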

Having said all that...

> At the opposite extreme, Marga's team manages thousands of desktops,
> and having to do *anything* manual to any significant number of them
> doesn't scale. We can think of inexperienced users' desktops as a bit
> like this scenario too, except that instead of having a professional
> sysadmin, they have to ask volunteers for help through channels like
> debian-user and #debian (and those volunteers' help doesn't really scale
> well either). It's also undesirable if the mechanism we use to escalate
> the failure to the user is one that itself makes it harder to diagnose or
> fix the problem, and in particular there's a concern that when packages
> fail to configure, that can make it harder to use apt to install the
> necessary tools to diagnose what has gone wrong; Stuart points out that in
> his experience of helping people in #debian, this is a practical problem.

It is true that there is a larger picture, and that in some
environments, breaking all future upgrades is way more problematic than
not restarting a service once. This is arguably a bug in apt though, and
it feels wrong to me to "fix" such an issue by introducing what is
essentially a workaround in multiple unrelated places; if then the
problem gets fixed properly, we would have to go around the whole system
to undo the workarounds again, which would be a sad state of affairs.

I can think of some alternatives that could be done and that would work
towards a resolution (rather than a workaround) for this problem:

- The policy-rc.d interface could be extended to allow it to signal a
  "restart, but do not fail on error" kind of policy. This would work
  for the "we have thousands of desktops and don't care about a service
  failing to restart" kind of environment.
- Apt could be fixed so that when a package fails to configure, it would
  still be impossible to install and/or configure reverse-dependencies
  of the failing package, but not of packages that are unrelated. This
  would help the "users asking in our support channels can't install
  diagnostic tools to investigate" kind of situation.
- A new state could be created in dpkg to signal "configuration failed,
  but package will work for dependencies". When this is the case, apt
  should inform the user that configuration of some package failed and
  that they might want to investigate, but should not refuse to install
  and/or configure other packages, even reverse dependencies of the
  failing package. This feels right, but I can't come up with a good
  example of the kind of situation which this would fix; 

Re: Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-07 Thread Michael Biebl
Am 07.10.18 um 20:55 schrieb Michael Biebl:
> Am 07.10.18 um 20:46 schrieb Michael Biebl:
>> Let me add here, that a lot of sysv init scripts I looked at do not
>> actually return proper error codes in case the service fails to start.
>> Picking a random example, like anacron, I see for the start action:
>> start-stop-daemon --start --exec /usr/sbin/anacron -- -s
>> log_end_msg 0
> And even if the init scripts use "log_end_msg $?"
> most of them do not exit at this point but have an explicit
> "exit 0" at the end of the script [1].
> So while you get a log message on stdout which indicates failure, you
> don't get a return code which would cause dpkg to abort.
> IIRC, this basically was the reason why we used "|| true" in
> dh_systemd_start to mimic the effective dh_installinit behaviour.

Or putting this a different way:
If the decision is that failure to start a service should result in a
dpkg failure, then this in turn means that most of our sysv init scripts
are insta-buggy and need to be fixed to return proper error codes.
Personally I would welcome this to be the case, but keep in mind that we
have around 1200 init scripts in the archive and afaics only very few
actually do return proper exit codes.


Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?


Re: Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-07 Thread Michael Biebl
Am 07.10.18 um 20:46 schrieb Michael Biebl:
> Let me add here, that a lot of sysv init scripts I looked at do not
> actually return proper error codes in case the service fails to start.
> Picking a random example, like anacron, I see for the start action:
> start-stop-daemon --start --exec /usr/sbin/anacron -- -s
> log_end_msg 0

And even if the init scripts use "log_end_msg $?"
most of them do not exit at this point but have an explicit
"exit 0" at the end of the script [1].
So while you get a log message on stdout which indicates failure, you
don't get a return code which would cause dpkg to abort.
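
For illustration, a start action that does propagate the failure could
look roughly like this. It is a self-contained sketch: start_daemon
stands in for the real start-stop-daemon call, and log_end_msg is a
minimal stand-in for the helper of the same name in
/lib/lsb/init-functions, so that the example runs anywhere.

```sh
#!/bin/sh
# Sketch only: both helpers below are simplified stand-ins.

log_end_msg() {
    if [ "$1" -eq 0 ]; then
        echo "done."
    else
        echo "failed!"
    fi
    return "$1"
}

start_daemon() {
    # A real init script would call start-stop-daemon here.
    "$@"
}

do_start() {
    status=0
    start_daemon "$@" || status=$?
    log_end_msg "$status" || true   # report the result ...
    return "$status"                # ... and hand it back, no blanket "exit 0"
}
```

The script's start case would then finish with `exit $?` (or simply let
the status of do_start become the exit status) so that invoke-rc.d, and
through it the postinst, can see the failure.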

IIRC, this basically was the reason why we used "|| true" in
dh_systemd_start to mimic the effective dh_installinit behaviour.


[1] IIRC not even the skeleton file which is shipped for sysvinit does
this correctly and as a result was copied and pasted into numerous packages.


Re: Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-07 Thread Michael Biebl
> * dh_installinit: defaults to "failure to (re)start is failure to
>   configure", but can be overridden with --error-handler; some packages
>   set the error handler to "true" (e.g. apache2, isc-dhcp) or to a custom
>   shell function (e.g. krb5, samba).
>   This is used for LSB init scripts, and for systemd units that have a
>   corresponding LSB init script.
> * dh_systemd_start: unconditionally uses "|| true".
>   This is only used for systemd units that *do not* have a corresponding
>   LSB init script. A dh_installinit-style --error-handler would probably
>   be a reasonable feature request.

Let me add here, that a lot of sysv init scripts I looked at do not
actually return proper error codes in case the service fails to start.
Picking a random example, like anacron, I see for the start action:

start-stop-daemon --start --exec /usr/sbin/anacron -- -s
log_end_msg 0

This seems to be a rather common issue among SysV init scripts to signal
a (wrong) return code of 0 even if the service failed.

systemd is much more consistent in that regard as it will report a
failure code if the service failed to start/stop/restart.


Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-07 Thread Sam Hartman
> "Simon" == Simon McVittie  writes:

Simon> the error path is most important were packages that provide a
Simon> system-level API to other packages, so their failures are
Simon> likely to cause other packages to fail to configure (such as
Simon> local DNS caches and authentication services like LDAP); and
Simon> packages that provide remote access, so their failures need
Simon> to be fixed before a potentially remote sysadmin logs out to
Simon> prevent the sysadmin from being locked out longer-term (like
Simon> sshd).

As a maintainer of one of the more important packages (krb5-kdc and
krb5-admin-server), I'd like to chime in here.  krb5-kdc provides
enterprise level authentication and if it fails may well take out
authentication for an entire environment.

Even so, I've found that causing upgrades to fail does far more harm
than good even for this package.

Here is my experience based on my own observations and based on bug
reports and helping people diagnose problems in krb5:

* The vast majority of failures are when krb5-kdc gets installed on a
  system where it is not actually needed, or where it was partially
  configured for a test.  In these cases, breaking an upgrade does
  much more harm than good.  It may break other services, because those
  services may end up in a half-configured state, so a service that is
  not critical for a given system may break critical services for that

* When krb5 is a critical service, its failure is going to be quite
  obvious regardless of whatever the maint script does.

* It is almost always the case that debugging the situation involves
  installing some package and that the first thing I end up doing is
  walking a user through adding exit 0 at the top of postinst in
  /var/lib/dpkg/info before going forward.  Even if I don't need some
  additional tool, I've been burned by other parts of the system being
  in half-configured state.

* Leaving large chunks of the system in half-configured states is about
  one of the worst things you can do for system stability.  It's not
  something we test very often, and the interactions are very difficult
  to predict.
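
The "exit 0 at the top of postinst" workaround mentioned above can be
sketched as a small helper. The name neutralise_postinst is hypothetical;
the helper keeps a .orig backup and inserts the exit right after the first
line (normally the "#!" line), so the rest of the script never runs.

```sh
#!/bin/sh
# Sketch only: neutralise_postinst is a hypothetical helper.

neutralise_postinst() {
    script=$1
    cp -p "$script" "$script.orig" || return 1
    tmp=$(mktemp) || return 1
    # Keep line 1 where it is, then stop before any real work is done.
    awk 'NR == 1 { print; print "exit 0"; next } { print }' \
        "$script.orig" > "$tmp" || return 1
    # Overwrite in place so the file keeps its mode and owner.
    cat "$tmp" > "$script" || return 1
    rm -f "$tmp"
}
```

It would be pointed at the failing package's script, for example
/var/lib/dpkg/info/PACKAGE.postinst, after which `dpkg --configure -a` can
get the rest of the system out of its half-configured state. The backup
makes it possible to restore the original script once the real problem is
understood.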

If I understood the cause of an error in a maintainer script and knew
that it indicated a problem that the sysadmin needed to fix (and one
that likely indicated krb5 was important on this system) I would be open
to returning a failure in postinst.
In almost all other situations I'd rather simply let the service fail to


Bug#904558: What should happen when maintscripts fail to restart a service

2018-10-07 Thread Simon McVittie
Attempting to summarize what was said on this topic in the thread so
far, and at the last technical committee meeting:

It's perhaps important to note that we are not discussing ideal situations
here: any time this conversation becomes relevant, something is already
wrong. We're aiming to recommend the lesser evil, rather than something
actually desirable.

One of the points of view here is Ian and Wouter's assertion that
whenever a service fails to restart in a maintainer script, the most
important thing is to make sure the sysadmin pays attention and fixes
it before proceeding.

Julien Cristau made another point in support of "failure to restart
implies failure to configure" on IRC, namely that the only straightforward
thing for an automated upgrade to do is to look at the successful or
failed exit status of the package manager (whether that means dpkg,
apt, unattended-upgrades or whatever), and assume that exiting 0 means
everything is fine and exiting nonzero means attention is required.
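
The check Julien describes amounts to very little code, which is part of
its appeal for automation. A minimal sketch, with check_upgrade as a
hypothetical wrapper around whatever package manager command is in use:

```sh
#!/bin/sh
# Sketch only: check_upgrade is a hypothetical wrapper.

# Run the given command; treat exit status 0 as "fine" and anything else
# as "needs attention", passing the status on to the caller.
check_upgrade() {
    if "$@"; then
        echo "upgrade ok"
    else
        status=$?
        echo "attention required (exit status $status)"
        return "$status"
    fi
}
```

A fleet management tool might call it as
`check_upgrade apt-get -y dist-upgrade` and flag the machine when it
returns nonzero. Anything that does not affect the exit status is
invisible to a check like this.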

At the opposite extreme, Marga's team manages thousands of desktops,
and having to do *anything* manual to any significant number of them
doesn't scale. We can think of inexperienced users' desktops as a bit
like this scenario too, except that instead of having a professional
sysadmin, they have to ask volunteers for help through channels like
debian-user and #debian (and those volunteers' help doesn't really scale
well either). It's also undesirable if the mechanism we use to escalate
the failure to the user is one that itself makes it harder to diagnose or
fix the problem, and in particular there's a concern that when packages
fail to configure, that can make it harder to use apt to install the
necessary tools to diagnose what has gone wrong; Stuart points out that in
his experience of helping people in #debian, this is a practical problem.

Ian considers it to be a design flaw in apt that the actions the user
can take while a package is unconfigured are so constrained; however,
we work with the tools we have, not the tools we'd like to have.

We seem to have consensus among the technical committee that it is at
least occasionally appropriate for failure to restart to cause failure
to configure, although this might be the exception rather than the
rule. The examples given where the error path is most important were
packages that provide a system-level API to other packages, so their
failures are likely to cause other packages to fail to configure (such
as local DNS caches and authentication services like LDAP); and packages
that provide remote access, so their failures need to be fixed before a
potentially remote sysadmin logs out to prevent the sysadmin from being
locked out longer-term (like sshd).

I'm not sure whether we have a concrete example yet of packages at the
opposite extreme, that are the least important to be able to restart. I'd
like to propose the game servers that I maintain, like openarena-server,
as a concrete example here: I hope we can agree that inability to capture
the flag does not justify getting the package management system into a
problematic state? :-) (I think this is currently a bug in those packages,
but I'm not going to fix it until we have consensus here.)

There's a general feeling among the technical committee that a package
failing to configure is far from a user-friendly way to signal errors:
Phil's memorable analogy was that it's like telling a car driver that they
are low on fuel by having the wheels fall off. Historically, we had few
other ways to manage service failures, and perhaps when all you have is
a hammer, everything looks like the Failed-Config state; but in a default
Debian installation we now have a service manager that monitors the state
of all services at all times (not just when they happen to be upgraded)
and collects their stderr at all times (not just writing it to the console
during boot, and dpkg's stderr during upgrades). Even before we considered
non-sysv init systems, monitoring systems like Nagios were available.

It's perhaps also worth noting that most services, if they fail during
boot rather than during upgrade, don't cause a drastic reaction.
Historically, initscripts would (attempt to) carry on regardless from
just about any failure mode, including failure of services that ought to
be considered critical-path. With systemd as default, our default init
system does have a more dramatic response to certain failures (going
to an emergency-mode shell), but it only does that for a very limited
subset of services (fsck and mount on required filesystems, according to
the man page).

As Anthony points out, we could benefit from there being a way
for packages to report "something is wrong, but carry on anyway":
continuing to get the system into the least-degraded state possible,
but then arranging for dpkg/apt to exit with a nonzero status so that
automated systems can detect that something is not right. However,
this mechanism does not currently exist. One 

Re: Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-22 Thread Anthony DeRobertis
Someone asked for an example, here is one I've often seen when doing a 
release upgrade on many webservers I administer: Apache will fail to 
start. I don't recall if that currently causes Apache postinst to fail, 
but if not, it really ought to continue.

Apache has a complicated config, and upstream makes 
backwards-incompatible changes often enough that every Debian release 
seems to have some. It's often not possible to automatically update the 
config (and even if it were, the variety of configuration management 
systems in use mean you wouldn't want that to happen automatically). 
It's much easier to fix after the upgrade. And to the extent anything 
depends on Apache, Apache being completely broken doesn't generally 
break them (unless they try to restart apache themselves, e.g., apache 

Now, if my local DNS cache failed to start, that needs to be fixed 
before continuing (since, e.g., even apt-get won't work). Same with an 
LDAP (etc.) server, you may no longer have user accounts. Some things 
definitely lead to a cascade of failures.

I think in an ideal world, there would be two separate failure states 
for postinst: one for failed but probably safe to continue the upgrade, 
one for failed and probably going to cause a cascade of failures (or 
worse). dpkg (and the various frontends) would let you know about 
fail-but-continue errors after finishing, and maybe before starting, but 
still continue to work.
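
A minimal sketch of what such a two-state scheme might look like from a
frontend's point of view follows. The exit status 75 for "failed but safe
to continue" is purely an assumption for illustration: dpkg today treats
every non-zero postinst status as an error, and the function names are
hypothetical.

```sh
#!/bin/sh
# Hypothetical convention, NOT implemented by dpkg: exit status 75 means
# "failed, but probably safe to continue"; any other non-zero status is a
# hard failure.
SOFT_FAIL=75

# Map an exit status to one of: ok, soft-fail, hard-fail.
classify_status() {
    case "$1" in
        0) echo ok ;;
        "$SOFT_FAIL") echo soft-fail ;;
        *) echo hard-fail ;;
    esac
}

# Run each command name given as an argument. Keep going after soft
# failures, stop at the first hard failure, and report soft failures at
# the end with a non-zero status so automated systems can notice.
run_all() {
    soft=""
    for script in "$@"; do
        status=0
        "$script" || status=$?
        case "$(classify_status "$status")" in
            soft-fail) soft="$soft $script" ;;
            hard-fail) echo "hard failure: $script" >&2; return 1 ;;
        esac
    done
    if [ -n "$soft" ]; then
        echo "completed with soft failures:$soft" >&2
        return 2
    fi
    return 0
}
```

The final non-zero status after soft failures is what would let tools
detect that something went wrong without the run having been aborted.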

At least for daemon failed to start and with systemd, we already can 
have pretty close: have the postinst ignore the failed to start error 
(when it's of the safe to continue the upgrade variety), then use 
`systemctl --failed` to get the list of daemons that failed to start.
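
As a sketch of that approach, assuming a systemd host: the helper below
extracts unit names from the listing and reports them. The function names
are made up for this example; the --no-legend and --plain options are used
so that the unit name is the first field on each line.

```sh
#!/bin/sh
# Print the names of failed units, one per line. Reads the output of
# "systemctl --failed --no-legend --plain" on standard input, where the
# unit name is the first whitespace-separated field.
failed_units() {
    awk 'NF > 0 { print $1 }'
}

# Report failed units after an upgrade; returns non-zero if there are any.
# The arguments are the command that produces the listing (kept as a
# parameter so the function can be exercised without systemd).
report_failed() {
    units=$("$@" | failed_units)
    if [ -n "$units" ]; then
        echo "units in failed state after upgrade:" >&2
        echo "$units" >&2
        return 1
    fi
    return 0
}

# Typical use on a systemd host (commented out so the sketch is inert):
# report_failed systemctl --failed --no-legend --plain
```

Note that this reports every failed unit on the system, not only those
touched by the upgrade, so a unit that was already failing beforehand
would show up as well.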

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-22 Thread Wouter Verhelst
On Sat, Sep 22, 2018 at 09:50:11AM +0200, Wouter Verhelst wrote:
> Nobody is arguing that if the init system or policy-rc.d block service
> starts, that then postinst should silently not start the daemon.

That should read: 

Nobody is arguing that if the init system or policy-rc.d block service
starts, that then postinst should fail for not starting the daemon.

I'll go get some IV with caffeine now ;-)

Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-22 Thread Wouter Verhelst
Hi Tollef,

On Fri, Sep 21, 2018 at 09:53:13PM +0200, Tollef Fog Heen wrote:
> ]] Wouter Verhelst 
> > On Tue, Sep 18, 2018 at 10:04:26PM +0200, Tollef Fog Heen wrote:
> [...]
> > > The API provided by a package being in the configured state is not
> > > whether the relevant daemon is running or not; that is runtime and can
> > > and will change many times while the package is in the configured state,
> > > so dpkg dependencies are not useful for expressing «this service must be
> > > running».
> > 
> > No. But it *is* a useful way to express "this service must be able to
> > run".
> That's not what «configured» means, though.


> «apt install foo ; rm /etc/foo.conf» and the package will be in a «running,
> but can't restart» state, but also configured in dpkg terms.

Well, sure, but that's true for any kind of configuration, and is not
specific to daemons: if you blow away a package's configuration, all
bets are off, so I fail to see your point.

The point is not "what happens after the install run has happened"; it
is about finding problems early rather than late.

> > Additionally, if something fails to restart, then that is a serious
> > problem that I, as a system administrator, would like to know about.
> > Failure to configure a package signals that there is a serious problem
> > that I need to fix, so that informs me.
> I think monitoring should be implemented using monitoring tools, so if
> you actually care if a service is up, you should monitor it rather than
> relying on postinsts failing or succeeding.

First, the fact that there are tools to deal with this externally from
dpkg shouldn't mean that dpkg itself can't deal with it.

Second, if I manually upgrade something and postinst fails, I know
immediately that something is wrong; in contrast, if I upgrade something
but postinst does not fail, and then I have to rely on monitoring to
notify me, it may take a while before I notice something is wrong,
because monitoring tools often only tell me after a few minutes.

Third, the person who performs the upgrade is not necessarily the same
person as the one who notices something is wrong on the monitoring
system; the lack of immediate feedback that the upgrade broke things
will make debugging and fixing the problem more involved than it should
be.

I think "there are tools to do X" is a terrible argument for "postinst
shouldn't do X".

> Alternatively, you could just add «systemctl is-system-running» to a
> post-dpkg-invoke hook, it'll tell you if there are daemons that have
> failed.

The fact that I can do something to fix the fact that someone (you?)
broke reasonable expectations isn't an excuse for breaking those
reasonable expectations in the first place.
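
For reference, the hook suggested in the quoted text could be sketched
roughly as below. The script path and the name check_state are assumptions
for illustration only; DPkg::Post-Invoke is apt's hook that runs after dpkg
has been invoked, and the state words are those printed by
"systemctl is-system-running".

```sh
#!/bin/sh
# Hypothetical helper, e.g. /usr/local/sbin/check-system-state, that could
# be called from an apt hook such as (in a file under /etc/apt/apt.conf.d/):
#   DPkg::Post-Invoke { "/usr/local/sbin/check-system-state || true"; };
# The path and file name are assumptions for illustration.

# Interpret the state word printed by "systemctl is-system-running".
# Returns 0 for a healthy or still-starting system, 1 otherwise.
check_state() {
    case "$1" in
        running|initializing|starting) return 0 ;;
        *) echo "system state is '$1'; see 'systemctl --failed'" >&2
           return 1 ;;
    esac
}

# On a real system:
# check_state "$(systemctl is-system-running)"
```

This only notifies after the fact; it does not address the objection that
the feedback should come from the package installation itself.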

> [...]
> > There are really only two[1] reasons why a daemon could fail to restart:
> > 
> > - The maintainer made a mistake in the default configuration, and the
> >   user didn't make any changes so the old conffiles are being replaced
> >   by the new ones, or the package is being newly installed; now the
> >   daemon encounters a syntax error. This is a bug, plain and simple, and
> >   catching bugs earlier rather than later is a good idea, which will
> >   happen if the daemon restart failure causes a postinst failure.
> > - The maintainer made no mistake, but the upgrading user made some local
> >   changes, so the conffile system ensures that the syntactic differences
> >   in the configuration are not incorporated and the daemon fails to
> >   restart. As a system administrator, I would want to know when
> >   something like that happens sooner rather than later, so that I can
> >   fix it (also sooner rather than later). Failing to finish postinst
> >   correctly ensures that that does happen.
> In addition to this: Any number of runtime problems.  The disk might be
> full.  The service might try to look up a user whose name is in LDAP and
> the network is down and thus the user lookup fails.  Some hardware the
> service needs is not plugged in or doesn't work correctly.  Data files
> are corrupted.  Out of memory.  I'm sure you can come up with more. :-)

Well, yeah, and I like it if dpkg gives me an error when I try to
install something and, say, the disk is full.

> This then also ties into what the semantics of «daemon is started»
> should be: is it that the service has started, or that it is working?
> What should happen if you, on a host with no network connectivity (or
> just heavily firewalled), do «apt install ntp»?  Should it wait until
> the clock is synced (effectively forever in this case?  Should the
> postinst fail until you've fixed the firewall?)?

If the daemon is running and it would work as soon as it can reach the
internet? No, it should continue.

If the daemon is failing to start because of, say, mandatory access
control not being configured yet? Yes, in that case it should fail,
because that is a dependency bug, and we want to know about it.

> > [1] There is also the 

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-22 Thread Wouter Verhelst
On Fri, Sep 21, 2018 at 10:07:31PM +0200, Tollef Fog Heen wrote:
> ]] Ian Jackson 
> > Tollef Fog Heen writes ("Bug#904558: What should happen when maintscripts 
> > fail to restart  a service"):
> [...]
> > > This means that failure to start a daemon should generally not cause the
> > > postinst to fail.
> > 
> > ... I disagree with that.  I think that in the usual case, if the
> > daemon is broken, and the package's purpose is to provide that daemon
> > service, then the package probably isn't providing its API.
> I don't think dpkg relationships are a good fit for expressing those
> kinds of statements.  They are not about in-memory and process state
> management, they're about what's on disk.

The point here is that failure to restart the daemon is a *symptom* of
breakage of the *on-disk state*, so we're really arguing the same thing?

> > I disagree with this.
> > 
> > dpkg dependencies are not just about what sets of packages can be
> > coinstalled.  They also imply sequencing of package setup.  And since
> > starting daemons is part of package setup, dpkg dependencies imply a
> > sequencing of daemon startup.
> If you include the word «attempted» there, I might agree.  policy-rc.d
> for instance enters the picture here.  Blacklisting in the init system
> does as well, probably others too.  The landscape is pretty crowded with
> actors.

Nobody is arguing that if the init system or policy-rc.d block service
starts, that then postinst should silently not start the daemon.

However, in the absence of such things, if postinst fails to restart the
daemon, it knows something is wrong. "something is wrong" should not
happen after package upgrade; if it did, we failed. "failed" means
postinst should not exit successfully.

> > That is actually necessary in the case where the startup of daemon B
> > can only be successfully completed if daemon A is up,
> That's the job of the init system's dependency resolution mechanisms,
> not dpkg's.  dpkg does not have information about what is running and so
> can't do this.  Ordering is also separate from dependencies, at least
> for some init systems.

Some init system dependency resolution mechanisms also only work
properly once all the packages involved have been configured, so that's
not a valid point.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-21 Thread Tollef Fog Heen
]] Ian Jackson 

> Tollef Fog Heen writes ("Bug#904558: What should happen when maintscripts 
> fail to restart  a service"):


> > This means that failure to start a daemon should generally not cause the
> > postinst to fail.
> ... I disagree with that.  I think that in the usual case, if the
> daemon is broken, and the package's purpose is to provide that daemon
> service, then the package probably isn't providing its API.

I don't think dpkg relationships are a good fit for expressing those
kinds of statements.  They are not about in-memory and process state
management, they're about what's on disk.


> Also:
> > The API provided by a package being in the configured state is not
> > whether the relevant daemon is running or not; that is runtime and can
> > and will change many times while the package is in the configured state,
> > so dpkg dependencies are not useful for expressing "this service must be
> > running".
> I disagree with this.
> dpkg dependencies are not just about what sets of packages can be
> coinstalled.  They also imply sequencing of package setup.  And since
> starting daemons is part of package setup, dpkg dependencies imply a
> sequencing of daemon startup.

If you include the word «attempted» there, I might agree.  policy-rc.d
for instance enters the picture here.  Blacklisting in the init system
does as well, probably others too.  The landscape is pretty crowded with
actors.

> That is actually necessary in the case where the startup of daemon B
> can only be successfully completed if daemon A is up,

That's the job of the init system's dependency resolution mechanisms,
not dpkg's.  dpkg does not have information about what is running and so
can't do this.  Ordering is also separate from dependencies, at least
for some init systems.

Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-21 Thread Tollef Fog Heen
]] Wouter Verhelst 

> On Tue, Sep 18, 2018 at 10:04:26PM +0200, Tollef Fog Heen wrote:


> > The API provided by a package being in the configured state is not
> > whether the relevant daemon is running or not; that is runtime and can
> > and will change many times while the package is in the configured state,
> > so dpkg dependencies are not useful for expressing «this service must be
> > running».
> No. But it *is* a useful way to express "this service must be able to
> run".

That's not what «configured» means, though.  «apt install foo ; rm
/etc/foo.conf» and the package will be in a «running, but can't restart»
state, but also configured in dpkg terms.

> Additionally, if something fails to restart, then that is a serious
> problem that I, as a system administrator, would like to know about.
> Failure to configure a package signals that there is a serious problem
> that I need to fix, so that informs me.

I think monitoring should be implemented using monitoring tools, so if
you actually care if a service is up, you should monitor it rather than
relying on postinsts failing or succeeding.

Alternatively, you could just add «systemctl is-system-running» to a
post-dpkg-invoke hook, it'll tell you if there are daemons that have
failed.


> There are really only two[1] reasons why a daemon could fail to restart:
> - The maintainer made a mistake in the default configuration, and the
>   user didn't make any changes so the old conffiles are being replaced
>   by the new ones, or the package is being newly installed; now the
>   daemon encounters a syntax error. This is a bug, plain and simple, and
>   catching bugs earlier rather than later is a good idea, which will
>   happen if the daemon restart failure causes a postinst failure.
> - The maintainer made no mistake, but the upgrading user made some local
>   changes, so the conffile system ensures that the syntactic differences
>   in the configuration are not incorporated and the daemon fails to
>   restart. As a system administrator, I would want to know when
>   something like that happens sooner rather than later, so that I can
>   fix it (also sooner rather than later). Failing to finish postinst
>   correctly ensures that that does happen.

In addition to this: Any number of runtime problems.  The disk might be
full.  The service might try to look up a user whose name is in LDAP and
the network is down and thus the user lookup fails.  Some hardware the
service needs is not plugged in or doesn't work correctly.  Data files
are corrupted.  Out of memory.  I'm sure you can come up with more. :-)

This then also ties into what the semantics of «daemon is started»
should be: is it that the service has started, or that it is working?
What should happen if you, on a host with no network connectivity (or
just heavily firewalled), do «apt install ntp»?  Should it wait until
the clock is synced (effectively forever in this case?  Should the
postinst fail until you've fixed the firewall?)?

> [1] There is also the possibility of "the package ships with incomplete
> configuration on purpose, because there are no sane defaults to use
> and installing the package requires manual steps from the maintainer
> before it can be made to work", but (a) our best practices recommend
> against doing that if at all possible, and (b) in that case starting
> the daemon shouldn't even be attempted from postinst, and so failure
> to start can't be a consideration in the exit state of postinst.

You might still want to restart it on upgrade to ensure you don't run
outdated binaries.

Tollef Fog Heen

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-19 Thread Ian Jackson
Stuart Prescott writes ("Bug#904558: What should happen when maintscripts fail 
to restart  a service"):
> Ian Jackson wrote:
> > When I wrote that, it didn't occur to me that anyone would think that
> > a failure by a postinst script to perform an intended operation should
> > be treated any other way than a failure of the postinst script.
> That was perhaps also written before we started to realise that maintainer 
> scripts are actually best avoided

I don't think that makes any difference.

Whether things are implemented by handcoded code in postinst, or
dh-generated templatey postinst, or some kind of declarative system,
is important for manageability of our codebase etc. etc.

But it doesn't have any bearing on what the error handling should be
like.  Any kind of declarative or automatic system or whatever ought
to have similar error handling: failure to perform an intended
function is an error and should not be ignored.
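
To make the behaviour under discussion concrete, here is a small sketch of
how "set -e" turns a failed step into a failed script, and how "|| true"
suppresses that. The helper name run_with_set_e is made up for this
example; it runs a script body in a child shell so the effect can be
observed safely.

```sh
#!/bin/sh
# Run a tiny script body under "set -e" (sh -e) in a child shell and print
# the child's exit status, so the effect can be observed without aborting
# the calling shell.
run_with_set_e() {
    status=0
    sh -e -c "$1" || status=$?
    echo "$status"
}

# A failing step aborts the script, so later steps never run and the
# script's exit status is that of the failing step:
#   run_with_set_e 'false; echo "not reached"'
# Explicitly ignoring the failure lets the script carry on and exit 0:
#   run_with_set_e 'false || true'
```

In a postinst, the first form is what makes a failed restart command fail
the configure step; the second form is what "ignoring the error" amounts to.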

See for example the handling of errors which occur during trigger

One of the things that I am most proud of in dpkg is the comprehensive
and thoughtful error behaviours.

> > If the postinst fails, then the user has the opportunity to fix the
> > root cause and rerun dpkg --configure --pending.  That will
> > then repair the system completely.
> … causing a snowball of errors in an awkward half-upgraded
> environment is nasty.
> The problem comes when you don't yet have the right tools installed to be 
> able to fix the problem. We see that scenario often enough in #debian where 
> someone has a failed upgrade and we try to collect more information via 
> pastebinit, strace, traceroute, netcat, gdb, etc; we frequently discover 
> that the relevant tool isn't installed and because apt is sufficiently 
> unhappy about broken packages and a half-completed upgrade, you can't ask it 
> to install the tool at that point in time.

This is a bug in apt, plain and simple.

Of course it is a design error, but that does not make it any less of a bug.
There is nothing conceptually incoherent in installing strace while
cupsd and its dependencies are broken.  dpkg will happily do it.

I agree that in the absence of a fix to this, some workarounds would
be good.  Perhaps
  dpkg --configure --force-postinst-fail broken-package

> In the upgrade scenario, while you're trying to fix one particular
> problem, you're also in a completely untested half-upgraded
> situation and so latent bugs in any number of other tools may also
> be exposed.

dpkg is designed so that it is in general only the _configuration_ of
other packages which is blocked, not their actual upgrade.  So
hopefully you should be in a reasonably coherent state.

> So while ignoring errors is wrong, so is making it harder to fix them. This 
> isn't a question of absolutes.

As I say I think it is a bug in apt that when you have an error, apt
makes it hard to fix the error by insisting that you can't do anything
(even install diagnosis tools) until you have fixed the error (which
you can't do).


Ian Jackson   These opinions are my own.

If I emailed you from an address or, that is
a private address which bypasses my fierce spamfilter.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-19 Thread Ian Jackson
Tollef Fog Heen writes ("Bug#904558: What should happen when maintscripts fail 
to restart  a service"):
> Ian Jackson:
> > There may be good reasons not to treat daemon startup failure as a
> > postinst failure, but the argument above is not one of them.
> I think this is the core question.  I largely agree with Ian here that
> having postinsts fail is not that big a deal if they can't make forward
> progress, but also we're being asked to advise on what happens when a
> maintainer script fails to restart a service.  I disagree with him on
> whether failure to start/restart a service should be considered a
> configuration failure.

I think whether it is a configuration failure depends on ...

> I think the general rule should be that the success/failure of the
> postinst script should signal whether the package considers itself ready
> to provide whatever API it exists to provide (disregarding the case of
> Essential packages here, since those are special).

... that.  I think I'm in agreement with you on that.  But ...

> This means that failure to start a daemon should generally not cause the
> postinst to fail.

... I disagree with that.  I think that in the usual case, if the
daemon is broken, and the package's purpose is to provide that daemon
service, then the package probably isn't providing its API.

Maybe part of the difficulty we are having with this conversation is
that we are lacking in examples.  This bug and the "parents" #780403
and #802501 are all entirely abstract.

Would someone care to give some examples of packages with both
behaviours?


> The API provided by a package being in the configured state is not
> whether the relevant daemon is running or not; that is runtime and can
> and will change many times while the package is in the configured state,
> so dpkg dependencies are not useful for expressing "this service must be
> running".

I disagree with this.

dpkg dependencies are not just about what sets of packages can be
coinstalled.  They also imply sequencing of package setup.  And since
starting daemons is part of package setup, dpkg dependencies imply a
sequencing of daemon startup.

That is actually necessary in the case where the startup of daemon B
can only be successfully completed if daemon A is up,

>  (There's also the case where the service is running on a
> separate host, which is often the case for services such as databases
> and where the use of Depends is inappropriate.)

In that case, there would be a Recommends or Suggests instead, I would
have thought.


Ian Jackson   These opinions are my own.


Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-19 Thread Wouter Verhelst
On Tue, Sep 18, 2018 at 10:04:26PM +0200, Tollef Fog Heen wrote:
> ]] Ian Jackson 
> Hi,
> > There may be good reasons not to treat daemon startup failure as a
> > postinst failure, but the argument above is not one of them.
> I think this is the core question.  I largely agree with Ian here that
> having postinsts fail is not that big a deal if they can't make forward
> progress, but also we're being asked to advise on what happens when a
> maintainer script fails to restart a service.  I disagree with him on
> whether failure to start/restart a service should be considered a
> configuration failure.

I'm not sure why that position is even being considered valid.

> The API provided by a package being in the configured state is not
> whether the relevant daemon is running or not; that is runtime and can
> and will change many times while the package is in the configured state,
> so dpkg dependencies are not useful for expressing «this service must be
> running».

No. But it *is* a useful way to express "this service must be able to
run".

Additionally, if something fails to restart, then that is a serious
problem that I, as a system administrator, would like to know about.
Failure to configure a package signals that there is a serious problem
that I need to fix, so that informs me.

> (There's also the case where the service is running on a
> separate host, which is often the case for services such as databases
> and where the use of Depends is inappropriate.)
> I think the general rule should be that the success/failure of the
> postinst script should signal whether the package considers itself ready
> to provide whatever API it exists to provide (disregarding the case of
> Essential packages here, since those are special).
> This means that failure to start a daemon should generally not cause the
> postinst to fail.

I think it should.

If the daemon fails to restart, that means its configuration is
incomplete or incorrect, which means the package failed to configure
correctly. The failure to restart is just a symptom; the actual problem
is the broken configuration, which may have further effects beyond just
"the daemon won't restart". As such, in the general case, I think
failure to restart is something that should cause failure to configure.

There are really only two[1] reasons why a daemon could fail to restart:

- The maintainer made a mistake in the default configuration, and the
  user didn't make any changes so the old conffiles are being replaced
  by the new ones, or the package is being newly installed; now the
  daemon encounters a syntax error. This is a bug, plain and simple, and
  catching bugs earlier rather than later is a good idea, which will
  happen if the daemon restart failure causes a postinst failure.
- The maintainer made no mistake, but the upgrading user made some local
  changes, so the conffile system ensures that the syntactic differences
  in the configuration are not incorporated and the daemon fails to
  restart. As a system administrator, I would want to know when
  something like that happens sooner rather than later, so that I can
  fix it (also sooner rather than later). Failing to finish postinst
  correctly ensures that that does happen.

This is now being countered by "but some people use tools that don't
show failures to system administrators", from which the (wrong)
conclusion is drawn "so we shouldn't fail anymore". It would be awesome
if we lived in a world where we could avoid bugs in code and thus avoid
all possible failures, but alas, we don't. So, given that failures
*will* happen, even if we don't fail when daemons fail to restart, the
correct conclusion would be "so those tools should be fixed to do their
utter best to inform the system administrator when something failed".
When those tools do that, failure to restart a service is no longer a
problem for them, and we can continue to do the right thing.

[1] There is also the possibility of "the package ships with incomplete
configuration on purpose, because there are no sane defaults to use
and installing the package requires manual steps from the maintainer
before it can be made to work", but (a) our best practices recommend
against doing that if at all possible, and (b) in that case starting
the daemon shouldn't even be attempted from postinst, and so failure
to start can't be a consideration in the exit state of postinst.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-18 Thread Gunnar Wolf
Stuart Prescott said [Wed, Sep 19, 2018 at 12:18:24PM +1000]:
> (...)
> That was perhaps also written before we started to realise that maintainer 
> scripts are actually best avoided as they tend to be complicated, fragile, 
> difficult to do right and make upgrades harder for the package manager. In 
> the intervening two decades, we've gone from "maintainer scripts are cool" 
> to "the best maintainer script is the one that doesn't exist".
> So yes, ignoring errors seems wrong but…
> (...)
> … causing a snowball of errors in an awkward half-upgraded environment is 
> nasty.
> The problem comes when you don't yet have the right tools installed to be 
> able to fix the problem. We see that scenario often enough in #debian where 
> someone has a failed upgrade and we try to collect more information via 
> pastebinit, strace, traceroute, netcat, gdb, etc; we frequently discover 
> that the relevant tool isn't installed and because apt is sufficiently 
> unhappy about broken packages and a half-completed upgrade, you can't ask it 
> to install the tool at that point in time.
> In the upgrade scenario, while you're trying to fix one particular problem, 
> you're also in a completely untested half-upgraded situation and so latent 
> bugs in any number of other tools may also be exposed.
> So while ignoring errors is wrong, so is making it harder to fix them. This 
> isn't a question of absolutes.

I completely agree with Stuart here. Yes, of course, there is a reason
for maintainer scripts to exist, and if they fail to set up things
around the package, of course, the user _needs_ to know something is
off in their system.

But that should happen _very_ seldom. As Stuart says, helping
non-technical users out of this situation can be quite hard, and quite
discouraging for the user. We have to make sure the scripts are as
foolproof as possible, and failing to stop or restart a daemon
should _never_ cause the system to enter such a state.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-18 Thread Stuart Prescott
Ian Jackson wrote:
>> I personally think that it would make sense for the policy to at least
>> recommend what should happen with regards to maintainer scripts and
>> typical operations that are performed in them.
> There is already a section on error handling in scripts, which (IMO
> correctly) says that shell scripts should use set -e.
> When I wrote that, it didn't occur to me that anyone would think that
> a failure by a postinst script to perform an intended operation should
> be treated any other way than a failure of the postinst script.

That was perhaps also written before we started to realise that maintainer 
scripts are actually best avoided as they tend to be complicated, fragile, 
difficult to do right and make upgrades harder for the package manager. In 
the intervening two decades, we've gone from "maintainer scripts are cool" 
to "the best maintainer script is the one that doesn't exist".

So yes, ignoring errors seems wrong but…

>> And, while I'm open to be convinced otherwise, I don't see any benefit
>> from postinst (particularly postinst + configure) ever failing.
> Frankly I'm disturbed to be reading this, here.  See above.
> If the postinst fails, then the user has the opportunity to fix the
> root cause and rerun dpkg --configure --pending.  That will
> then repair the system completely.

… causing a snowball of errors in an awkward half-upgraded environment is 
nasty.

The problem comes when you don't yet have the right tools installed to be 
able to fix the problem. We see that scenario often enough in #debian where 
someone has a failed upgrade and we try to collect more information via 
pastebinit, strace, traceroute, netcat, gdb, etc; we frequently discover 
that the relevant tool isn't installed and because apt is sufficiently 
unhappy about broken packages and a half-completed upgrade, you can't ask it 
to install the tool at that point in time.

In the upgrade scenario, while you're trying to fix one particular problem, 
you're also in a completely untested half-upgraded situation and so latent 
bugs in any number of other tools may also be exposed.

So while ignoring errors is wrong, so is making it harder to fix them. This 
isn't a question of absolutes.


Stuart Prescott
Debian Developer
GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-18 Thread Tollef Fog Heen
]] Ian Jackson 


> There may be good reasons not to treat daemon startup failure as a
> postinst failure, but the argument above is not one of them.

I think this is the core question.  I largely agree with Ian here that
having postinsts fail is not that big a deal if they can't make forward
progress, but also we're being asked to advise on what happens when a
maintainer script fails to restart a service.  I disagree with him on
whether failure to start/restart a service should be considered a
configuration failure.

The API provided by a package being in the configured state is not
whether the relevant daemon is running or not; that is runtime and can
and will change many times while the package is in the configured state,
so dpkg dependencies are not useful for expressing «this service must be
running».  (There's also the case where the service is running on a
separate host, which is often the case for services such as databases
and where the use of Depends is inappropriate.)

I think the general rule should be that the success/failure of the
postinst script should signal whether the package considers itself ready
to provide whatever API it exists to provide (disregarding the case of
Essential packages here, since those are special).

This means that failure to start a daemon should generally not cause the
postinst to fail.  At the same time, I think there are exceptions to
this rule that should be left to maintainer judgement: sshd comes to
mind as a service where if it can't restart, you want the system to make
it very clear that something is wrong that you might want to fix sooner
rather than later (since failure to do so can lead to you not being able
to access it after a reboot).
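
The rule of thumb above can be written down as a small decision function. This is only a sketch of the reasoning, not anything Policy or dpkg defines: the function name and its three parameters are hypothetical, and deciding which services count as critical (sshd being the example given) is left to maintainer judgement.

```python
import sys


def postinst_exit_status(api_ready, daemon_started, restart_is_critical=False):
    """Return the exit status a postinst would use under this rule of thumb.

    api_ready: the package can provide the API it exists to provide.
    daemon_started: the service (re)start succeeded.
    restart_is_critical: hypothetical maintainer-chosen flag for services
        such as sshd, where a failed restart should be made very visible.
    """
    if not api_ready:
        # The package is not usable, so configuration has failed.
        return 1
    if not daemon_started:
        # Runtime state, not configuration state: report it either way.
        print("warning: service failed to (re)start", file=sys.stderr)
        if restart_is_critical:
            return 1
    return 0
```

Under this sketch an ordinary daemon that fails to start produces a warning and exit status 0, while the same failure for a service flagged as critical produces exit status 1.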

Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-18 Thread Ian Jackson
Margarita Manterola writes ("Bug#904558: What should happen when maintscripts
fail to restart a service"):
> Sorry that it took so long to get back to this bug.  The other bug took
> all the attention.
> If a postinst fails (for whatever reason), the package is left in a
> broken state (Failed-Config) which in general makes the package
> management system unhappy.

The other effect is that the package's reverse dependencies are not
configured, so their postinsts do not experience a broken situation.

> It seems that the only reason why one may want to do this is to call
> the attention of the sysadmin so that they can solve the problem.
> However, in a world where a large number of users are running automatic
> updates, leaving the package management system in a broken state is
> pretty sad, not very visible and rather confusing for the user when
> they finally encounter it.
> Is there another use case for leaving the package in Failed-Config
> that we missed?

If you deliberately cause the postinst to succeed when the package is
nonfunctional, then the package's r-dependencies will be configured
(ie have their postinsts run) in the broken state.

The r-dependencies' postinsts may then do wrong things.  They may
leave the r-dependencies in anomalous states.  If one takes the
argument you make above to its logical conclusion, all those postinsts
should also report success.

The result is a system where the only thing that is happy is the package
management system, and the records of the root cause of the problem,
and how the failed operations might be reattempted, have been lost.
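
To make the ordering argument concrete, here is a toy model of it. It is not dpkg's actual algorithm, which is far more involved; the package names are invented, and the state labels are illustrative apart from "failed-config", which is the term used in this thread.

```python
# Toy model only: dpkg's real ordering and dependency handling are far more
# involved. Package names are invented; state names are illustrative.

def configure_all(depends, postinst_ok):
    """Configure packages in dependency order.

    depends: maps each package to the list of packages it depends on.
    postinst_ok: maps each package to True if its postinst exits 0.
    Returns a map of package -> resulting state.
    """
    state = {}

    def configure(pkg):
        if pkg in state:
            return
        state[pkg] = "pending"  # placeholder while dependencies are handled
        deps = depends.get(pkg, [])
        for dep in deps:
            configure(dep)
        if any(state[dep] != "configured" for dep in deps):
            # A dependency is not configured, so this postinst is never run.
            state[pkg] = "unconfigured"
        elif postinst_ok[pkg]:
            state[pkg] = "configured"
        else:
            state[pkg] = "failed-config"

    for pkg in depends:
        configure(pkg)
    return state


depends = {"database": [], "webapp": ["database"], "webapp-plugin": ["webapp"]}

# The database postinst reports its failure honestly.
honest = configure_all(
    depends, {"database": False, "webapp": True, "webapp-plugin": True}
)
# The database postinst masks the failure and exits 0.
masked = configure_all(
    depends, {"database": True, "webapp": True, "webapp-plugin": True}
)

print(honest)
print(masked)
```

In the first run only the database is marked as failed and the two packages above it simply wait. In the second run everything is recorded as configured even though the database is not working, so nothing in the recorded state points back at the root cause.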

I guess you will infer from what I write above that I consider
"reporting errors causes the next layer to be unhappy" and "reporting
errors causes the user to be unhappy" to be extraordinarily bad arguments.

There may be good reasons not to treat daemon startup failure as a
postinst failure, but the argument above is not one of them.

> It's unclear why the service (re)start needs to be a special case.

Service (re)starts are more likely to fail for unrelated reasons.
Also some packages are able to provide much of their intended API even
without the daemon.

I think the general rule of thumb should be that a daemon startup
failure should be treated as a configuration failure.

I'm content with a situation where maintainers feel free to diverge
from this if there are reasons to do so.

> I personally think that it would make sense for the policy to at least
> recommend what should happen with regards to maintainer scripts and
> typical operations that are performed in them.

There is already a section on error handling in scripts, which (IMO
correctly) says that shell scripts should use set -e.

When I wrote that, it didn't occur to me that anyone would think that
a failure by a postinst script to perform an intended operation should
be treated any other way than a failure of the postinst script.

(In the usual case.  There are of course lots of situations where the
right approach is some kind of error recovery, or the operation was
attempted "just in case", or something, in which case more subtle
error handling is called for.)
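
The distinction between the usual case and the "just in case" case can be sketched as follows. Maintainer scripts are normally shell, where set -e gives the default behaviour; Python is used here purely for illustration, the helper name `run` and its `best_effort` flag are hypothetical, and the two commands are stand-ins rather than real service operations.

```python
import subprocess
import sys


def run(argv, best_effort=False):
    """Run one step of a maintainer script.

    By default a non-zero exit raises CalledProcessError, which would end the
    script with a failure, roughly what `set -e` does in shell. With
    best_effort=True (an operation attempted "just in case") the failure is
    reported on stderr and otherwise ignored.
    """
    try:
        subprocess.run(argv, check=True)
        return True
    except subprocess.CalledProcessError as err:
        if not best_effort:
            raise
        print(
            f"warning: {argv[0]} exited with status {err.returncode}",
            file=sys.stderr,
        )
        return False


# Stand-ins for real commands so the sketch runs anywhere Python does.
SUCCEEDS = [sys.executable, "-c", "raise SystemExit(0)"]
FAILS = [sys.executable, "-c", "raise SystemExit(1)"]
```

Calling `run(FAILS)` propagates the error and so fails the script, while `run(FAILS, best_effort=True)` only warns. Which operations deserve the second treatment is exactly the judgement call under discussion.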

> And, while I'm open to be convinced otherwise, I don't see any benefit
> from postinst (particularly postinst + configure) ever failing.

Frankly I'm disturbed to be reading this, here.  See above.

If the postinst fails, then the user has the opportunity to fix the
root cause and rerun dpkg --configure --pending.  That will
then repair the system completely.


Ian Jackson   These opinions are my own.

If I emailed you from an address or, that is
a private address which bypasses my fierce spamfilter.

Bug#904558: What should happen when maintscripts fail to restart a service

2018-09-17 Thread Margarita Manterola


Sorry that it took so long to get back to this bug.  The other bug took
all the attention.

On 2018-07-25 06:07, Sean Whitton wrote:

> If postinst or one of the other scripts does a service restart and
> the restart operation fails, should the postinst abort or should it
> mask the error, continue and return success?

We had some discussion around this subject at the past ctte meeting [1],
and after some back and forth we came to the conclusion that in general
it's a bad idea for any postinst to purposely fail, regardless of
whether it was trying to (re)start a service or not.

If a postinst fails (for whatever reason), the package is left in a
broken state (Failed-Config) which in general makes the package
management system unhappy.

It seems that the only reason why one may want to do this is to call
the attention of the sysadmin so that they can solve the problem.
However, in a world where a large number of users are running automatic
updates, leaving the package management system in a broken state is
pretty sad, not very visible and rather confusing for the user when
they finally encounter it.

Is there another use case for leaving the package in Failed-Config
that we missed?


> As a Policy delegate I want to move this issue along, and I can see
> three ways of doing that:
>
> 1. write a patch to explicitly state in Policy that what happens when a
>    service (re)start fails in a maintscript is left up to package
>    maintainer discretion, and close the bugs
>
> 2. make a further attempt to establish consensus on a requirement that
>    maintscripts are consistent in the case of a (re)start failure (this
>    is the default option, so to speak, and I cannot see it succeeding)
>
> 3. ask the T.C. to decide what maintscripts should do in these cases.

It's unclear why the service (re)start needs to be a special case. Any
operation that is performed in a postinst might fall under the same
question of what should happen when that operation fails. Operations like
creating users, creating directories, changing permissions, running a
command to update the contents of a file, and so on.

> The general question about which I am seeking advice: does the
> T.C. think that Debian can be consistent on service (re)starts in
> maintscripts, or is the best we can do to leave it up to package
> maintainer discretion?

We didn't reach this point in our discussion, so this is still an open
question.

I personally think that it would make sense for the policy to at least
recommend what should happen with regards to maintainer scripts and
typical operations that are performed in them.

And, while I'm open to be convinced otherwise, I don't see any benefit
from postinst (particularly postinst + configure) ever failing.

If the only reason for postinst to fail is so that the user knows what
happened, we should devise a better mechanism for informing the user
about the failure.


Bug#904558: What should happen when maintscripts fail to restart a service

2018-08-10 Thread Sean Whitton

Thank you for your reply.

On Thu 09 Aug 2018 at 09:19pm +0200, Tollef Fog Heen wrote:

> ]] Sean Whitton
>> The general question about which I am seeking advice: does the
>> T.C. think that Debian can be consistent on service (re)starts in
>> maintscripts, or is the best we can do to leave it up to package
>> maintainer discretion?
> I think we can give advice on what the default should be and that people
> should not stray from that unless they have particular reasons.  That
> advice might be more appropriate for the developers reference than
> policy, though.

I disagree -- it's about the contents of packages, so it should go into
Policy.  We can make it a recommendation rather than a requirement.

> Due to the variety and complexity of daemons in the archive, I would be
> reluctant to require complete consistency, there are likely various edge
> cases we have not thought about.

It would be useful to write something like this into Policy, rather than
it remaining silent on the issue.  It would be a fine resolution for the
Policy bug in question.

Sean Whitton

Bug#904558: What should happen when maintscripts fail to restart a service

2018-08-09 Thread Tollef Fog Heen
]] Sean Whitton 

> The general question about which I am seeking advice: does the
> T.C. think that Debian can be consistent on service (re)starts in
> maintscripts, or is the best we can do to leave it up to package
> maintainer discretion?

I think we can give advice on what the default should be and that people
should not stray from that unless they have particular reasons.  That
advice might be more appropriate for the developers reference than
policy, though.

Due to the variety and complexity of daemons in the archive, I would be
reluctant to require complete consistency, there are likely various edge
cases we have not thought about.

Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are

Bug#904558: What should happen when maintscripts fail to restart a service

2018-07-24 Thread Sean Whitton
Package: tech-ctte
Control: block 780403 by -1

I hereby request advice from the Technical Committee on a decision that
I must take in my role as a Debian Policy delegate.  To be completely
clear, I am not seeking a decision.  I refer to the third power of the
T.C. listed under section 6.1 of the Debian Constitution: "Any person or
body may ... seek advice from [the Technical Committee]."

In bugs #780403 and #802501 the following question has been asked (I
quote Daniel Pocock):

If postinst or one of the other scripts does a service restart and
the restart operation fails, should the postinst abort or should it
mask the error, continue and return success?

At present the Policy Manual does not answer this question, and thus it
is left up to maintainer discretion: whatever the maintainer thinks
makes sense for the service in question.

Others have pointed out, however, that this means that users will see
inconsistent behaviour.  There is no practical way for a user to
determine what will happen when installing a given package that starts
or restarts a service, if that start or restart attempt fails.  So if it
were possible to come up with a consistent answer to the question posed,
it would be useful to our users.

As a Policy delegate I want to move this issue along, and I can see
three ways of doing that:

1. write a patch to explicitly state in Policy that what happens when a
   service (re)start fails in a maintscript is left up to package
   maintainer discretion, and close the bugs

2. make a further attempt to establish consensus on a requirement that
   maintscripts are consistent in the case of a (re)start failure (this
   is the default option, so to speak, and I cannot see it succeeding)

3. ask the T.C. to decide what maintscripts should do in these cases.

The general question about which I am seeking advice: does the
T.C. think that Debian can be consistent on service (re)starts in
maintscripts, or is the best we can do to leave it up to package
maintainer discretion?


Sean Whitton
