[pruning the CC list]
On 2012-Dec-15 21:52:03 +0100, Pawel Jakub Dawidek <p...@freebsd.org> wrote:
>On Wed, Dec 12, 2012 at 04:02:58PM -0800, Navdeep Parhar wrote:
>> A KASSERT() really is for a condition that should never happen.

and can't be practically recovered from.

>I have sort of mixed feelings about this change, but in reality we have
>three cases:
>
>1. Fatal conditions that shouldn't happen, but may happen for some
>   reason and we definiately want to stop running (corrupted file system
>   metadata that can mess up the file system more badly). For those we have
>   direct panic(9) calls.
>
>2. Fatal conditions that cannot happen and for those we have KASSERT(9).

I don't see the difference between these two cases.  In both of them,
the system has entered a state that the designer/programmer didn't
envisage and can't recover from.  The best option is to abort as
quickly to minimise further corruption.

>3. Non-fatal conditions that cannot happen, which we have no way to
>   report more gracefully and we do it through KASSERT(9).

"Cannot" needs to be quoted here because you can't hit them unless they
actually do happen.

And:
4. Unexpected conditions that should be non-fatal but no-one has
   written the code to recover so someone has included a KASSERT(9) as
   a place-holder.

I suspect a lot of the heat in this discussion is associated with
points 3 & 4 - unexpected conditions that the code can't cope with.

>It is annoying to run INVARIANTS kernel and trigger 3. I had this
>problem few times, for example in TCP/IP stack. It turned out to be
>non-fatal and KASSERT(9) was there to understand the code better.
>I'd much prefer to see it logged than to panic my box. Of course it is
>also sometimes hard to judge if it is fatal or not.
>
>I use plenty of assertions in my code to catch bugs early, but
>assertions like any other part of the code might have bugs and during
>rapid development they help a lot when the code around is changing a lot
>and some earlier assumptions are no longer valid. Many of those
>assertions are non-fatal.

IMHO, having lots of assertions (ala design-by-contract) can make it
easier to understand a function's interfaces.  Of course, if there
are errors in the interface conditions, you can wind up with spurious
panic()s.

>For me this problem is pretty complex, because:
>
>1. It would be nice to have a macro to test non-fatal conditions,

This should be fairly trivial to implement - test the specified
condition and if it's false print out a rate-limited message and
backtrace.

> but it
>   is hard to tell sometimes if it is fatal or not.

As a rule of thumb, if the code can't handle the condition not being
met then it's fatal - this includes my point 4 above.  If the code can
recover (which may require it to take drastic action like aborting a
process or closing a socket) then it's not fatal.  OTOH, IMHO, not
holding an expected lock _is_ fatal, even if there's no immediately
obvious downside to just continuing.

>3. Logging non-fatal "assertions" might make them go unnoticed, but we
>   currently don't enable kernel dumps by default, so when panic occurs
>   user has no idea what has happend, especially if he is in X. Logging
>   would give better chance to actually notice the problem currently.

Rebooting does mean that we restore the system to a sane state, even
if we don't record the previous state.

>In my opinion we should do the following:
>
>1. Leave the option to make KASSERTs non-fatal, but log big fat warning
>   that this is development feature and should not be used in production.

I disagree with this but not strongly enough to say it should be
backed out.  IMHO, an active KASSERT() should be fatal.

>2. Introduce handy macro that would log the problem, but won't panic the
>   box for non-fatal conditions, just like we do with LORs.

Agreed.

>3. Enable kernel dumps by default. The main obstacle to do that was lack
>   of a way to limit number of kernel dumps, which could lead to filling
>   /var/ after hitting some panic/reboot cycle.

I agree that it's reasonable to enable crashdumps by default.  Note that
/var/crash/minfree is supposed to stop /var filling up.

>My contribution to solve this is implementation of 3:
>       http://people.freebsd.org/~pjd/patches/savecore.patch

Adding a "maxdumps" parameter seems a good idea.  The behaviour when
it hits the limit is less clear - throwing away the latest crashdump
is probably as easy to justify as throwing away the oldest one.

A further downside to enabling crashdumps is the time overhead -
initially writing the dump, copying it to /var/crash and running
crashinfo.  This all directly increases the length of the outage.

-- 
Peter Jeremy

Attachment: pgp5wkhrNBmFs.pgp
Description: PGP signature

Reply via email to