[pruning the CC list] On 2012-Dec-15 21:52:03 +0100, Pawel Jakub Dawidek <p...@freebsd.org> wrote: >On Wed, Dec 12, 2012 at 04:02:58PM -0800, Navdeep Parhar wrote: >> A KASSERT() really is for a condition that should never happen.
and can't be practically recovered from. >I have sort of mixed feelings about this change, but in reality we have >three cases: > >1. Fatal conditions that shouldn't happen, but may happen for some > reason and we definiately want to stop running (corrupted file system > metadata that can mess up the file system more badly). For those we have > direct panic(9) calls. > >2. Fatal conditions that cannot happen and for those we have KASSERT(9). I don't see the difference between these two cases. In both of them, the system has entered a state that the designer/programmer didn't envisage and can't recover from. The best option is to abort as quickly to minimise further corruption. >3. Non-fatal conditions that cannot happen, which we have no way to > report more gracefully and we do it through KASSERT(9). "Cannot" needs to be quoted here because you can't hit them unless they actually do happen. And: 4. Unexpected conditions that should be non-fatal but no-one has written the code to recover so someone has included a KASSERT(9) as a place-holder. I suspect a lot of the heat in this discussion is associated with points 3 & 4 - unexpected conditions that the code can't cope with. >It is annoying to run INVARIANTS kernel and trigger 3. I had this >problem few times, for example in TCP/IP stack. It turned out to be >non-fatal and KASSERT(9) was there to understand the code better. >I'd much prefer to see it logged than to panic my box. Of course it is >also sometimes hard to judge if it is fatal or not. > >I use plenty of assertions in my code to catch bugs early, but >assertions like any other part of the code might have bugs and during >rapid development they help a lot when the code around is changing a lot >and some earlier assumptions are no longer valid. Many of those >assertions are non-fatal. IMHO, having lots of assertions (ala design-by-contract) can make it easier to understand a function's interfaces. Of course, if there are errors in the interface conditions, you can wind up with spurious panic()s. >For me this problem is pretty complex, because: > >1. It would be nice to have a macro to test non-fatal conditions, This should be fairly trivial to implement - test the specified condition and if it's false print out a rate-limited message and backtrace. > but it > is hard to tell sometimes if it is fatal or not. As a rule of thumb, if the code can't handle the condition not being met then it's fatal - this includes my point 4 above. If the code can recover (which may require it to take drastic action like aborting a process or closing a socket) then it's not fatal. OTOH, IMHO, not holding an expected lock _is_ fatal, even if there's no immediately obvious downside to just continuing. >3. Logging non-fatal "assertions" might make them go unnoticed, but we > currently don't enable kernel dumps by default, so when panic occurs > user has no idea what has happend, especially if he is in X. Logging > would give better chance to actually notice the problem currently. Rebooting does mean that we restore the system to a sane state, even if we don't record the previous state. >In my opinion we should do the following: > >1. Leave the option to make KASSERTs non-fatal, but log big fat warning > that this is development feature and should not be used in production. I disagree with this but not strongly enough to say it should be backed out. IMHO, an active KASSERT() should be fatal. >2. Introduce handy macro that would log the problem, but won't panic the > box for non-fatal conditions, just like we do with LORs. Agreed. >3. Enable kernel dumps by default. The main obstacle to do that was lack > of a way to limit number of kernel dumps, which could lead to filling > /var/ after hitting some panic/reboot cycle. I agree that it's reasonable to enable crashdumps by default. Note that /var/crash/minfree is supposed to stop /var filling up. >My contribution to solve this is implementation of 3: > http://people.freebsd.org/~pjd/patches/savecore.patch Adding a "maxdumps" parameter seems a good idea. The behaviour when it hits the limit is less clear - throwing away the latest crashdump is probably as easy to justify as throwing away the oldest one. A further downside to enabling crashdumps is the time overhead - initially writing the dump, copying it to /var/crash and running crashinfo. This all directly increases the length of the outage. -- Peter Jeremy
pgp5wkhrNBmFs.pgp
Description: PGP signature