Re: svn commit: r332860 - head/sys/kern

2018-04-24 Thread Conrad Meyer
Next time you encounter something like this, please file a bug.
There's no reason to have broken kernel dumps for a year.  It took 10
minutes to diagnose.

On Tue, Apr 24, 2018 at 10:38 AM, Andrew Gallatin  wrote:
> On 04/24/18 13:24, Jonathan T. Looney wrote:
>>
>> On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin > > wrote:
>>  >
>>  > I think this is actually a key question.  In my experience to date I
>> have not
>>  > encountered a large number of post-panic assertion failures.  Given
>> that
>>  > we already break all locks and disable assertions for locks I'd be
>> curious
>>  > which assertions are actually failing.  My inclination given my
>> experiences
>>  > to date would be to explicitly ignore those as we do for locking if it
>> is
>>  > constrained set rather than blacklisting all of them.  However, I would
>> be
>>  > most interested in seeing some examples of assertions that are failing.
>>
>> The latest example (the one that prompted me to finally commit this) is in
>> lockmgr_sunlock_try(): 'panic: Assertion (*xp & ~LK_EXCLUSIVE_SPINNERS) ==
>> LK_SHARERS_LOCK(1) failed at /usr/src/sys/kern/kern_lock.c:541'
>>
>> I don't see any obvious recent changes that would have caused this, so
>> this is probably a case where a change to another file suddenly made us trip
>> over this assert.
>
>
> FWIW, that assertion has prevented me from getting a dump from an
> INVARIANTS kernel for at least a year.
>
> Drew
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-24 Thread John Baldwin
On Tuesday, April 24, 2018 01:24:30 PM Jonathan T. Looney wrote:
> On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin  wrote:
> >
> > I think this is actually a key question.  In my experience to date I have
> not
> > encountered a large number of post-panic assertion failures.  Given that
> > we already break all locks and disable assertions for locks I'd be curious
> > which assertions are actually failing.  My inclination given my
> experiences
> > to date would be to explicitly ignore those as we do for locking if it is
> > constrained set rather than blacklisting all of them.  However, I would be
> > most interested in seeing some examples of assertions that are failing.
> 
> The latest example (the one that prompted me to finally commit this) is in
> lockmgr_sunlock_try(): 'panic: Assertion (*xp & ~LK_EXCLUSIVE_SPINNERS) ==
> LK_SHARERS_LOCK(1) failed at /usr/src/sys/kern/kern_lock.c:541'

So that's one of the few assertions in a locking primitive that hasn't been
explicitly neutered.  I would neuter it explicitly by adjusting that assertion
to not fail if panicstr != NULL.  lockmgr() itself is mostly neutered already,
though I would move the existing panicstr check in _lockmgr_args slightly higher
above the list of KASSERT()'s.  I would say this is just a bug in lockmgr in
that it isn't careful to break locks and ignore assertions during a panic.  I
consider locking assertions to be a class that should be neutered, but I think
it's a different argument to expand that to ignoring all assertions.  Do you
have any examples of non-locking assertions?

> I don't see any obvious recent changes that would have caused this, so this
> is probably a case where a change to another file suddenly made us trip
> over this assert.

It's arguably a bug in r313683 which is only a year old.

-- 
John Baldwin
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-24 Thread Mark Johnston
On Tue, Apr 24, 2018 at 01:24:30PM -0400, Jonathan T. Looney wrote:
> On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin  wrote:
> >
> > I think this is actually a key question.  In my experience to date I have
> not
> > encountered a large number of post-panic assertion failures.  Given that
> > we already break all locks and disable assertions for locks I'd be curious
> > which assertions are actually failing.  My inclination given my
> experiences
> > to date would be to explicitly ignore those as we do for locking if it is
> > constrained set rather than blacklisting all of them.  However, I would be
> > most interested in seeing some examples of assertions that are failing.
> 
> The latest example (the one that prompted me to finally commit this) is in
> lockmgr_sunlock_try(): 'panic: Assertion (*xp & ~LK_EXCLUSIVE_SPINNERS) ==
> LK_SHARERS_LOCK(1) failed at /usr/src/sys/kern/kern_lock.c:541'
> 
> I don't see any obvious recent changes that would have caused this, so this
> is probably a case where a change to another file suddenly made us trip
> over this assert.
> 
> And, that really illustrates my overall point:

Mine too. :)

Why is anything trying to acquire a lockmgr lock after a panic? What is
the stack? I suspect that CAM is completing non-dump CCBs after a panic,
which can cause deadlocks if the completion handler needs to perform a
TLB shootdown after destroying a mapping, for example. In fact, I had
forgotten that Isilon has some CAM patches which attempt to address this
because of the problems that such deadlocks had caused. I will work on
getting these reviewed and upstreamed.

> most assertions in
> general-use code have limited value after a panic.
>
> We expect developers to write high-quality assertions so we can catch bugs.
> This requires that they understand how their code will be used. However,
> once we've panic'd, many assumptions behind code change and the assertions
> are no longer valid. (And, sometimes, it is difficult for a developer to
> predict how these things will change in a panic situation.) We can either
> play whack-a-mole to modify assertions as we trip over them in our
> post-panic work, or we can switch to an opt-in model where we only check
> assertions which the developer actually intends to run post-panic.
> 
> Playing whack-a-mole seems like a Sisyphean task which will burn out
> developers and/or frustrate people who run INVARIANTS kernels. Switching to
> an opt-in model seems like the better long-term strategy.
> 
> Having said all of that, I am cognizant of at least two things:
> 1) Mark Johnston has done a lot of work in coredumps and thinks there are
> post-panic assertions that have value.
> 2) Until we have both agreement to switch our post-panic assertion paradigm
> and also infrastructure to allow developers to opt in, it probably is not
> wise to disable all assertions by default.
> 
> So, I will follow Mark's suggestions: I will change the default. I will
> also change the code so we print a limited number of failed assertions.

Thanks.

> However, I think that changing the post-panic assertion paradigm is an
> important conversation to have. We want people to run our INVARIANTS
> kernels. And, we want to get high-quality reports from those. I think we
> could better serve those goals by changing the post-panic assertion
> paradigm.
> 
> Jonathan
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-24 Thread Andrew Gallatin

On 04/24/18 13:24, Jonathan T. Looney wrote:
On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin > wrote:

 >
 > I think this is actually a key question.  In my experience to date I 
have not

 > encountered a large number of post-panic assertion failures.  Given that
 > we already break all locks and disable assertions for locks I'd be 
curious
 > which assertions are actually failing.  My inclination given my 
experiences

 > to date would be to explicitly ignore those as we do for locking if it is
 > constrained set rather than blacklisting all of them.  However, I 
would be

 > most interested in seeing some examples of assertions that are failing.

The latest example (the one that prompted me to finally commit this) is 
in lockmgr_sunlock_try(): 'panic: Assertion (*xp & 
~LK_EXCLUSIVE_SPINNERS) == LK_SHARERS_LOCK(1) failed at 
/usr/src/sys/kern/kern_lock.c:541'


I don't see any obvious recent changes that would have caused this, so 
this is probably a case where a change to another file suddenly made us 
trip over this assert.


FWIW, that assertion has prevented me from getting a dump from an
INVARIANTS kernel for at least a year.

Drew
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-24 Thread Jonathan T. Looney
On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin  wrote:
>
> I think this is actually a key question.  In my experience to date I have
not
> encountered a large number of post-panic assertion failures.  Given that
> we already break all locks and disable assertions for locks I'd be curious
> which assertions are actually failing.  My inclination given my
experiences
> to date would be to explicitly ignore those as we do for locking if it is
> constrained set rather than blacklisting all of them.  However, I would be
> most interested in seeing some examples of assertions that are failing.

The latest example (the one that prompted me to finally commit this) is in
lockmgr_sunlock_try(): 'panic: Assertion (*xp & ~LK_EXCLUSIVE_SPINNERS) ==
LK_SHARERS_LOCK(1) failed at /usr/src/sys/kern/kern_lock.c:541'

I don't see any obvious recent changes that would have caused this, so this
is probably a case where a change to another file suddenly made us trip
over this assert.

And, that really illustrates my overall point: most assertions in
general-use code have limited value after a panic.

We expect developers to write high-quality assertions so we can catch bugs.
This requires that they understand how their code will be used. However,
once we've panic'd, many assumptions behind code change and the assertions
are no longer valid. (And, sometimes, it is difficult for a developer to
predict how these things will change in a panic situation.) We can either
play whack-a-mole to modify assertions as we trip over them in our
post-panic work, or we can switch to an opt-in model where we only check
assertions which the developer actually intends to run post-panic.

Playing whack-a-mole seems like a Sisyphean task which will burn out
developers and/or frustrate people who run INVARIANTS kernels. Switching to
an opt-in model seems like the better long-term strategy.

Having said all of that, I am cognizant of at least two things:
1) Mark Johnston has done a lot of work in coredumps and thinks there are
post-panic assertions that have value.
2) Until we have both agreement to switch our post-panic assertion paradigm
and also infrastructure to allow developers to opt in, it probably is not
wise to disable all assertions by default.

So, I will follow Mark's suggestions: I will change the default. I will
also change the code so we print a limited number of failed assertions.

However, I think that changing the post-panic assertion paradigm is an
important conversation to have. We want people to run our INVARIANTS
kernels. And, we want to get high-quality reports from those. I think we
could better serve those goals by changing the post-panic assertion
paradigm.

Jonathan
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-23 Thread John Baldwin
On Monday, April 23, 2018 02:00:24 PM Mark Johnston wrote:
> On Mon, Apr 23, 2018 at 11:12:32AM -0400, Jonathan T. Looney wrote:
> > Hi Mark,
> > 
> > Let me start by saying that I appreciate your well-reasoned response. (I
> > think) I understand your reasoning, I appreciate your well-explained
> > argument, and I respect your opinion. I just wanted to make that clear up
> > front.
> > 
> > On Sun, Apr 22, 2018 at 1:11 PM, Mark Johnston  wrote:
> > >
> > > > All too often, my ability to debug assertion violations is hindered
> > because
> > > > the system trips over yet another assertion while dumping the core. If
> > we
> > > > skip the assertion, nothing bad happens. (The post-panic debugging code
> > > > already needs to deal with systems that are inconsistent, and it does a
> > > > pretty good job at it.)
> > >
> > > I think we make a decent effort to fix such problems as they arise, but
> > > you are declaring defeat on behalf of everyone. Did you make some effort
> > > to fix or report these issues before resorting to the more drastic
> > > measure taken here?
> > 
> > We try to report or fix them as they arise. However, you don't know there
> > is a problem until you actually run into it. And, you don't run into the
> > problem until you can't get a core dump due to the assertion.
> > 
> > (And, with elusive problems, it isn't always easy to duplicate them. So,
> > fixing the assertion is sometimes "too late".)
> 
> Sure, this is true. But unless it's a problem in practice it's obviously
> preferable to keep assertions enabled. Kernel dumping itself is a
> fundamentally unreliable mechanism, but it works well enough to be
> useful. I basically never see problems with post-panic assertion
> failures, and I test the kernel dump code a fair bit. Isilon exercises
> that code quite a lot as well without any problems that I'm aware of,
> and I can't think of any reports of such assertion failures that weren't
> quickly fixed. So I'm wondering what problems exist in your specific
> environment that we might instead address surgically.
> 
> (I could very well be wrong about how widespread post-panic assertion
> failures are. We've had problems of this sort before, e.g., with the
> updated DRM graphics drivers, where the code to grab the console after a
> panic didn't work properly. There, the bandaid was to just disable that
> specific mechanism.)

I think this is actually a key question.  In my experience to date I have not
encountered a large number of post-panic assertion failures.  Given that
we already break all locks and disable assertions for locks I'd be curious
which assertions are actually failing.  My inclination given my experiences
to date would be to explicitly ignore those as we do for locking if it is
constrained set rather than blacklisting all of them.  However, I would be
most interested in seeing some examples of assertions that are failing.

-- 
John Baldwin
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-23 Thread Mark Johnston
On Mon, Apr 23, 2018 at 11:12:32AM -0400, Jonathan T. Looney wrote:
> Hi Mark,
> 
> Let me start by saying that I appreciate your well-reasoned response. (I
> think) I understand your reasoning, I appreciate your well-explained
> argument, and I respect your opinion. I just wanted to make that clear up
> front.
> 
> On Sun, Apr 22, 2018 at 1:11 PM, Mark Johnston  wrote:
> >
> > > All too often, my ability to debug assertion violations is hindered
> because
> > > the system trips over yet another assertion while dumping the core. If
> we
> > > skip the assertion, nothing bad happens. (The post-panic debugging code
> > > already needs to deal with systems that are inconsistent, and it does a
> > > pretty good job at it.)
> >
> > I think we make a decent effort to fix such problems as they arise, but
> > you are declaring defeat on behalf of everyone. Did you make some effort
> > to fix or report these issues before resorting to the more drastic
> > measure taken here?
> 
> We try to report or fix them as they arise. However, you don't know there
> is a problem until you actually run into it. And, you don't run into the
> problem until you can't get a core dump due to the assertion.
> 
> (And, with elusive problems, it isn't always easy to duplicate them. So,
> fixing the assertion is sometimes "too late".)

Sure, this is true. But unless it's a problem in practice it's obviously
preferable to keep assertions enabled. Kernel dumping itself is a
fundamentally unreliable mechanism, but it works well enough to be
useful. I basically never see problems with post-panic assertion
failures, and I test the kernel dump code a fair bit. Isilon exercises
that code quite a lot as well without any problems that I'm aware of,
and I can't think of any reports of such assertion failures that weren't
quickly fixed. So I'm wondering what problems exist in your specific
environment that we might instead address surgically.

(I could very well be wrong about how widespread post-panic assertion
failures are. We've had problems of this sort before, e.g., with the
updated DRM graphics drivers, where the code to grab the console after a
panic didn't work properly. There, the bandaid was to just disable that
specific mechanism.)

> > > On the other hand, I really am not sure what you are worried might
> happen
> > > if we skip checking assertions after we've already panic'd. As far as I
> can
> > > tell, the likely worst case is that we hit a true panic of some kind. In
> > > that case, we're no worse off than before.
> > >
> > > I think the one obvious exception is when we're purposely trying to
> > > validate the post-panic debugging code. In that case, you can change the
> > > sysctl/tunable to enable troubleshooting.
> >
> > What about a user whose test system panics and fails to dump? With
> > assertions enabled, a developer has a better chance of spotting the
> > problem. Now we need at least one extra round trip to the user to
> > diagnose the problem, which may not be readily reproducible in the first
> > place.
> 
> That's true. However, this is equally true in the other direction: Prior to
> this change, when a user tripped over an assertion and was unable to get a
> coredump due to a second post-panic assertion, it took (at least) another
> round-trip to get a coredump.
> 
> First, without the capability to ignore assertions after a panic
> (introduced by this commit), you would need to fix the actual assertion to
> enable the user to get a coredump. At minimum, I think this change has
> value in that case. This change gives you a mechanism to get a coredump
> without requiring that you fix the assertion and get the user to recompile
> with the patch.
> 
> But, moreover, if we change the default back to panic'ing on a second
> assertion, we will hamper our ability to get usable reports about elusive
> bugs. If we leave the default "as is", it won't take an extra round-trip to
> tell the user how to get a coredump. If we change the default (or, perhaps
> more correctly, "restore the prior default"), we will still need a second
> round-trip to get coredumps. That makes it tough to chase elusive bugs.

I agree with what you're saying. I'm thinking more of the long-term
effects of this change and am concerned that it increases the potential
for actual bugs to appear in the kernel dump code paths. Those types of
bugs are often quite tricky to track down from a single instance, and
can cause dumps to fail. If that starts to happen, we're basically back
to where we started.
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-23 Thread Warner Losh
On Mon, Apr 23, 2018 at 9:12 AM, Jonathan T. Looney 
wrote:
>
> If we leave the default alone, I agree we should print the assertion
> message (albeit with some rate limit).
>

We should print the first N asserts we hit during a kernel panic core dump,
then stop. I'd suggest N should be 10 or 20.

For me, the only asserts that have value during a panic are the ones that
prevent bad things from happening. The only bad thing that really can
happen is either wedging the machine, so we don't finish the dump
(watchdogs solve that) or writing to a naughty place (defined as not within
the bounds of a swap partition). I'm having a hard time seeing in what
actual value assertions that aren't related to either of these areas buy us.

Warner
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-23 Thread Jonathan T. Looney
Hi Mark,

Let me start by saying that I appreciate your well-reasoned response. (I
think) I understand your reasoning, I appreciate your well-explained
argument, and I respect your opinion. I just wanted to make that clear up
front.

On Sun, Apr 22, 2018 at 1:11 PM, Mark Johnston  wrote:
>
> > All too often, my ability to debug assertion violations is hindered
because
> > the system trips over yet another assertion while dumping the core. If
we
> > skip the assertion, nothing bad happens. (The post-panic debugging code
> > already needs to deal with systems that are inconsistent, and it does a
> > pretty good job at it.)
>
> I think we make a decent effort to fix such problems as they arise, but
> you are declaring defeat on behalf of everyone. Did you make some effort
> to fix or report these issues before resorting to the more drastic
> measure taken here?

We try to report or fix them as they arise. However, you don't know there
is a problem until you actually run into it. And, you don't run into the
problem until you can't get a core dump due to the assertion.

(And, with elusive problems, it isn't always easy to duplicate them. So,
fixing the assertion is sometimes "too late".)


> > On the other hand, I really am not sure what you are worried might
happen
> > if we skip checking assertions after we've already panic'd. As far as I
can
> > tell, the likely worst case is that we hit a true panic of some kind. In
> > that case, we're no worse off than before.
> >
> > I think the one obvious exception is when we're purposely trying to
> > validate the post-panic debugging code. In that case, you can change the
> > sysctl/tunable to enable troubleshooting.
>
> What about a user whose test system panics and fails to dump? With
> assertions enabled, a developer has a better chance of spotting the
> problem. Now we need at least one extra round trip to the user to
> diagnose the problem, which may not be readily reproducible in the first
> place.

That's true. However, this is equally true in the other direction: Prior to
this change, when a user tripped over an assertion and was unable to get a
coredump due to a second post-panic assertion, it took (at least) another
round-trip to get a coredump.

First, without the capability to ignore assertions after a panic
(introduced by this commit), you would need to fix the actual assertion to
enable the user to get a coredump. At minimum, I think this change has
value in that case. This change gives you a mechanism to get a coredump
without requiring that you fix the assertion and get the user to recompile
with the patch.

But, moreover, if we change the default back to panic'ing on a second
assertion, we will hamper our ability to get usable reports about elusive
bugs. If we leave the default "as is", it won't take an extra round-trip to
tell the user how to get a coredump. If we change the default (or, perhaps
more correctly, "restore the prior default"), we will still need a second
round-trip to get coredumps. That makes it tough to chase elusive bugs.


> > I would honestly appreciate someone explaining the dangers in disabling
a
> > response to assertion violations after we've already panic'd and are
simply
> > trying to troubleshoot, because they are not obvious to me. But, I could
> > simply be missing them.
>
> The assertions help identify code that is being executed during a dump
> when it shouldn't be. In general we rely on users to opt in to running
> INVARIANTS kernels because developers don't catch every single bug. With
> this change it's harder to be confident in the kernel dump code. (Or in
> any post-panic debugging code for that matter.)

I can appreciate that. I am generally skeptical of the value of assertions
in general-use code after a panic, since we already know the system is in
an inconsistent/unexpected state. And, it is hard to predict all the
various ways it could possibly be broken. However, I do recognize that
there is code which specifically is written to run post-panic, and which
has assertions which SHOULD be true, even after a panic.


> I dislike the change and would prefer the default to be inverted. At the
> very least I think we should print the assertion message rather than
> returning silently from kassert_panic().

I still think this change has value (as described above). I can understand
the argument for changing the default. In fact, after thinking about your
email, I'm leaning towards doing that. But, I want to ponder it more today.

If we leave the default alone, I agree we should print the assertion
message (albeit with some rate limit).

Jonathan
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-22 Thread Mark Johnston
On Sat, Apr 21, 2018 at 04:33:33PM -0400, Jonathan Looney wrote:
> On Sat, Apr 21, 2018 at 1:53 PM, Conrad Meyer  wrote:
> >
> > I don't think this should be enabled by default.  Can we leave it
> > disabled by default and let consumers opt-in?
> 
> I'm willing to change the default if there's a good reason or consensus for
> that. However, it is not obvious to me that the default is actually wrong.
> 
> I think its important that we remember where we are when we hit this code.
> 
> By the time we hit this code, we've already panic'd (whether due to a
> "true" panic or a violated assertion). At this point, the system is already
> in an unknown/unexpected state and we want to preserve our ability to
> troubleshoot and/or recover. (And, since we're running a system with
> INVARIANTS, presumably the ability to troubleshoot is important.)
> 
> We may well violate a second assertion simply because the system is in an
> unknown/unexpected state. Once we do that, the question is whether we
> should try to preserve the ability to troubleshoot, or should give up and
> reset the system.
> 
> All too often, my ability to debug assertion violations is hindered because
> the system trips over yet another assertion while dumping the core. If we
> skip the assertion, nothing bad happens. (The post-panic debugging code
> already needs to deal with systems that are inconsistent, and it does a
> pretty good job at it.)

I think we make a decent effort to fix such problems as they arise, but
you are declaring defeat on behalf of everyone. Did you make some effort
to fix or report these issues before resorting to the more drastic
measure taken here?

> On the other hand, I really am not sure what you are worried might happen
> if we skip checking assertions after we've already panic'd. As far as I can
> tell, the likely worst case is that we hit a true panic of some kind. In
> that case, we're no worse off than before.
> 
> I think the one obvious exception is when we're purposely trying to
> validate the post-panic debugging code. In that case, you can change the
> sysctl/tunable to enable troubleshooting.

What about a user whose test system panics and fails to dump? With
assertions enabled, a developer has a better chance of spotting the
problem. Now we need at least one extra round trip to the user to
diagnose the problem, which may not be readily reproducible in the first
place.

> I would honestly appreciate someone explaining the dangers in disabling a
> response to assertion violations after we've already panic'd and are simply
> trying to troubleshoot, because they are not obvious to me. But, I could
> simply be missing them.

The assertions help identify code that is being executed during a dump
when it shouldn't be. In general we rely on users to opt in to running
INVARIANTS kernels because developers don't catch every single bug. With
this change it's harder to be confident in the kernel dump code. (Or in
any post-panic debugging code for that matter.)

I dislike the change and would prefer the default to be inverted. At the
very least I think we should print the assertion message rather than
returning silently from kassert_panic().
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-21 Thread Jonathan Looney
On Sat, Apr 21, 2018 at 1:53 PM, Conrad Meyer  wrote:
>
> I don't think this should be enabled by default.  Can we leave it
> disabled by default and let consumers opt-in?

I'm willing to change the default if there's a good reason or consensus for
that. However, it is not obvious to me that the default is actually wrong.

I think its important that we remember where we are when we hit this code.

By the time we hit this code, we've already panic'd (whether due to a
"true" panic or a violated assertion). At this point, the system is already
in an unknown/unexpected state and we want to preserve our ability to
troubleshoot and/or recover. (And, since we're running a system with
INVARIANTS, presumably the ability to troubleshoot is important.)

We may well violate a second assertion simply because the system is in an
unknown/unexpected state. Once we do that, the question is whether we
should try to preserve the ability to troubleshoot, or should give up and
reset the system.

All too often, my ability to debug assertion violations is hindered because
the system trips over yet another assertion while dumping the core. If we
skip the assertion, nothing bad happens. (The post-panic debugging code
already needs to deal with systems that are inconsistent, and it does a
pretty good job at it.)

On the other hand, I really am not sure what you are worried might happen
if we skip checking assertions after we've already panic'd. As far as I can
tell, the likely worst case is that we hit a true panic of some kind. In
that case, we're no worse off than before.

I think the one obvious exception is when we're purposely trying to
validate the post-panic debugging code. In that case, you can change the
sysctl/tunable to enable troubleshooting.

I would honestly appreciate someone explaining the dangers in disabling a
response to assertion violations after we've already panic'd and are simply
trying to troubleshoot, because they are not obvious to me. But, I could
simply be missing them.

Jonathan
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-21 Thread Eitan Adler
On 21 April 2018 at 10:05, Jonathan T. Looney  wrote:
> Author: jtl
> Date: Sat Apr 21 17:05:00 2018
> New Revision: 332860
> URL: https://svnweb.freebsd.org/changeset/base/332860
>
> Log:
>   When running with INVARIANTS, the kernel contains extra checks.  However,
>   these assumptions may not hold true once we've panic'd. Therefore, the
>   checks hold less value after a panic.  Additionally, if one of the checks
>   fails while we are already panic'd, this creates a double-panic which can
>   interfere with debugging the original panic.

Rather than do this I'd rather we modify the invariants to more
explicitly state under what conditions it holds.
This might be something like

KASSERT(!panic && ...)
or
KASSERT_NOT_IN_PANIC(...)
or some other spelling.



-- 
Eitan Adler
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-21 Thread Conrad Meyer
On Sat, Apr 21, 2018 at 10:41 AM, Bruce Evans  wrote:
> panic() can't return, but I see that KASSERT() has already been broken
> to use kassert_panic() which does return in some cases including this
> new one.

Oddly enough, I find myself agreeing with Bruce on this.  That
kassert_panic does not always assert, during ordinary (non-panic)
runtime, based on a runtime configurable toggle breaks the concept of
invariants and confuses the heck out of static analyzers like
Coverity.

Ideally, we just remove it.  IMO it is a crappy hack that should have
remained in iX's local tree.

If we want to be really generous, we can make it an off-by-default
build option.  Is anyone clamoring for allowing violation of multiple
assertions without panic, other than Linus Torvalds?

> KASSERT(9) is still documented to call panic(), and none of the options
> to break it including this new one, or kassert_panic() itself are
> documented in KASSERT(9) or in any other section 9 man page.

Yeah.  This is unfortunate :-(.

Best,
Conrad
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-21 Thread Conrad Meyer
On Sat, Apr 21, 2018 at 10:05 AM, Jonathan T. Looney  wrote:
> Author: jtl
> Date: Sat Apr 21 17:05:00 2018
> New Revision: 332860
> URL: https://svnweb.freebsd.org/changeset/base/332860
>
> Log:
>   When running with INVARIANTS, the kernel contains extra checks.  However,
>   these assumptions may not hold true once we've panic'd. Therefore, the
>   checks hold less value after a panic.  Additionally, if one of the checks
>   fails while we are already panic'd, this creates a double-panic which can
>   interfere with debugging the original panic.
>
>   Therefore, this commit allows an administrator to suppress a response to
>   KASSERT checks after a panic by setting a tunable/sysctl.  The
>   tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
>   enabled.

Hi Jonathan,

I don't think this should be enabled by default.  Can we leave it
disabled by default and let consumers opt-in?

To expand on this a little: this is a big hammer.  We already disable
specific invariants in a few cases during panic (lock assertions come
to mind).  If there are specific assertions that do not hold during
panic, we can/should selectively weaken them.  But in general,
invariants are invariant, and we should not proceed past violated ones
by default.

Thanks,
Conrad
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r332860 - head/sys/kern

2018-04-21 Thread Bruce Evans

On Sat, 21 Apr 2018, Jonathan T. Looney wrote:


Log:
 When running with INVARIANTS, the kernel contains extra checks.  However,
 these assumptions may not hold true once we've panic'd. Therefore, the
 checks hold less value after a panic.  Additionally, if one of the checks
 fails while we are already panic'd, this creates a double-panic which can
 interfere with debugging the original panic.

 Therefore, this commit allows an administrator to suppress a response to
 KASSERT checks after a panic by setting a tunable/sysctl.  The
 tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
 enabled.
...
Modified: head/sys/kern/kern_shutdown.c
==
--- head/sys/kern/kern_shutdown.c   Sat Apr 21 15:15:47 2018
(r332859)
+++ head/sys/kern/kern_shutdown.c   Sat Apr 21 17:05:00 2018
(r332860)
@@ -642,6 +642,7 @@ static int kassert_do_log = 1;
static int kassert_log_pps_limit = 4;
static int kassert_log_mute_at = 0;
static int kassert_log_panic_at = 0;
+static int kassert_suppress_in_panic = 1;
static int kassert_warnings = 0;

SYSCTL_NODE(_debug, OID_AUTO, kassert, CTLFLAG_RW, NULL, "kassert options");
@@ -676,6 +677,10 @@ SYSCTL_INT(_debug_kassert, OID_AUTO, log_pps_limit, CT
SYSCTL_INT(_debug_kassert, OID_AUTO, log_mute_at, CTLFLAG_RWTUN,
&kassert_log_mute_at, 0, "max number of KASSERTS to log");

+SYSCTL_INT(_debug_kassert, OID_AUTO, suppress_in_panic, CTLFLAG_RWTUN,
+&kassert_suppress_in_panic, 0,
+"KASSERTs will be suppressed while handling a panic");
+
static int kassert_sysctl_kassert(SYSCTL_HANDLER_ARGS);

SYSCTL_PROC(_debug_kassert, OID_AUTO, kassert,
@@ -707,6 +712,10 @@ kassert_panic(const char *fmt, ...)
{
static char buf[256];
va_list ap;
+
+   /* If we already panic'd, don't create a double-fault. */
+   if (panicstr != NULL && kassert_suppress_in_panic)
+   return;

va_start(ap, fmt);
(void)vsnprintf(buf, sizeof(buf), fmt, ap);


panic() can't return, but I see that KASSERT() has already been broken
to use kassert_panic() which does return in some cases including this
new one.

KASSERT(9) is still documented to call panic(), and none of the options
to break it including this new one, or kassert_panic() itself are
documented in KASSERT(9) or in any other section 9 man page.

Lots of code was and still is written under the assumption that KASSERT()
works as documented.  E.g., after a null pointer check using KASSERT(),
the pointer is normally used and the page fault for this should cause a
panic anyway.  This is actually a feature -- it works around KASSERT()
breaking restartability of the fault.  Other cases are not so good.  There
man be a cascade of KASSERT() failures, and the new option prevents even
printing a message about even one of them.  If you are lucky then you get
a clean null pointer recursive panic.

The only example in KASSERT(9) is to give the worst use of it (for breaking
clean null pointer panics when KASSERT() works as documented, but now doesn't
even break them when KASSERT() returns).

Bruce
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


svn commit: r332860 - head/sys/kern

2018-04-21 Thread Jonathan T. Looney
Author: jtl
Date: Sat Apr 21 17:05:00 2018
New Revision: 332860
URL: https://svnweb.freebsd.org/changeset/base/332860

Log:
  When running with INVARIANTS, the kernel contains extra checks.  However,
  these assumptions may not hold true once we've panic'd. Therefore, the
  checks hold less value after a panic.  Additionally, if one of the checks
  fails while we are already panic'd, this creates a double-panic which can
  interfere with debugging the original panic.
  
  Therefore, this commit allows an administrator to suppress a response to
  KASSERT checks after a panic by setting a tunable/sysctl.  The
  tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being
  enabled.
  
  Reviewed by:  kib
  Sponsored by: Netflix, Inc.
  Differential Revision:https://reviews.freebsd.org/D12920

Modified:
  head/sys/kern/kern_shutdown.c

Modified: head/sys/kern/kern_shutdown.c
==
--- head/sys/kern/kern_shutdown.c   Sat Apr 21 15:15:47 2018
(r332859)
+++ head/sys/kern/kern_shutdown.c   Sat Apr 21 17:05:00 2018
(r332860)
@@ -642,6 +642,7 @@ static int kassert_do_log = 1;
 static int kassert_log_pps_limit = 4;
 static int kassert_log_mute_at = 0;
 static int kassert_log_panic_at = 0;
+static int kassert_suppress_in_panic = 1;
 static int kassert_warnings = 0;
 
 SYSCTL_NODE(_debug, OID_AUTO, kassert, CTLFLAG_RW, NULL, "kassert options");
@@ -676,6 +677,10 @@ SYSCTL_INT(_debug_kassert, OID_AUTO, log_pps_limit, CT
 SYSCTL_INT(_debug_kassert, OID_AUTO, log_mute_at, CTLFLAG_RWTUN,
 &kassert_log_mute_at, 0, "max number of KASSERTS to log");
 
+SYSCTL_INT(_debug_kassert, OID_AUTO, suppress_in_panic, CTLFLAG_RWTUN,
+&kassert_suppress_in_panic, 0,
+"KASSERTs will be suppressed while handling a panic");
+
 static int kassert_sysctl_kassert(SYSCTL_HANDLER_ARGS);
 
 SYSCTL_PROC(_debug_kassert, OID_AUTO, kassert,
@@ -707,6 +712,10 @@ kassert_panic(const char *fmt, ...)
 {
static char buf[256];
va_list ap;
+
+   /* If we already panic'd, don't create a double-fault. */
+   if (panicstr != NULL && kassert_suppress_in_panic)
+   return;
 
va_start(ap, fmt);
(void)vsnprintf(buf, sizeof(buf), fmt, ap);
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"