(cc Jan)

On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:

> In some cases we may end up killing the CPU holding the console lock
> while still having valuable data in logbuf. E.g. I'm observing the
> following:
> - A crash is happening on one CPU and console_unlock() is being called on
>   some other.
> - console_unlock() tries to print out the buffer before releasing the lock
>   and on slow console it takes time.
> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
>   (which go to the logbuf) and sends IPIs to all other CPUs.
> - console_unlock() finishes printing previous chunk and enables interrupts
>   before trying to print out the rest, the CPU catches the IPI and never
>   releases console lock.

Why doesn't the lock-owning CPU release the console lock?  Because it
was stopped by smp_send_stop() in panic()?

I don't recall why we stop CPUs in panic(), and of course we didn't
document the reason.  I guess it makes sense from the "what else can we
do" point of view, but I wonder if we can just do it later on - that
would fix this problem?

(dumb aside: why doesn't smp_send_stop() stop the calling CPU?)

> This is not the only possible case: in VT/fb subsystems we have many other
> console_lock()/console_unlock() users. Non-masked interrupts (or receiving
> NMI in case of extreme slowness) will have the same result. Getting the
> whole console buffer printed out on crash should be top priority.

Yes, this is a pretty big hole in the logic.

> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -23,6 +23,7 @@
>  #include <linux/sysrq.h>
>  #include <linux/init.h>
>  #include <linux/nmi.h>
> +#include <linux/console.h>
>  
>  #define PANIC_TIMER_STEP 100
>  #define PANIC_BLINK_SPD 18
> @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
>  
>       bust_spinlocks(0);
>  
> +     /*
> +      * We may have ended up killing the CPU holding the lock and still have
> +      * some valuable data in console buffer. Try to acquire the lock and
> +      * release it regardless of the result. The release will also print the
> +      * buffers out.
> +      */
> +     console_trylock();
> +     console_unlock();
> +

"killing the CPU" is a bit vague.  How's this look?

--- 
a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
+++ a/kernel/panic.c
@@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
        bust_spinlocks(0);
 
        /*
-        * We may have ended up killing the CPU holding the lock and still have
-        * some valuable data in console buffer. Try to acquire the lock and
-        * release it regardless of the result. The release will also print the
-        * buffers out.
+        * We may have ended up stopping the CPU holding the lock (in
+        * smp_send_stop()) while still having some valuable data in the console
+        * buffer.  Try to acquire the lock then release it regardless of the
+        * result.  The release will also print the buffers out.
         */
        console_trylock();
        console_unlock();
_


Does the console_trylock() guarantee that the console lock is now held?
If the console_lock-holding CPU is still running then there's a window
where the above code could enter console_unlock() when nobody's holding
console_lock.  If smp_send_stop() always works (synchronously) then
that won't happen.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to