(cc Jan) On Wed, 7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
> In some cases we may end up killing the CPU holding the console lock > while still having valuable data in logbuf. E.g. I'm observing the > following: > - A crash is happening on one CPU and console_unlock() is being called on > some other. > - console_unlock() tries to print out the buffer before releasing the lock > and on slow console it takes time. > - in the meanwhile crashing CPU does lots of printk()-s with valuable data > (which go to the logbuf) and sends IPIs to all other CPUs. > - console_unlock() finishes printing previous chunk and enables interrupts > before trying to print out the rest, the CPU catches the IPI and never > releases console lock. Why doesn't the lock-owning CPU release the console lock? Because it was stopped by smp_send_stop() in panic()? I don't recall why we stop CPUs in panic(), and of course we didn't document the reason. I guess it makes sense from the "what else can we do" point of view, but I wonder if we can just do it later on - that would fix this problem? (dumb aside: why doesn't smp_send_stop() stop the calling CPU?) > This is not the only possible case: in VT/fb subsystems we have many other > console_lock()/console_unlock() users. Non-masked interrupts (or receiving > NMI in case of extreme slowness) will have the same result. Getting the > whole console buffer printed out on crash should be top priority. Yes, this is a pretty big hole in the logic. > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -23,6 +23,7 @@ > #include <linux/sysrq.h> > #include <linux/init.h> > #include <linux/nmi.h> > +#include <linux/console.h> > > #define PANIC_TIMER_STEP 100 > #define PANIC_BLINK_SPD 18 > @@ -147,6 +148,15 @@ void panic(const char *fmt, ...) > > bust_spinlocks(0); > > + /* > + * We may have ended up killing the CPU holding the lock and still have > + * some valuable data in console buffer. Try to acquire the lock and > + * release it regardless of the result. The release will also print the > + * buffers out. > + */ > + console_trylock(); > + console_unlock(); > + "killing the CPU" is a bit vague. How's this look? --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix +++ a/kernel/panic.c @@ -149,10 +149,10 @@ void panic(const char *fmt, ...) bust_spinlocks(0); /* - * We may have ended up killing the CPU holding the lock and still have - * some valuable data in console buffer. Try to acquire the lock and - * release it regardless of the result. The release will also print the - * buffers out. + * We may have ended up stopping the CPU holding the lock (in + * smp_send_stop()) while still having some valuable data in the console + * buffer. Try to acquire the lock then release it regardless of the + * result. The release will also print the buffers out. */ console_trylock(); console_unlock(); _ Does the console_trylock() guarantee that the console lock is now held? If the console_lock-holding CPU is still running then there's a window where the above code could enter console_unlock() when nobody's holding console_lock. If smp_send_stop() always works (synchronously) then that won't happen. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/