On (07/19/19 14:57), Petr Mladek wrote: [..] > > Where do nested printk()-s come from? Which one of the following > > scenarios you cover in commit message: > > > > scenario 1 > > > > - we have CPUB which holds logbuf_lock > > - we have CPUA which panic()-s the system, but can't bring CPUB down, > > so logbuf_lock stays locked on remote CPU > > No, this scenario is not affected by this patch. It would always lead to > a deadlock.
Agreed, in many cases we won't be able to panic() the system properly, deadlocking somewhere in smp_send_stop(). > > scenario 2 > > > > - we have CPUA which holds logbuf_lock > > - we have panic() on CPUA, but it cannot bring down some other CPUB > > so logbuf_lock stays locked on local CPU, and it cannot re-init > > logbuf. [..] > + Before: > + printk_safe_flush_on_panic() will keep logbuf_lock locked > and do nothing. > > + kmsg_dump(), console_unblank(), or console_flush_on_panic() > will deadlock when they try to get logbuf_lock(). They will > not be able to process any single line. > > + After: > + printk_bust_lock_safe() will keep logbuf_lock locked > > + All functions using logbuf_lock will not get called. > We will not see the messages (as previously) but the > system will not deadlock. > > > But there is one more scenario 3: Yes! > - we have CPUB which loops or is deadlocked in IRQ context > > - we have CPUA which panic()-s the system, but can't bring CPUB down, > so logbuf_lock might be takes and release from time to time > by CPUB Great! This is the only case when we actually need to pay attention to num_online_cpus(), because there is an active logbuf_lock owner; in any other case we can unconditionally re-init printk() locks. But there is more to it. Note, that the problem in scenario 3 is bigger than just logbuf_lock. Regardless of logbuf implementation we will not be able to panic() the system. If we have a never ending source of printk() messages, coming from misbehaving CPU which stuck in printing loop in IRQ context, then flush_on_panic() will never end or kmsg dump will never stop, etc. We need to cut off misbehaving CPUs. Panic CPU waits (for up to 1 second?) in smp_send_stop() for secondary CPUs to die, if some secondary CPUs are still chatty after that then most likely those CPUs don't have anything good to say, just a pointless flood of same messages over and over again; which, however, will not let panic CPU to proceed. And this is where the idea of "disconnecting" those CPUs from main logbuf come from. So what we can do: - smp_send_stop() - disconnect all-but-self from logbuf (via printk-safe) - wait for 1 or 2 more extra seconds for secondary CPUs to leave console_unlock() and to redirect printks to per-CPU buffers - after that we are sort of good-to-go: re-init printk locks and do kmsg_dump, flush_on_panic(). Note, misbehaving CPUs will write to per-CPU buffers, they are not expected to be able to flush per-CPU buffers to the main logbuf. That will require enabled IRQs, which should deliver stop IPI. But we can do even more - just disable print_safe irq work on disconnect CPUs. So, shall we try one more time with the "disconnect" misbehaving CPUs approach? I can send an RFC patch. -ss