On Mon 2018-10-01 15:23:24, Steven Rostedt wrote:
> On Thu, 27 Sep 2018 12:46:01 -0700
> Daniel Wang <wonder...@google.com> wrote:
>
> > Prior to this change, the combination of `softlockup_panic=1` and
> > `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the
> > reboot path is trying to grab the console lock that is held by the
> > stack trace printing path. What seems to be happening is that while
> > there are multiple CPUs, only one of them is tasked to print the back
> > trace of all CPUs. On a machine with many CPUs and a slow serial
> > console (on Google Compute Engine for example), the stack trace
> > printing routine hits a timeout and the reboot path kicks in. The
> > latter then tries to print something else, but can't get the lock
> > because it's still held by earlier printing path. This is easily
> > reproducible on a VM with 16+ vCPUs on Google Compute Engine - which
> > is a very common scenario.
> >
> > A quick repro is available at
> > https://github.com/wonderfly/printk-deadlock-repro. The system hangs
> > 3 seconds into executing repro.sh. Both deadlock analysis and repro
> > are credits to Peter Feiner.
> >
> > Note that I have read previous discussions on backporting this to
> > stable [1]. The argument for objecting the backport was that this is
> > a non-trivial fix and is supported to prevent hypothetical soft
> > lockups. What we are hitting is a real deadlock, in production,
> > however. Hence this request.
> >
> > [1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3x...@pathway.suse.cz/T/#u
> >
> > Serial console logs leading up to the deadlock. As can be seen the
> > stack trace was incomplete because the printing path hit a timeout.
>
> I'm fine with having this backported.

Dunno. Is the patch perhaps a bit too complex? This is not exactly
trivial bugfix.

pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
 printk.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-

I see that it is pretty critical to Daniel, but maybe kernel with
console locking redone should no longer be called 4.4?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html