On Wed, 23 Apr 2014, Rik van Riel wrote:

> >> Echoing values into /proc/sysrq-trigger seems to be a popular way to
> >> get information out of the kernel. However, dumping information about
> >> thousands of processes, or hundreds of CPUs to serial console can
> >> result in IRQs being blocked for minutes, resulting in various kinds
> >> of cascade failures.
> >>
> >> The most common failure is due to interrupts being blocked for a very
> >> long time. This can lead to things like failed IO requests, and other
> >> things the system cannot easily recover from.
> >>
> >> This problem is easily fixable by making __handle_sysrq use RCU
> >> instead of spin_lock_irqsave.
> >>
> >> This leaves the warning that RCU grace periods have not elapsed for a
> >> long time, but the system will come back from that automatically.
> >
> > This, however, will make RCU stall detector to send NMI to all online CPUs
> > so that they can dump their stacks.
>
> It already does that, since several of the longer-running
> sysrq handlers already grab rcu_read_lock(), for example
> show_state().
>
> > IOW, this might actually make the whole sysrq dump last for much longer,
> > and have the log polluted with all-CPU dumps for no good reason.
> >
> > I wonder whether explicitly setting rcu_cpu_stall_suppress during sysrq
> > handling might be a viable workaround for this.
>
> I suppose that would do the trick.

I can imagine Paul opposing this though ... this variable is supposed to
be changed only by cmdline/modparam, not really flipped during runtime as
a bandaid ... let's add Paul to CC.

-- 
Jiri Kosina
SUSE Labs