On Mon 11-11-13 22:54:15, Pavel Machek wrote:
> Hi!
> 
> > > > A CPU can be caught in console_unlock() for a long time (tens of seconds
> > > > are reported by our customers) when other CPUs are using printk heavily
> > > > and serial console makes printing slow. Although serial console drivers
> > > > call touch_nmi_watchdog(), this still triggers softlockup warnings
> > > > because interrupts are disabled for the whole time console_unlock() runs
> > > > (e.g. vprintk() calls console_unlock() with interrupts disabled). Thus
> > > > IPIs cannot be processed and other CPUs get stuck spinning in calls like
> > > > smp_call_function_many(). Also RCU eventually starts reporting lockups.
> > > >
> > > > In my artificial testing I can also easily trigger a situation where a
> > > > disk disappears from the system, apparently because an interrupt from it wasn't
> > > > served for too long. This is why just silencing watchdogs isn't a
> > > > reliable solution to the problem and we simply have to avoid spending
> > > > too long in console_unlock() with interrupts disabled.
> > > >
> > > > The solution this patch works toward is to postpone printing to a later
> > > > moment / different CPU when we already printed over X characters in
> > > > current console_unlock() invocation. This is a crude heuristic but
> > > > measuring time we spent printing doesn't seem to be really viable - we
> > > > cannot rely on high resolution time being available and with interrupts
> > > > disabled jiffies are not updated. User can tune the value X via
> > > > printk.offload_chars kernel parameter.
> > > >
> > > > Reviewed-by: Steven Rostedt <rost...@goodmis.org>
> > > > Signed-off-by: Jan Kara <j...@suse.cz>
> > > 
> > > When a message takes tens of seconds to be printed, it usually means
> > > we are in trouble somehow :)
> > > I wonder what printk source can trigger such a high volume.
> >   Machines with tens of processors and thousands of SCSI devices. When
> > device discovery happens on boot, all processors are busily reporting new
> > SCSI devices and one poor loser is bound to do the printing forever and
> > ever until the machine dies...
> 
> Dunno. In these cases, would it make sense to:
> 
> 1) reduce amount of text printed
  I thought about this as well. But
a) It doesn't seem practical, as you would have to modify lots of drivers
   and keep them rather silent. That seems rather fragile. Plus you would
   no longer display some potentially useful information.
b) It doesn't address the real underlying problem: the way printk() is
   currently implemented, there is no bound on the time a CPU spends in
   the loop printing from the buffer to the console. And the fact that
   this loop sometimes runs with interrupts disabled makes the situation
   even worse.
 
> 2) just print [XXX characters lost] on overruns?
  We don't overrun the printk buffer, so no characters are lost. It just
takes too long to feed the whole printk buffer through the serial console...
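
To make the heuristic from the patch description concrete, here is a
minimal userspace sketch of "stop printing once more than X characters
went out in this invocation and leave the rest for someone else". It is
not the patch itself: struct log_buf, drain_bounded() and
slow_console_write() are names made up for illustration, and treating
offload_chars == 0 as "no limit" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a log buffer; not kernel code. */
struct log_buf {
	const char *data;	/* pending text, newline-terminated records */
	size_t len;		/* total number of bytes in data */
	size_t next;		/* index of the first byte not yet printed */
};

/* Stand-in for a slow console driver: only counts the bytes it is given. */
static size_t console_bytes_written;

static void slow_console_write(const char *s, size_t n)
{
	(void)s;
	console_bytes_written += n;
}

/*
 * Print pending records until the buffer is drained or until more than
 * offload_chars bytes were printed in this call.
 *
 * Returns 1 if records remain (the caller should hand the rest to a
 * later moment / different context), 0 if the buffer is empty.
 * offload_chars == 0 means "no limit" (assumption of this sketch).
 */
static int drain_bounded(struct log_buf *buf, size_t offload_chars)
{
	size_t printed = 0;

	assert(buf != NULL);

	while (buf->next < buf->len) {
		size_t end;

		/* Check the budget only between records; never split a line. */
		if (offload_chars && printed > offload_chars)
			return 1;

		/* Find the end of the current record, newline included. */
		end = buf->next;
		while (end < buf->len && buf->data[end] != '\n')
			end++;
		if (end < buf->len)
			end++;

		slow_console_write(buf->data + buf->next, end - buf->next);
		printed += end - buf->next;
		buf->next = end;
	}

	return 0;
}
```

The point is only that the work done by one caller is bounded by roughly
offload_chars plus one record, no matter how fast other CPUs append. What
the caller does when drain_bounded() returns 1 (who picks up the rest,
and when) is exactly the part the sketch leaves out.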

                                                                Honza
-- 
Jan Kara <j...@suse.cz>
SUSE Labs, CR