Tejun Heo wrote:
> On Wed, Jan 10, 2018 at 07:08:30AM +0900, Tetsuo Handa wrote:
> > > * Netconsole tries to send out OOM messages and tries memory
> > > allocation which fails which then prints allocation failed messages.
> > > Because this happens while already printing, it just queues the
> > > messages to the buffer [..]
Hello, Steven.
On Wed, Jan 10, 2018 at 02:18:27AM -0500, Steven Rostedt wrote:
> My point is, that your test is only hammering at a single CPU. You say
> it is the scenario you see, which means that the OOM is printing out
> more than it should, because if it prints it out once, it should not
> print it again. [..]
On Tue, 9 Jan 2018 14:53:56 -0800
Tejun Heo wrote:
> Hello, Steven.
>
> On Tue, Jan 09, 2018 at 05:47:50PM -0500, Steven Rostedt wrote:
> > > Maybe it can break out eventually but that can take a really long
> > > time. It's OOM. Most of userland is waiting for reclaim. There
> > > isn't all that much going on outside that and there can only be one CPU [..]
Hello, Steven.
On Tue, Jan 09, 2018 at 05:47:50PM -0500, Steven Rostedt wrote:
> > Maybe it can break out eventually but that can take a really long
> > time. It's OOM. Most of userland is waiting for reclaim. There
> > isn't all that much going on outside that and there can only be one
> > CPU [..]
On Tue, 9 Jan 2018 14:17:05 -0800
Tejun Heo wrote:
> Hello, Steven.
>
> On Tue, Jan 09, 2018 at 05:08:47PM -0500, Steven Rostedt wrote:
> > The scenario you listed would affect multiple CPUs and multiple CPUs
> > would be flooding printk. In that case my patch WILL help. Because the
> > current method, the first CPU to do the printk will get stuck doing the [..]
On Wed, Jan 10, 2018 at 07:08:30AM +0900, Tetsuo Handa wrote:
> > * Netconsole tries to send out OOM messages and tries memory
> > allocation which fails which then prints allocation failed messages.
> > Because this happens while already printing, it just queues the
> > messages to the buffer [..]
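Spelled out, the feedback loop described in the quotes above looks like this (a reconstruction from the visible text; the elided tail of the message may word it differently):

    OOM killer printk()s its warnings
      -> netconsole transmits each line, allocating packet memory
      -> the allocation fails under memory pressure
      -> the failure is itself reported via printk()
      -> printing is already in progress, so the new messages are only
         queued to the log buffer
      -> the queued messages mean more netconsole traffic, more failed
         allocations, and so on, and the backlog never drains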
Hello, Steven.
On Tue, Jan 09, 2018 at 05:08:47PM -0500, Steven Rostedt wrote:
> The scenario you listed would affect multiple CPUs and multiple CPUs
> would be flooding printk. In that case my patch WILL help. Because the
> current method, the first CPU to do the printk will get stuck doing the [..]
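For reference, the hand-off mechanism Steven's patch introduces works roughly like the sketch below. This is a simplified reconstruction, not the actual patch: next_pending_record() and call_console_driver() are stand-ins for the real log-buffer iteration inside console_unlock(), while console_owner and console_waiter loosely follow the names used in the patch.

    extern const char *next_pending_record(void);       /* stand-in */
    extern void call_console_driver(const char *line);  /* stand-in */
    extern struct semaphore console_sem;                /* static in printk.c */

    static struct task_struct *console_owner;
    static bool console_waiter;

    void console_flush_sketch(void)
    {
        for (;;) {
            const char *line = next_pending_record();
            if (!line)
                break;

            WRITE_ONCE(console_owner, current);
            call_console_driver(line);          /* the slow part */
            WRITE_ONCE(console_owner, NULL);

            /*
             * If another CPU entered printk() while we were writing,
             * it is spinning on us; let it take over the flushing
             * duty so no single CPU stays pinned on the console.
             */
            if (READ_ONCE(console_waiter)) {
                WRITE_ONCE(console_waiter, false);
                return;
            }
        }
        up(&console_sem);
    }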
On Tue, 9 Jan 2018 12:06:20 -0800
Tejun Heo wrote:
> What's happening is that the OOM killer is trapped flushing printk
> failing to clear the memory condition and that leads irq / softirq
> contexts to produce messages faster than can be flushed. I don't see
> how we'd be able to clear the condition [..]
Tejun Heo wrote:
> The code might suck but I think this does replicate what we've been
> seeing regularly in the fleet. The console side is pretty slow - IPMI
> faithfully emulating serial console. I don't know whether it's doing
> 115200 or even slower. Please consider something like the following.
[..]
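For scale: at 115200 baud with 8N1 framing the line moves at most 115200 / 10 = 11,520 characters per second, so one 80-column message takes about 7 ms and a thousand lines of OOM output occupy the flushing CPU for roughly 7 seconds; a slower emulated rate makes this proportionally worse. Any workload that produces log lines faster than the console drains them grows the backlog without bound, which is the condition Tejun describes above.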
Hello, Steven.
My apologies for the late reply. Was traveling and then got sick.
On Thu, Dec 21, 2017 at 11:19:32PM -0500, Steven Rostedt wrote:
> You don't think handing off printks to an offloaded thread isn't more
> complex nor can it cause even more issues (like more likely to lose
> relevant [..]
On (12/04/17 22:48), Sergey Senozhatsky wrote:
> A new version, yet another rework. Lots of changes, e.g. hand off
> control based on Steven's patch. Another change is that this time around
> we finally have a kernel module to test printk offloading (YAYY!). The
> module tests a bunch of use cases; we also have trace printk events [..]
Hello,
On (12/29/17 22:59), Tetsuo Handa wrote:
[..]
> Just an idea: Do we really need to use a semaphore for console_sem?
>
> Is it possible to replace it with a spinlock? Then, I feel that we can write
> to consoles from non-process context (i.e. soft or hard IRQ context), with
> write only one [..]
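For concreteness, the idea seems to amount to something like the sketch below (hypothetical; console_spinlock, console_write_all() and the simplified call_console_drivers() signature are stand-ins, not the real printk code). The catch is that everything done under a spinlock must be short and non-sleeping, which slow console drivers are not:

    extern void call_console_drivers(const char *text, size_t len); /* stand-in */

    static DEFINE_SPINLOCK(console_spinlock);   /* would replace console_sem */

    static void console_write_all(const char *text, size_t len)
    {
        unsigned long flags;

        /*
         * Unlike down(&console_sem), this is usable from soft and hard
         * IRQ context -- but the console drivers then run with local
         * interrupts disabled and must never sleep.
         */
        spin_lock_irqsave(&console_spinlock, flags);
        call_console_drivers(text, len);
        spin_unlock_irqrestore(&console_spinlock, flags);
    }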
Sergey Senozhatsky wrote:
> On (12/28/17 15:48), Sergey Senozhatsky wrote:
> [..]
> > and I'm actually thinking about returning back the old vprintk_emit()
> > behavior
> >
> >vprintk_emit()
> >{
> > + preempt_disable();
> > if (console_trylock())
> > 	console_unlock();
> > + preempt_enable();
> >}
On (12/28/17 15:48), Sergey Senozhatsky wrote:
[..]
> and I'm actually thinking about returning back the old vprintk_emit()
> behavior
>
>vprintk_emit()
>{
> + preempt_disable();
> if (console_trylock())
> console_unlock();
> + preempt_enable();
>}
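The motivation for (re-)adding the preempt_disable()/preempt_enable() pair shows up repeatedly later in the thread: the task that wins console_trylock() flushes the whole backlog inside console_unlock(), and if it gets preempted mid-flush, nothing reaches the console while it is off-CPU. Roughly (timings illustrative):

    console_sem owner                     other CPUs
    -----------------                     ----------
    console_trylock() -> success
    call_console_drivers(...)             printk() -> log_store() only
    <preempted, off-CPU for a while>      printk() -> log_store() only
                                          (log_store() is much faster than
                                           call_console_drivers(), so the
                                           backlog only grows until the
                                           owner runs again)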
Hello,
On (12/21/17 23:19), Steven Rostedt wrote:
[..]
> > I wrote this before but this isn't a theoretical problem. We see
> > these stalls a lot. Preemption isn't enabled to begin with. Memory
> > pressure is high and OOM triggers and printk starts printing out OOM
> > warnings; then, a network [..]
On Thu, 21 Dec 2017 16:09:32 -0800
Tejun Heo wrote:
>
> I tried your v4 patch and ran the test module and could easily
> reproduce RCU stall and other issues stemming from a CPU getting
> pegged down by printk flushing. I'm attaching the test module code at
> the end.
Thanks, I'll take a look.
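The module code itself is elided from this listing. For the general shape, a minimal flooding module along the following lines (a sketch, not Tejun's actual code) is enough to keep one CPU busy against a slow console:

    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/delay.h>
    #include <linux/err.h>

    static struct task_struct *flood_task;

    static int flood_fn(void *data)
    {
        /* produce log lines faster than a slow console can drain them */
        while (!kthread_should_stop()) {
            printk(KERN_INFO "printk flood test\n");
            udelay(100);
        }
        return 0;
    }

    static int __init flood_init(void)
    {
        flood_task = kthread_run(flood_fn, NULL, "printk_flood");
        return PTR_ERR_OR_ZERO(flood_task);
    }

    static void __exit flood_exit(void)
    {
        kthread_stop(flood_task);
    }

    module_init(flood_init);
    module_exit(flood_exit);
    MODULE_LICENSE("GPL");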
Hello,
Sorry about the long delay.
On Thu, Dec 14, 2017 at 01:21:09PM -0500, Steven Rostedt wrote:
> > Yeah, will do, but out of curiosity, Sergey and I already described
> > what the root problem was and you didn't really seem to take that. Is
> > that because the explanation didn't make sense [..]
Hi Tetsuo,
On (12/20/17 21:06), Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
>
[..]
>
> Anyway, the rule that "do not try to printk() faster than the kernel can
> write to consoles" will remain no matter how printk() changes.
and the "faster than the kernel can write to consoles" is tricky.
it [..]
Sergey Senozhatsky wrote:
> Steven said that this scenario is possible, but is not of any particular
> interest, because printk from IRQ or from any other atomic context is a
> bad thing, which should happen only when something wrong is going on in
> the system. but we are in OOM or have just returned [..]
On (12/19/17 09:40), Steven Rostedt wrote:
> On Tue, 19 Dec 2017 13:58:46 +0900
> Sergey Senozhatsky wrote:
>
> > so you are not convinced that my scenarios are real / matter; I'm not
>
> Well, not with the test module. I'm looking for actual code in the
> upstream kernel.
>
> > convinced that I have stable and functional boards with this patch ;) [..]
Hello,
not sure if you've been following the whole thread, so I'll try
to summarize it here. apologies if it'll massively repeat the things
that have already been said or will be too long.
On (12/19/17 15:31), Michal Hocko wrote:
> On Tue 19-12-17 10:24:55, Sergey Senozhatsky wrote:
> > On (12/18 [..]
On Tue, 19 Dec 2017 13:58:46 +0900
Sergey Senozhatsky wrote:
> so you are not convinced that my scenarios are real / matter; I'm not
Well, not with the test module. I'm looking for actual code in the
upstream kernel.
> convinced that I have stable and functional boards with this patch ;)
> seems that [..]
On Tue 19-12-17 10:24:55, Sergey Senozhatsky wrote:
> On (12/18/17 20:08), Steven Rostedt wrote:
> > > ... do you guys read my emails? which part of the traces I have provided
> > > suggests that there is any improvement?
> >
> > The traces I've seen from you were from non-realistic scenarios.
> >
On (12/18/17 22:38), Steven Rostedt wrote:
> On Tue, 19 Dec 2017 11:46:10 +0900
> Sergey Senozhatsky wrote:
>
> > anyway,
> > before you guys push the patch to printk.git, can we wait for Tejun to
> > run his tests against it?
>
> I've been asking for that since day one ;-)
ok
so you are not convinced [..]
On (12/18/17 12:46), Steven Rostedt wrote:
> > One question is if we really want to rely on offloading in
> > this case. What if this is printed to debug some stalled
> > system.
>
> Correct, and this is what I call when debugging hard lockups, and I do
> it from NMI.
[..]
> show_state_filter() is [..]
On Tue, 19 Dec 2017 11:46:10 +0900
Sergey Senozhatsky wrote:
> anyway,
> before you guys push the patch to printk.git, can we wait for Tejun to
> run his tests against it?
I've been asking for that since day one ;-)
-- Steve
On (12/18/17 21:03), Steven Rostedt wrote:
> > and this is exactly what I'm still observing. i_do_printks-1992 stops
> > printing, while console_sem is owned by another task. Since log_store()
> > is much faster than call_console_drivers() AND the console_sem owner is
> > getting preempted for unknown periods [..]
On Tue, 19 Dec 2017 10:24:55 +0900
Sergey Senozhatsky wrote:
> On (12/18/17 20:08), Steven Rostedt wrote:
> > > ... do you guys read my emails? which part of the traces I have provided
> > > suggests that there is any improvement?
> >
> > The traces I've seen from you were from non-realistic scenarios. [..]
On (12/18/17 20:08), Steven Rostedt wrote:
> > ... do you guys read my emails? which part of the traces I have provided
> > suggests that there is any improvement?
>
> The traces I've seen from you were from non-realistic scenarios.
> But I have hit issues with printk()s happening that cause one CPU [..]
On (12/18/17 15:10), Petr Mladek wrote:
[..]
> > All this is in the current upstream code as well. Steven's patch
> > should make it better in compare with the current upstream code.
> >
> > Sure, the printk offload approach does all these things better.
> > But there is still the fear that the offloading [..]
On Tue, 19 Dec 2017 10:03:11 +0900
Sergey Senozhatsky wrote:
> ... do you guys read my emails? which part of the traces I have provided
> suggests that there is any improvement?
The traces I've seen from you were from non-realistic scenarios. But I
have hit issues with printk()s happening that cause one CPU [..]
On Tue, 19 Dec 2017 09:52:48 +0900
Sergey Senozhatsky wrote:
> > The case here, you are talking about a CPU doing console_lock() from a
> > non printk() case. Which is what I was asking about how often this
> > happens.
>
> I'd say often enough. but the point I was trying to make is that we ca[..]
On (12/18/17 12:46), Steven Rostedt wrote:
> On Mon, 18 Dec 2017 15:13:53 +0100
> Petr Mladek wrote:
>
> > One question is if we really want to rely on offloading in
> > this case. What if this is printed to debug some stalled
> > system.
>
> Correct, and this is what I call when debugging hard lockups, and I do it from NMI. [..]
Hello Steven,
I couldn't reply sooner.
On (12/15/17 10:19), Steven Rostedt wrote:
> > On (12/14/17 22:18), Steven Rostedt wrote:
> > > > Steven, your approach works ONLY when we have the following
> > > > preconditions:
> > > >
> > > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic, etc) context [..]
On Mon, 18 Dec 2017 15:13:53 +0100
Petr Mladek wrote:
> One question is if we really want to rely on offloading in
> this case. What if this is printed to debug some stalled
> system.
Correct, and this is what I call when debugging hard lockups, and I do
it from NMI. Which the new NMI code prevents [..]
On Mon 2017-12-18 22:39:48, Sergey Senozhatsky wrote:
> On (12/18/17 14:31), Petr Mladek wrote:
> > On Mon 2017-12-18 18:36:15, Sergey Senozhatsky wrote:
> > > On (12/15/17 10:08), Petr Mladek wrote:
> > > 1) it opens both soft and hard lockup vectors
> > >
> > >I see *a lot* of cases when the CPU that calls printk in a loop does
> > >not end up flushing its messages. [..]
On Mon 2017-12-18 14:31:01, Petr Mladek wrote:
> On Mon 2017-12-18 18:36:15, Sergey Senozhatsky wrote:
> > - it has a significantly worse behaviour compared to old async printk.
> > - it keeps sleeping on console_sem tasks in TASK_UNINTERRUPTIBLE
> > for a long time.
> > - it times out user [..]
On Mon 2017-12-18 19:36:24, Sergey Senozhatsky wrote:
> it takes call_console_drivers() 0.01+ seconds to print some of
> the messages [I think we can ignore raw_spin_lock(&console_owner_lock)
> and fully blame call_console_drivers()]. so vprintk_emit() seems to be
> a gazillion times faster and i_do_printks [..]
On (12/18/17 14:31), Petr Mladek wrote:
> On Mon 2017-12-18 18:36:15, Sergey Senozhatsky wrote:
> > On (12/15/17 10:08), Petr Mladek wrote:
> > 1) it opens both soft and hard lockup vectors
> >
> >I see *a lot* of cases when the CPU that calls printk in a loop does not
> >end up flushing its messages. And the problem seems to be - preemption. [..]
On Mon 2017-12-18 18:36:15, Sergey Senozhatsky wrote:
> On (12/15/17 10:08), Petr Mladek wrote:
> 1) it opens both soft and hard lockup vectors
>
>I see *a lot* of cases when the CPU that calls printk in a loop does not
>end up flushing its messages. And the problem seems to be - preemption.
>
On (12/18/17 19:36), Sergey Senozhatsky wrote:
[..]
> it takes call_console_drivers() 0.01+ seconds to print some of
> the messages [I think we can ignore raw_spin_lock(&console_owner_lock)
> and fully blame call_console_drivers()]. so vprintk_emit() seems to be
> a gazillion times faster and i_do_printks [..]
On (12/18/17 18:36), Sergey Senozhatsky wrote:
[..]
>I see *a lot* of cases when the CPU that calls printk in a loop does not
>end up flushing its messages. And the problem seems to be - preemption.
>
>
>     CPU0                                CPU1
>
>     for_each_process_thread [..]
On (12/15/17 10:08), Petr Mladek wrote:
[..]
> > > Is the above scenario really dangerous? console_lock() owner is
> > > able to sleep. Therefore there is no risk of a softlockup.
> > >
> > > Sure, many messages will get stacked in the meantime and the console
> > > owner may then get passed to another [..]
On Thu, 14 Dec 2017 23:39:36 +0900
Sergey Senozhatsky wrote:
> On (12/14/17 15:27), Petr Mladek wrote:
> >
> > Therefore I tend to give Steven's solution a chance before this
> > combined approach.
> >
>
> have you seen this https://marc.info/?l=linux-kernel&m=151015850209859
> or this https://marc.info/?l=linux-kernel&m=151011840830776&w=2 [..]
On Fri, 15 Dec 2017 10:08:01 +0100
Petr Mladek wrote:
> You are looking for a perfect solution. But there is no perfect
> solution.
"Perfection is the enemy of 'good enough'" :-)
-- Steve
On Fri, 15 Dec 2017 09:31:51 +0100
Petr Mladek wrote:
> Do people have issues with the current upstream printk() or
> still even with Steven's patch?
>
> My current view is that Steven's patch could not make things
> worse. I was afraid of possible deadlock but it seems that I was
> wrong. Other [..]
On Fri, 15 Dec 2017 15:52:05 +0900
Sergey Senozhatsky wrote:
> On (12/15/17 14:06), Sergey Senozhatsky wrote:
> [..]
> > > Where do we do the above? And has this been proven to be an issue?
> >
> > um... hundreds of cases.
> >
> > deep-stack spin_lock_irqsave() lockup reports from multiple CPUs (3 cpus) [..]
On Fri, 15 Dec 2017 14:06:07 +0900
Sergey Senozhatsky wrote:
> Hello,
>
> On (12/14/17 22:18), Steven Rostedt wrote:
> > > Steven, your approach works ONLY when we have the following preconditions:
> > >
> > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> > > etc) context [..]
On Fri 2017-12-15 17:42:36, Sergey Senozhatsky wrote:
> On (12/15/17 09:31), Petr Mladek wrote:
> > > On (12/14/17 22:18), Steven Rostedt wrote:
> > > > > Steven, your approach works ONLY when we have the following
> > > > > preconditions:
> > > > >
> > > > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic, etc) context [..]
On (12/15/17 09:31), Petr Mladek wrote:
> > On (12/14/17 22:18), Steven Rostedt wrote:
> > > > Steven, your approach works ONLY when we have the following
> > > > preconditions:
> > > >
> > > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> > > > etc) context
> > >
On Fri 2017-12-15 14:06:07, Sergey Senozhatsky wrote:
> Hello,
>
> On (12/14/17 22:18), Steven Rostedt wrote:
> > > Steven, your approach works ONLY when we have the following preconditions:
> > >
> > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> > > etc) context [..]
On (12/15/17 14:06), Sergey Senozhatsky wrote:
[..]
> > Where do we do the above? And has this been proven to be an issue?
>
> um... hundreds of cases.
>
> deep-stack spin_lock_irqsave() lockup reports from multiple CPUs (3 cpus)
> happening at the same moment + NMI backtraces from all the CPUs [..]
Hello,
On (12/14/17 22:18), Steven Rostedt wrote:
> > Steven, your approach works ONLY when we have the following preconditions:
> >
> > a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> > etc) context
> >
> > what does guarantee that? what happens if there is [..]
On Fri, 15 Dec 2017 11:10:24 +0900
Sergey Senozhatsky wrote:
> Steven, your approach works ONLY when we have the following preconditions:
>
> a) there is a CPU that is calling printk() from the 'safe' (non-atomic,
> etc) context
>
> what does guarantee that? what happens if there is [..]
Hello,
On (12/14/17 10:11), Tejun Heo wrote:
> Hey, Steven.
>
> On Thu, Dec 14, 2017 at 12:55:06PM -0500, Steven Rostedt wrote:
> > Yes! Please create a reproducer, because I still don't believe there is
> > one. And it's all hand waving until there's an actual report that we can
> > lock up the system with my approach. [..]
On Thu, 14 Dec 2017 10:11:53 -0800
Tejun Heo wrote:
> Hey, Steven.
>
> On Thu, Dec 14, 2017 at 12:55:06PM -0500, Steven Rostedt wrote:
> > Yes! Please create a reproducer, because I still don't believe there is
> > one. And it's all hand waving until there's an actual report that we can
> > lock up the system with my approach. [..]
Hey, Steven.
On Thu, Dec 14, 2017 at 12:55:06PM -0500, Steven Rostedt wrote:
> Yes! Please create a reproducer, because I still don't believe there is
> one. And it's all hand waving until there's an actual report that we can
> lock up the system with my approach.
Yeah, will do, but out of curiosity, Sergey and I already described what the root problem was [..]
On Thu, 14 Dec 2017 07:25:51 -0800
Tejun Heo wrote:
> > 3. Soft-lockups are still theoretically possible with Steven's
> >approach.
> >
> >But it seems to be quite efficient in many real life scenarios,
> >including Tetsuo's stress testing. Or am I wrong?
>
> AFAICS, Steven's approach [..]
Hello, Petr.
On Thu, Dec 14, 2017 at 03:27:09PM +0100, Petr Mladek wrote:
> Ah, I know that it was me who was pessimistic about Steven's approach[1]
> and persuaded you that offloading idea was still alive. But I am less
> sure now.
So, I don't really care which one gets in as long as the livelock [..]
On (12/14/17 15:27), Petr Mladek wrote:
>
> Therefore I tend to give Steven's solution a chance before this
> combined approach.
>
have you seen this https://marc.info/?l=linux-kernel&m=151015850209859
or this https://marc.info/?l=linux-kernel&m=151011840830776&w=2
or this https://marc.info/?l=li [..]
On Mon 2017-12-04 22:48:13, Sergey Senozhatsky wrote:
> Hello,
>
> RFC
>
> A new version, yet another rework. Lots of changes, e.g. hand off
> control based on Steven's patch. Another change is that this time around
> we finally have a kernel module to test printk offloading (YAYY!).
Hello,
RFC
A new version, yet another rework. Lots of changes, e.g. hand off
control based on Steven's patch. Another change is that this time around
we finally have a kernel module to test printk offloading (YAYY!). The
module tests a bunch of use cases; we also have trace printk events [..]