Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-08 Thread Sergey Senozhatsky
On (09/06/19 17:32), Petr Mladek wrote:
> > [..]
> > > I mean, really, do we need to keep calling wake up if it
> > > probably never even executed?
> > 
> > I guess ratelimiting you are talking about ("if it probably never even
> > executed") would be to check if we have already called wake up on the
> > log_wait ->head. For that we need to, at least, take log_wait spin_lock
> > and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
> > but not exactly) close to what wake_up_interruptible() does - it doesn't
> > wake up the same task twice, it bails out on `p->state & state' check.
> 
> I have just realized that only sleeping tasks are in the waitqueue.
> It is already handled by waitqueue_active() check.

Yes.

> I am afraid that we could not ratelimit the wakeups. The userspace
> loggers might then miss the last lines for a long.

That's my concern as well.

> We could move wake_up_klogd() back to console_unlock(). But it might
> end up with a back-and-forth games according to who is currently
> complaining.

We still don't need irq_work, tho.

If we can do
printk()->console_unlock()->up()->try_to_wake_up()
then we can also do
printk()   -> try_to_wake_up()

It's LOGLEVEL_SCHED which tells us if we can try_to_wake_up()
or cannot.

> Sigh, I still suggest to ratelimit the warning about failed
> allocation.

Hard to imagine how many printk()-s we will have to ratelimit.
To imagine NET maintainers being OK with this is even harder.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-07 Thread Tetsuo Handa
On 2019/09/04 17:25, Michal Hocko wrote:
> On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
>> On (09/04/19 15:41), Sergey Senozhatsky wrote:
>>> But the thing is different in case of dump_stack() + show_mem() +
>>> some other output. Because now we ratelimit not a single printk() line,
>>> but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
>>> (IOW, now we talk about thousands of lines).
>>
>> And on devices with slow serial consoles this can be somewhat close to
>> "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
>> Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
>> lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
>> then we have a growing number of pending logbuf messages.
> 
> Yes, ratelimit is problematic when the ratelimited operation is slow. I
> guess that is a well known problem and we would need to rework both the
> api and the implementation to make it work in those cases as well.
> Essentially we need to make the ratelimit act as a gatekeeper to an
> operation section - something like a critical section except you can
> tolerate more code executions but not too many. So effectively
> 
>   start_throttle(rate, number);
>   /* here goes your operation */
>   end_throttle();
> 
> one operation is not considered done until the whole section ends.
> Or something along those lines.

Regarding OOM killer which is serialized by oom_lock mutex, I proposed
"mm, oom: More aggressively ratelimit dump_header()." at
https://lkml.kernel.org/r/1550325895-9291-2-git-send-email-penguin-ker...@i-love.sakura.ne.jp
but it was ignored.

> 
> In this particular case we can increase the rate limit parameters of
> course but I think that longterm we need a better api.
> 

I proposed "printk: Introduce "store now but print later" prefix." at
https://lkml.kernel.org/r/1550896930-12324-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
but it was not accepted.

But I think that more better solution for warn_alloc() is to defer printing
 from allocating context (which may not be allowed to sleep) to WQ context.
I proposed "mm,oom: Defer dump_tasks() output." at
https://lkml.kernel.org/r/7de2310d-afbd-e616-e83a-d75103b98...@i-love.sakura.ne.jp
and answered to Michal's concerns. I hope we can apply the same thing for
warn_alloc() and show_mem(). Then, we can minimize latency for both
"memory allocation failures" and "OOM killer invocations".



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-06 Thread Qian Cai
On Fri, 2019-09-06 at 13:32 +0900, Sergey Senozhatsky wrote:
> On (09/05/19 12:03), Qian Cai wrote:
> > > ---
> > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > index cd51aa7d08a9..89cb47882254 100644
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int 
> > > level,
> > >   pending_output = (curr_log_seq != log_next_seq);
> > >   logbuf_unlock_irqrestore(flags);
> > >  
> > > + if (!pending_output)
> > > + return printed_len;
> > > +
> > >   /* If called from the scheduler, we can not call up(). */
> > > - if (!in_sched && pending_output) {
> > > + if (!in_sched) {
> > >   /*
> > >    * Disable preemption to avoid being preempted while holding
> > >    * console_sem which would prevent anyone from printing to
> > > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int 
> > > level,
> > >   if (console_trylock_spinning())
> > >   console_unlock();
> > >   preempt_enable();
> > > - }
> > >  
> > > - if (pending_output)
> > > + wake_up_interruptible(&log_wait);
> > > + } else {
> > >   wake_up_klogd();
> > > + }
> > >   return printed_len;
> > >  }
> > >  EXPORT_SYMBOL(vprintk_emit);
> > > ---
> 
> Qian Cai, any chance you can test that patch?

So far as good, but it is hard to tell if this really nail the issue down. I'll
leave it running over the weekend and report back if it occurs again.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-06 Thread Sergey Senozhatsky
On (09/06/19 16:55), Petr Mladek wrote:
> > I think we can queue significantly much less irq_work-s from printk().
> > 
> > Petr, Steven, what do you think?
> > 
> > Something like this. Call wake_up_interruptible(), switch to
> > wake_up_klogd() only when called from sched code.
> 
> Replacing irq_work_queue() with wake_up_interruptible() looks
> dangerous to me.
> 
> As a result, all "normal" printk() calls from the scheduler
> code will deadlock. There is almost always a userspace
> logger registered.

I don't see why all printk()-s should deadlock.

A "normal" printk() call will deadlock only when scheduler calls
"normal" printk() under rq or pi locks. But this is illegal anyway,
because console_sem up() calls wake_up_process() - the same function
wake_up_interruptible() calls. IOW "normal" printk() calls from
scheduler end up in scheduler, via console_sem->sched chain. We
already execute wake_up_process()->try_to_wake_up() in printk(),
even when a non-LOGLEVEL_SCHED printk() comes from scheduler.

What am I missing something?

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-06 Thread Petr Mladek
On Fri 2019-09-06 12:39:00, Sergey Senozhatsky wrote:
> On (09/05/19 13:23), Steven Rostedt wrote:
> > > I think we can queue significantly much less irq_work-s from printk().
> > > 
> > > Petr, Steven, what do you think?
> 
> [..]
> > I mean, really, do we need to keep calling wake up if it
> > probably never even executed?
> 
> I guess ratelimiting you are talking about ("if it probably never even
> executed") would be to check if we have already called wake up on the
> log_wait ->head. For that we need to, at least, take log_wait spin_lock
> and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
> but not exactly) close to what wake_up_interruptible() does - it doesn't
> wake up the same task twice, it bails out on `p->state & state' check.

I have just realized that only sleeping tasks are in the waitqueue.
It is already handled by waitqueue_active() check.

I am afraid that we could not ratelimit the wakeups. The userspace
loggers might then miss the last lines for a long.

We could move wake_up_klogd() back to console_unlock(). But it might
end up with a back-and-forth games according to who is currently
complaining.

Sigh, I still suggest to ratelimit the warning about failed
allocation.

Best Regards,
Petr


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-06 Thread Petr Mladek
On Thu 2019-09-05 20:32:08, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for 
> > other
> > unknown allocations that happen while processing softirqs, but it is still 
> > up to
> > the air if it fixes it 100% in all situations as printk() is going to take 
> > more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?
> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.

Replacing irq_work_queue() with wake_up_interruptible() looks
dangerous to me.

As a result, all "normal" printk() calls from the scheduler
code will deadlock. There is almost always a userspace
logger registered.

By "normal" I mean anything that is not printk_deferred(). For
example, any WARN() from sheduler will cause a deadlock.
We will not even have chance to catch these problems in
advance by lockdep.

The difference is that console_unlock() calls wake_up_process()
only when there is a waiter. And the hard console_lock() is not
called that often.


Honestly, scheduling IRQ looks like the most lightweight and reliable
solution for offloading. We are in big troubles if we could not use
it in printk() code.

IMHO, the best solution is to ratelimit the warnings about the
allocation failures. It does not make sense to repeat the same
warning again and again. We might need a better ratelimiting API
if the current one is not reliable.

Best Regards,
Petr


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Sergey Senozhatsky
On (09/05/19 12:03), Qian Cai wrote:
> > ---
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index cd51aa7d08a9..89cb47882254 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >     pending_output = (curr_log_seq != log_next_seq);
> >     logbuf_unlock_irqrestore(flags);
> >  
> > +   if (!pending_output)
> > +   return printed_len;
> > +
> >     /* If called from the scheduler, we can not call up(). */
> > -   if (!in_sched && pending_output) {
> > +   if (!in_sched) {
> >     /*
> >      * Disable preemption to avoid being preempted while holding
> >      * console_sem which would prevent anyone from printing to
> > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >     if (console_trylock_spinning())
> >     console_unlock();
> >     preempt_enable();
> > -   }
> >  
> > -   if (pending_output)
> > +   wake_up_interruptible(&log_wait);
> > +   } else {
> >     wake_up_klogd();
> > +   }
> >     return printed_len;
> >  }
> >  EXPORT_SYMBOL(vprintk_emit);
> > ---

Qian Cai, any chance you can test that patch?

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Sergey Senozhatsky
On (09/05/19 13:23), Steven Rostedt wrote:
> > I think we can queue significantly much less irq_work-s from printk().
> > 
> > Petr, Steven, what do you think?

[..]
> I mean, really, do we need to keep calling wake up if it
> probably never even executed?

I guess ratelimiting you are talking about ("if it probably never even
executed") would be to check if we have already called wake up on the
log_wait ->head. For that we need to, at least, take log_wait spin_lock
and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
but not exactly) close to what wake_up_interruptible() does - it doesn't
wake up the same task twice, it bails out on `p->state & state' check.

Or did I miss something?

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Sergey Senozhatsky
On (09/05/19 13:14), Steven Rostedt wrote:
> > Hmm, from the article,
> > 
> > https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter
> > 
> > "Since transmission of a single or multiple characters may take a long time
> > relative to CPU speeds, a UART maintains a flag showing busy status so that 
> > the
> > host system knows if there is at least one character in the transmit buffer 
> > or
> > shift register; "ready for next character(s)" may also be signaled with an
> > interrupt."
> 
> I'm pretty sure all serial consoles do a busy loop on the UART and not
> use interrupts to notify when it's available.

Yes. Besides, we call console drivers with local IRQs disabled.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Steven Rostedt
On Thu, 5 Sep 2019 20:32:08 +0900
Sergey Senozhatsky  wrote:

> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?

What if we just rate limit the wake ups of klogd? I mean, really, do we
need to keep calling wake up if it probably never even executed?

-- Steve


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Steven Rostedt
On Thu, 05 Sep 2019 12:03:13 -0400
Qian Cai  wrote:

> > > and could deal with console hardware that involve irq_exit() anyway.  
> > 
> > printk->console_driver->write() does not involve irq.  
> 
> Hmm, from the article,
> 
> https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter
> 
> "Since transmission of a single or multiple characters may take a long time
> relative to CPU speeds, a UART maintains a flag showing busy status so that 
> the
> host system knows if there is at least one character in the transmit buffer or
> shift register; "ready for next character(s)" may also be signaled with an
> interrupt."

I'm pretty sure all serial consoles do a busy loop on the UART and not
use interrupts to notify when it's available. That would require an
asynchronous implementation of printk() which would be quite complex to
implement.

-- Steve


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Qian Cai
On Thu, 2019-09-05 at 20:32 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for
> > other
> > unknown allocations that happen while processing softirqs, but it is still
> > up to
> > the air if it fixes it 100% in all situations as printk() is going to take
> > more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?
> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.
> 
> ---
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cd51aa7d08a9..89cb47882254 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>   pending_output = (curr_log_seq != log_next_seq);
>   logbuf_unlock_irqrestore(flags);
>  
> + if (!pending_output)
> + return printed_len;
> +
>   /* If called from the scheduler, we can not call up(). */
> - if (!in_sched && pending_output) {
> + if (!in_sched) {
>   /*
>    * Disable preemption to avoid being preempted while holding
>    * console_sem which would prevent anyone from printing to
> @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>   if (console_trylock_spinning())
>   console_unlock();
>   preempt_enable();
> - }
>  
> - if (pending_output)
> + wake_up_interruptible(&log_wait);
> + } else {
>   wake_up_klogd();
> + }
>   return printed_len;
>  }
>  EXPORT_SYMBOL(vprintk_emit);
> ---
> 
> > and could deal with console hardware that involve irq_exit() anyway.
> 
> printk->console_driver->write() does not involve irq.

Hmm, from the article,

https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter

"Since transmission of a single or multiple characters may take a long time
relative to CPU speeds, a UART maintains a flag showing busy status so that the
host system knows if there is at least one character in the transmit buffer or
shift register; "ready for next character(s)" may also be signaled with an
interrupt."


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/5/19 4:09 PM, Qian Cai wrote:

> Instead of repeatedly make generalize statements, could you enlighten me with
> some concrete examples that have the similar properties which would trigger a
> livelock,
> 
> - guaranteed GFP_ATOMIC allocations when processing softirq batches.
> - the allocation has a fallback mechanism that is unnecessary to warn a 
> failure.
> 
> I thought "skb" is a special-case here as every packet sent or received is
> handled using this data structure.
>

Just  'git grep GFP_ATOMIC -- net' and carefully study all the places.

You will discover many allocations done for incoming packets.

All of them can fail and trigger a trace.

Please fix the problem for good, do not pretend addressing the skb allocations
will solve it.

The skb allocation can succeed, then the following allocation might fail.

skb are one of the many objects that networking need to allocate dynamically.



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/5/19 4:09 PM, Qian Cai wrote:
> 
> I feel like you may not follow the thread closely. There are more details
> uncovered in the last few days and narrowed down to the culprits.
> 

I have followed the thread closely, thank you very much.

I am happy that the problem is addressed as I suggested.
Ie not individual patches adding selected __GFP_NOWARN.



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Qian Cai
On Thu, 2019-09-05 at 10:32 +0200, Eric Dumazet wrote:
> 
> On 9/4/19 10:42 PM, Qian Cai wrote:
> 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for
> > other
> > unknown allocations that happen while processing softirqs, but it is still
> > up to
> > the air if it fixes it 100% in all situations as printk() is going to take
> > more
> > time and could deal with console hardware that involve irq_exit() anyway.
> > 
> > On the other hand, adding __GPF_NOWARN in the build_skb() allocation will
> > fix
> > this known NET_TX_SOFTIRQ case which is common when softirqd involved at
> > least
> > in short-term. It even have a benefit to reduce the overall warn_alloc()
> > noise
> > out there.
> > 
> > I can resubmit with an update changelog. Does it make any sense?
> 
> It does not make sense.
> 
> We have thousands other GFP_ATOMIC allocations in the networking stacks.

Instead of repeatedly make generalize statements, could you enlighten me with
some concrete examples that have the similar properties which would trigger a
livelock,

- guaranteed GFP_ATOMIC allocations when processing softirq batches.
- the allocation has a fallback mechanism that is unnecessary to warn a failure.

I thought "skb" is a special-case here as every packet sent or received is
handled using this data structure.

> 
> Soon you will have to send more and more patches adding __GFP_NOWARN once
> your workloads/tests can hit all these various points.

I doubt so.

> 
> It is really time to fix this problem generically, instead of having
> to review hundreds of patches.
> 
> This was my initial feedback really, nothing really has changed since.

I feel like you may not follow the thread closely. There are more details
uncovered in the last few days and narrowed down to the culprits.

> 
> The ability to send a warning with a stack trace, holding the cpu
> for many milliseconds should not be decided case by case, otherwise
> every call points will decide to opt-out from the harmful warnings.

That is not really the reasons anymore why I asked to add a __GPF_NOWARN here.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Sergey Senozhatsky
On (09/04/19 16:42), Qian Cai wrote:
> > Let me think more.
> 
> To summary, those look to me are all good long-term improvement that would
> reduce the likelihood of this kind of livelock in general especially for other
> unknown allocations that happen while processing softirqs, but it is still up 
> to
> the air if it fixes it 100% in all situations as printk() is going to take 
> more
> time

Well. So. I guess that we don't need irq_work most of the time.

We need to queue irq_work for "safe" wake_up_interruptible(), when we
know that we can deadlock in scheduler. IOW, only when we are invoked
from the scheduler. Scheduler has printk_deferred(), which tells printk()
that it cannot do wake_up_interruptible(). Otherwise we can just use
normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
indirection. The parts of the scheduler, which by mistake call plain printk()
from under pi_lock or rq_lock have chances to deadlock anyway and should
be switched to printk_deferred().

I think we can queue significantly much less irq_work-s from printk().

Petr, Steven, what do you think?

Something like this. Call wake_up_interruptible(), switch to
wake_up_klogd() only when called from sched code.

---
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cd51aa7d08a9..89cb47882254 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
pending_output = (curr_log_seq != log_next_seq);
logbuf_unlock_irqrestore(flags);
 
+   if (!pending_output)
+   return printed_len;
+
/* If called from the scheduler, we can not call up(). */
-   if (!in_sched && pending_output) {
+   if (!in_sched) {
/*
 * Disable preemption to avoid being preempted while holding
 * console_sem which would prevent anyone from printing to
@@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
if (console_trylock_spinning())
console_unlock();
preempt_enable();
-   }
 
-   if (pending_output)
+   wake_up_interruptible(&log_wait);
+   } else {
wake_up_klogd();
+   }
return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
---

> and could deal with console hardware that involve irq_exit() anyway.

printk->console_driver->write() does not involve irq.

> On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
> this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
> in short-term. It even have a benefit to reduce the overall warn_alloc() noise
> out there.

That's not up to me to decide.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/4/19 10:42 PM, Qian Cai wrote:

> To summary, those look to me are all good long-term improvement that would
> reduce the likelihood of this kind of livelock in general especially for other
> unknown allocations that happen while processing softirqs, but it is still up 
> to
> the air if it fixes it 100% in all situations as printk() is going to take 
> more
> time and could deal with console hardware that involve irq_exit() anyway.
> 
> On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
> this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
> in short-term. It even have a benefit to reduce the overall warn_alloc() noise
> out there.
> 
> I can resubmit with an update changelog. Does it make any sense?

It does not make sense.

We have thousands other GFP_ATOMIC allocations in the networking stacks.

Soon you will have to send more and more patches adding __GFP_NOWARN once
your workloads/tests can hit all these various points.

It is really time to fix this problem generically, instead of having
to review hundreds of patches.

This was my initial feedback really, nothing really has changed since.

The ability to send a warning with a stack trace, holding the cpu
for many milliseconds should not be decided case by case, otherwise
every call points will decide to opt-out from the harmful warnings.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Qian Cai
On Wed, 2019-09-04 at 23:48 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 08:14), Qian Cai wrote:
> > > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > > pending irq_work only if there is a user-space process sleeping on
> > > log_wait and irq_work is not already scheduled. If the syslog is
> > > active or there is noone to wakeup then we don't queue irq_work.
> > 
> > Another possibility for this potential livelock is that those printk() from
> > warn_alloc(), dump_stack() and show_mem() increase the time it needs to
> > process
> > build_skb() allocation failures significantly under memory pressure. As the
> > result, ksoftirqd() could be rescheduled during that time via a different
> > CPU
> > (this is a large x86 NUMA system anyway),
> > 
> > [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> > [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> > [83605.577256][   C31]  kthread+0x1df/0x200
> > [83605.577256][   C31]  ret_from_fork+0x35/0x40
> 
> Hum hum hum...
> 
> So I can, _probably_, think of several patches.
> 
> First, move wake_up_klogd() back to console_unlock().
> 
> Second, move `printk_pending' out of per-CPU region and make it global.
> So we will have just one printk irq_work scheduled across all CPUs;
> currently we have one irq_work per CPU. I think I sent a patch a long
> long time ago, but we never discussed it, as far as I remember.
> 
> > In addition, those printk() will deal with console drivers or even a
> > networking
> > console, so it is probably not unusual that it could call irq_exit()-
> > __do_softirq() at one point and then this livelock.
> 
> Do you use netcon? Because this, theoretically, can open up one more
> vector. netcon allocates skbs from ->write() path. We call con drivers'
> ->write() from printk_safe context, so should netcon skb allocation
> warn we will scedule one more irq_work on that CPU to flush per-CPU
> printk_safe buffer.
> 
> If this is the case, then we can stop calling console_driver() under
> printk_safe. I sent a patch a while ago, but we agreed to keep the
> things the way they are, fot the time being.
> 
> Let me think more.

To summary, those look to me are all good long-term improvement that would
reduce the likelihood of this kind of livelock in general especially for other
unknown allocations that happen while processing softirqs, but it is still up to
the air if it fixes it 100% in all situations as printk() is going to take more
time and could deal with console hardware that involve irq_exit() anyway.

On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
in short-term. It even have a benefit to reduce the overall warn_alloc() noise
out there.

I can resubmit with an update changelog. Does it make any sense?


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Qian Cai
On Wed, 2019-09-04 at 23:48 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 08:14), Qian Cai wrote:
> > > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > > pending irq_work only if there is a user-space process sleeping on
> > > log_wait and irq_work is not already scheduled. If the syslog is
> > > active or there is noone to wakeup then we don't queue irq_work.
> > 
> > Another possibility for this potential livelock is that those printk() from
> > warn_alloc(), dump_stack() and show_mem() increase the time it needs to
> > process
> > build_skb() allocation failures significantly under memory pressure. As the
> > result, ksoftirqd() could be rescheduled during that time via a different
> > CPU
> > (this is a large x86 NUMA system anyway),
> > 
> > [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> > [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> > [83605.577256][   C31]  kthread+0x1df/0x200
> > [83605.577256][   C31]  ret_from_fork+0x35/0x40
> 
> Hum hum hum...
> 
> So I can, _probably_, think of several patches.
> 
> First, move wake_up_klogd() back to console_unlock().
> 
> Second, move `printk_pending' out of per-CPU region and make it global.
> So we will have just one printk irq_work scheduled across all CPUs;
> currently we have one irq_work per CPU. I think I sent a patch a long
> long time ago, but we never discussed it, as far as I remember.
> 
> > In addition, those printk() will deal with console drivers or even a
> > networking
> > console, so it is probably not unusual that it could call irq_exit()-
> > __do_softirq() at one point and then this livelock.
> 
> Do you use netcon? Because this, theoretically, can open up one more
> vector. netcon allocates skbs from ->write() path. We call con drivers'
> ->write() from printk_safe context, so should netcon skb allocation
> warn we will scedule one more irq_work on that CPU to flush per-CPU
> printk_safe buffer.

No, I don't use netcon. Just thought to mention it anyway since there could
other people use it.

> 
> If this is the case, then we can stop calling console_driver() under
> printk_safe. I sent a patch a while ago, but we agreed to keep the
> things the way they are, fot the time being.
> 
> Let me think more.
> 
>   -ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Sergey Senozhatsky
On (09/04/19 08:14), Qian Cai wrote:
> > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > pending irq_work only if there is a user-space process sleeping on
> > log_wait and irq_work is not already scheduled. If the syslog is
> > active or there is noone to wakeup then we don't queue irq_work.
> 
> Another possibility for this potential livelock is that those printk() from
> warn_alloc(), dump_stack() and show_mem() increase the time it needs to 
> process
> build_skb() allocation failures significantly under memory pressure. As the
> result, ksoftirqd() could be rescheduled during that time via a different CPU
> (this is a large x86 NUMA system anyway),
> 
> [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> [83605.577256][   C31]  kthread+0x1df/0x200
> [83605.577256][   C31]  ret_from_fork+0x35/0x40

Hum hum hum...

So I can, _probably_, think of several patches.

First, move wake_up_klogd() back to console_unlock().

Second, move `printk_pending' out of per-CPU region and make it global.
So we will have just one printk irq_work scheduled across all CPUs;
currently we have one irq_work per CPU. I think I sent a patch a long
long time ago, but we never discussed it, as far as I remember.

> In addition, those printk() will deal with console drivers or even a 
> networking
> console, so it is probably not unusual that it could call irq_exit()-
>__do_softirq() at one point and then this livelock.

Do you use netcon? Because this, theoretically, can open up one more
vector. netcon allocates skbs from ->write() path. We call con drivers'
->write() from printk_safe context, so should netcon skb allocation
warn we will scedule one more irq_work on that CPU to flush per-CPU
printk_safe buffer.

If this is the case, then we can stop calling console_driver() under
printk_safe. I sent a patch a while ago, but we agreed to keep the
things the way they are, fot the time being.

Let me think more.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Qian Cai
On Wed, 2019-09-04 at 14:07 +0200, Michal Hocko wrote:
> On Wed 04-09-19 07:59:17, Qian Cai wrote:
> > On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> > > On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > > > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > > > But the thing is different in case of dump_stack() + show_mem() +
> > > > > some other output. Because now we ratelimit not a single printk()
> > > > > line,
> > > > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5
> > > > > seconds
> > > > > (IOW, now we talk about thousands of lines).
> > > > 
> > > > And on devices with slow serial consoles this can be somewhat close to
> > > > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > > > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > > > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > > > then we have a growing number of pending logbuf messages.
> > > 
> > > Yes, ratelimit is problematic when the ratelimited operation is slow. I
> > > guess that is a well known problem and we would need to rework both the
> > > api and the implementation to make it work in those cases as well.
> > > Essentially we need to make the ratelimit act as a gatekeeper to an
> > > operation section - something like a critical section except you can
> > > tolerate more code executions but not too many. So effectively
> > > 
> > >   start_throttle(rate, number);
> > >   /* here goes your operation */
> > >   end_throttle();
> > > 
> > > one operation is not considered done until the whole section ends.
> > > Or something along those lines.
> > > 
> > > In this particular case we can increase the rate limit parameters of
> > > course but I think that longterm we need a better api.
> > 
> > The problem is when a system is under heavy memory pressure, everything is
> > becoming slower, so I don't know how to come up with a sane default for rate
> > limit parameters as a generic solution that would work for every machine out
> > there. Sure, it is possible to set a limit as low as possible that would
> > work
> > for the majority of systems apart from people may complain that they are now
> > missing important warnings, but using __GFP_NOWARN in this code would work
> > for
> > all systems. You could even argument there is even a separate benefit that
> > it
> > could reduce the noise-level overall from those build_skb() allocation
> > failures
> > as it has a fall-back mechanism anyway.
> 
> As Vlastimil already pointed out, __GFP_NOWARN would hide that reserves
> might be configured too low.

Tune "min_free_kbytes" is also an unreliable solution and situational as the
same reason mentioned previously. It may also need a lot of testing to find out
the right value of it on one particular system.

"
When there is a heavy memory pressure, the system is trying hard to reclaim
memory to fill up the watermark. However, the IO is slow to page out, but the
memory pressure keep draining atomic reservoir, and some of those skb_build()
will fail eventually.

Only if there is a fast IO, it will finish swapping sooner and then invoke the
OOM to end the memory pressure.
"

It also have a drawback that "waste" precious memory resources, as allocations
other than GPF_ATOMIC are unable to use those reserved memory anymore.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Qian Cai
On Wed, 2019-09-04 at 16:43 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 16:19), Sergey Senozhatsky wrote:
> > Hmm. I need to look at this more... wake_up_klogd() queues work only once
> > on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
> > 
> > bool irq_work_queue()
> > {
> > /* Only queue if not already pending */
> > if (!irq_work_claim(work))
> > return false;
> > 
> >  __irq_work_queue_local(work);
> > }
> 
> Plus one more check - waitqueue_active(&log_wait). printk() adds
> pending irq_work only if there is a user-space process sleeping on
> log_wait and irq_work is not already scheduled. If the syslog is
> active or there is noone to wakeup then we don't queue irq_work.

Another possibility for this potential livelock is that those printk() from
warn_alloc(), dump_stack() and show_mem() increase the time it needs to process
build_skb() allocation failures significantly under memory pressure. As the
result, ksoftirqd() could be rescheduled during that time via a different CPU
(this is a large x86 NUMA system anyway),

[83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
[83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
[83605.577256][   C31]  kthread+0x1df/0x200
[83605.577256][   C31]  ret_from_fork+0x35/0x40

In addition, those printk() will deal with console drivers or even a networking
console, so it is probably not unusual that it could call irq_exit()-
>__do_softirq() at one point and then this livelock.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Michal Hocko
On Wed 04-09-19 07:59:17, Qian Cai wrote:
> On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> > On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > > But the thing is different in case of dump_stack() + show_mem() +
> > > > some other output. Because now we ratelimit not a single printk() line,
> > > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 
> > > > seconds
> > > > (IOW, now we talk about thousands of lines).
> > > 
> > > And on devices with slow serial consoles this can be somewhat close to
> > > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > > then we have a growing number of pending logbuf messages.
> > 
> > Yes, ratelimit is problematic when the ratelimited operation is slow. I
> > guess that is a well known problem and we would need to rework both the
> > api and the implementation to make it work in those cases as well.
> > Essentially we need to make the ratelimit act as a gatekeeper to an
> > operation section - something like a critical section except you can
> > tolerate more code executions but not too many. So effectively
> > 
> > start_throttle(rate, number);
> > /* here goes your operation */
> > end_throttle();
> > 
> > one operation is not considered done until the whole section ends.
> > Or something along those lines.
> > 
> > In this particular case we can increase the rate limit parameters of
> > course but I think that longterm we need a better api.
> 
> The problem is when a system is under heavy memory pressure, everything is
> becoming slower, so I don't know how to come up with a sane default for rate
> limit parameters as a generic solution that would work for every machine out
> there. Sure, it is possible to set a limit as low as possible that would work
> for the majority of systems apart from people may complain that they are now
> missing important warnings, but using __GFP_NOWARN in this code would work for
> all systems. You could even argument there is even a separate benefit that it
> could reduce the noise-level overall from those build_skb() allocation 
> failures
> as it has a fall-back mechanism anyway.

As Vlastimil already pointed out, __GFP_NOWARN would hide that reserves
might be configured too low.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Qian Cai
On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > But the thing is different in case of dump_stack() + show_mem() +
> > > some other output. Because now we ratelimit not a single printk() line,
> > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> > > (IOW, now we talk about thousands of lines).
> > 
> > And on devices with slow serial consoles this can be somewhat close to
> > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > then we have a growing number of pending logbuf messages.
> 
> Yes, ratelimit is problematic when the ratelimited operation is slow. I
> guess that is a well known problem and we would need to rework both the
> api and the implementation to make it work in those cases as well.
> Essentially we need to make the ratelimit act as a gatekeeper to an
> operation section - something like a critical section except you can
> tolerate more code executions but not too many. So effectively
> 
>   start_throttle(rate, number);
>   /* here goes your operation */
>   end_throttle();
> 
> one operation is not considered done until the whole section ends.
> Or something along those lines.
> 
> In this particular case we can increase the rate limit parameters of
> course but I think that longterm we need a better api.

The problem is when a system is under heavy memory pressure, everything is
becoming slower, so I don't know how to come up with a sane default for rate
limit parameters as a generic solution that would work for every machine out
there. Sure, it is possible to set a limit as low as possible that would work
for the majority of systems apart from people may complain that they are now
missing important warnings, but using __GFP_NOWARN in this code would work for
all systems. You could even argument there is even a separate benefit that it
could reduce the noise-level overall from those build_skb() allocation failures
as it has a fall-back mechanism anyway.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Michal Hocko
On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > But the thing is different in case of dump_stack() + show_mem() +
> > some other output. Because now we ratelimit not a single printk() line,
> > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> > (IOW, now we talk about thousands of lines).
> 
> And on devices with slow serial consoles this can be somewhat close to
> "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> then we have a growing number of pending logbuf messages.

Yes, ratelimit is problematic when the ratelimited operation is slow. I
guess that is a well known problem and we would need to rework both the
api and the implementation to make it work in those cases as well.
Essentially we need to make the ratelimit act as a gatekeeper to an
operation section - something like a critical section except you can
tolerate more code executions but not too many. So effectively

start_throttle(rate, number);
/* here goes your operation */
end_throttle();

one operation is not considered done until the whole section ends.
Or something along those lines.

In this particular case we can increase the rate limit parameters of
course but I think that longterm we need a better api.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Sergey Senozhatsky
On (09/04/19 16:19), Sergey Senozhatsky wrote:
> Hmm. I need to look at this more... wake_up_klogd() queues work only once
> on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
> 
> bool irq_work_queue()
> {
>   /* Only queue if not already pending */
>   if (!irq_work_claim(work))
>   return false;
> 
>__irq_work_queue_local(work);
> }

Plus one more check - waitqueue_active(&log_wait). printk() adds
pending irq_work only if there is a user-space process sleeping on
log_wait and irq_work is not already scheduled. If the syslog is
active or there is noone to wakeup then we don't queue irq_work.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Sergey Senozhatsky
On (09/04/19 08:54), Michal Hocko wrote:
> I am sorry, I could have been more explicit when CCing you.

Oh, sorry! My bad!

> Sure the ratelimit is part of the problem. But I was more interested
> in the potential livelock (infinite loop) mentioned by Qian Cai. It
> is not important whether we generate one or more lines of output from
> the softirq context as long as the printk generates more irq processing
> which might end up doing the same. Is this really possible?

Hmm. I need to look at this more... wake_up_klogd() queues work only once
on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));

bool irq_work_queue()
{
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;

 __irq_work_queue_local(work);
}

softirqs are processed in batches, right? The softirq batch can add 
lines to printk logbuf, but there will be only one PRINTK_PENDING_WAKEUP
queued. Qian Cai mentioned that "net_rx_action softirqs again which are
plenty due to connected via ssh etc." so the proportion still seems to be
N:1 - we process N softirqs, add 1 printk irq_work.

But need to think more.
Interesting question.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-04 Thread Sergey Senozhatsky
On (09/04/19 15:41), Sergey Senozhatsky wrote:
> But the thing is different in case of dump_stack() + show_mem() +
> some other output. Because now we ratelimit not a single printk() line,
> but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> (IOW, now we talk about thousands of lines).

And on devices with slow serial consoles this can be somewhat close to
"no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
then we have a growing number of pending logbuf messages.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Michal Hocko
On Wed 04-09-19 15:41:44, Sergey Senozhatsky wrote:
> On (09/04/19 08:15), Michal Hocko wrote:
> > > If you look at the original report, the failed allocation dump_stack() is,
> > > 
> > >  
> > >  warn_alloc.cold.43+0x8a/0x148
> > >  __alloc_pages_nodemask+0x1a5c/0x1bb0
> > >  alloc_pages_current+0x9c/0x110
> > >  allocate_slab+0x34a/0x11f0
> > >  new_slab+0x46/0x70
> > >  ___slab_alloc+0x604/0x950
> > >  __slab_alloc+0x12/0x20
> > >  kmem_cache_alloc+0x32a/0x400
> > >  __build_skb+0x23/0x60
> > >  build_skb+0x1a/0xb0
> > >  igb_clean_rx_irq+0xafc/0x1010 [igb]
> > >  igb_poll+0x4bb/0xe30 [igb]
> > >  net_rx_action+0x244/0x7a0
> > >  __do_softirq+0x1a0/0x60a
> > >  irq_exit+0xb5/0xd0
> > >  do_IRQ+0x81/0x170
> > >  common_interrupt+0xf/0xf
> > >  
> > > 
> > > Since it has no __GFP_NOWARN to begin with, it will call,
> 
> I think that DEFAULT_RATELIMIT_INTERVAL and DEFAULT_RATELIMIT_BURST
> are good when we ratelimit just a single printk() call, so the ratelimit
> is "max 10 kernel log lines in 5 seconds".

I am sorry, I could have been more explicit when CCing you. Sure the
ratelimit is part of the problem. But I was more interested in the
potential livelock (infinite loop) mentioned by Qian Cai. It is not
important whether we generate one or more lines of output from the
softirq context as long as the printk generates more irq processing
which might end up doing the same. Is this really possible?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Sergey Senozhatsky
On (09/04/19 08:15), Michal Hocko wrote:
> > If you look at the original report, the failed allocation dump_stack() is,
> > 
> >  
> >  warn_alloc.cold.43+0x8a/0x148
> >  __alloc_pages_nodemask+0x1a5c/0x1bb0
> >  alloc_pages_current+0x9c/0x110
> >  allocate_slab+0x34a/0x11f0
> >  new_slab+0x46/0x70
> >  ___slab_alloc+0x604/0x950
> >  __slab_alloc+0x12/0x20
> >  kmem_cache_alloc+0x32a/0x400
> >  __build_skb+0x23/0x60
> >  build_skb+0x1a/0xb0
> >  igb_clean_rx_irq+0xafc/0x1010 [igb]
> >  igb_poll+0x4bb/0xe30 [igb]
> >  net_rx_action+0x244/0x7a0
> >  __do_softirq+0x1a0/0x60a
> >  irq_exit+0xb5/0xd0
> >  do_IRQ+0x81/0x170
> >  common_interrupt+0xf/0xf
> >  
> > 
> > Since it has no __GFP_NOWARN to begin with, it will call,

I think that DEFAULT_RATELIMIT_INTERVAL and DEFAULT_RATELIMIT_BURST
are good when we ratelimit just a single printk() call, so the ratelimit
is "max 10 kernel log lines in 5 seconds".

But the thing is different in case of dump_stack() + show_mem() +
some other output. Because now we ratelimit not a single printk() line,
but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
(IOW, now we talk about thousands of lines). Significantly more permissive
ratelimiting.

-ss


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Michal Hocko
> On Tue, 2019-09-03 at 20:53 +0200, Michal Hocko wrote:
> > On Tue 03-09-19 11:42:22, Qian Cai wrote:
> > > On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > > > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > > > If there is a risk of flooding the syslog, we should fix this
> > > > > generically
> > > > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > > > 
> > > > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > > > can think of a different parameters. Or maybe it is the ratelimiting
> > > > which doesn't work here. Hard to tell and something to explore.
> > > 
> > > The time-based ratelimit won't work for skb_build() as when a system under
> > > memory pressure, and the CPU is fast and IO is so slow, it could take a 
> > > long
> > > time to swap and trigger OOM.
> > 
> > I really do not understand what does OOM and swapping have to do with
> > the ratelimiting here. The sole purpose of the ratelimit is to reduce
> > the amount of warnings to be printed. Slow IO might have an effect on
> > when the OOM killer is invoked but atomic allocations are not directly
> > dependent on IO.
> 
> When there is a heavy memory pressure, the system is trying hard to reclaim
> memory to fill up the watermark. However, the IO is slow to page out, but the
> memory pressure keep draining atomic reservoir, and some of those skb_build()
> will fail eventually.

Yes this is true but this has nothing to do with the ratelimitted
warn_alloc AFAICS. It is natural that atomic allocations are going
to fail more likely under extreme memory pressure but we are talking
about an excessive amount of debugging output that is generated and
that should be throttled. And that's why we have ratelimit there. If it
doesn't work well then we should look into why.

> Only if there is a fast IO, it will finish swapping sooner and then invoke the
> OOM to end the memory pressure.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Michal Hocko
Cc printk maintainers

On Tue 03-09-19 17:42:28, Qian Cai wrote:
> > > I suppose what happens is those skb_build() allocations are from softirq,
> > > and
> > > once one of them failed, it calls printk() which generates more 
> > > interrupts.
> > > Hence, the infinite loop.
> > 
> > Please elaborate more.
> > 
> 
> If you look at the original report, the failed allocation dump_stack() is,
> 
>  
>  warn_alloc.cold.43+0x8a/0x148
>  __alloc_pages_nodemask+0x1a5c/0x1bb0
>  alloc_pages_current+0x9c/0x110
>  allocate_slab+0x34a/0x11f0
>  new_slab+0x46/0x70
>  ___slab_alloc+0x604/0x950
>  __slab_alloc+0x12/0x20
>  kmem_cache_alloc+0x32a/0x400
>  __build_skb+0x23/0x60
>  build_skb+0x1a/0xb0
>  igb_clean_rx_irq+0xafc/0x1010 [igb]
>  igb_poll+0x4bb/0xe30 [igb]
>  net_rx_action+0x244/0x7a0
>  __do_softirq+0x1a0/0x60a
>  irq_exit+0xb5/0xd0
>  do_IRQ+0x81/0x170
>  common_interrupt+0xf/0xf
>  
> 
> Since it has no __GFP_NOWARN to begin with, it will call,
> 
> printk
>   vprintk_default
> vprintk_emit
>   wake_up_klogd
> irq_work_queue
>   __irq_work_queue_local
> arch_irq_work_raise
>   apic->send_IPI_self(IRQ_WORK_VECTOR)
> smp_irq_work_interrupt
>   exiting_irq
> irq_exit
> 
> and end up processing pending net_rx_action softirqs again which are plenty 
> due
> to connected via ssh etc.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Qian Cai
On Tue, 2019-09-03 at 20:53 +0200, Michal Hocko wrote:
> On Tue 03-09-19 11:42:22, Qian Cai wrote:
> > On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > > If there is a risk of flooding the syslog, we should fix this
> > > > generically
> > > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > > 
> > > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > > can think of a different parameters. Or maybe it is the ratelimiting
> > > which doesn't work here. Hard to tell and something to explore.
> > 
> > The time-based ratelimit won't work for skb_build() as when a system under
> > memory pressure, and the CPU is fast and IO is so slow, it could take a long
> > time to swap and trigger OOM.
> 
> I really do not understand what does OOM and swapping have to do with
> the ratelimiting here. The sole purpose of the ratelimit is to reduce
> the amount of warnings to be printed. Slow IO might have an effect on
> when the OOM killer is invoked but atomic allocations are not directly
> dependent on IO.

When there is a heavy memory pressure, the system is trying hard to reclaim
memory to fill up the watermark. However, the IO is slow to page out, but the
memory pressure keep draining atomic reservoir, and some of those skb_build()
will fail eventually.

Only if there is a fast IO, it will finish swapping sooner and then invoke the
OOM to end the memory pressure.

> 
> > I suppose what happens is those skb_build() allocations are from softirq,
> > and
> > once one of them failed, it calls printk() which generates more interrupts.
> > Hence, the infinite loop.
> 
> Please elaborate more.
> 

If you look at the original report, the failed allocation dump_stack() is,

 
 warn_alloc.cold.43+0x8a/0x148
 __alloc_pages_nodemask+0x1a5c/0x1bb0
 alloc_pages_current+0x9c/0x110
 allocate_slab+0x34a/0x11f0
 new_slab+0x46/0x70
 ___slab_alloc+0x604/0x950
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x32a/0x400
 __build_skb+0x23/0x60
 build_skb+0x1a/0xb0
 igb_clean_rx_irq+0xafc/0x1010 [igb]
 igb_poll+0x4bb/0xe30 [igb]
 net_rx_action+0x244/0x7a0
 __do_softirq+0x1a0/0x60a
 irq_exit+0xb5/0xd0
 do_IRQ+0x81/0x170
 common_interrupt+0xf/0xf
 

Since it has no __GFP_NOWARN to begin with, it will call,

printk
  vprintk_default
vprintk_emit
  wake_up_klogd
irq_work_queue
  __irq_work_queue_local
arch_irq_work_raise
  apic->send_IPI_self(IRQ_WORK_VECTOR)
smp_irq_work_interrupt
  exiting_irq
irq_exit

and end up processing pending net_rx_action softirqs again which are plenty due
to connected via ssh etc.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Michal Hocko
On Tue 03-09-19 11:42:22, Qian Cai wrote:
> On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > If there is a risk of flooding the syslog, we should fix this generically
> > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > 
> > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > can think of a different parameters. Or maybe it is the ratelimiting
> > which doesn't work here. Hard to tell and something to explore.
> 
> The time-based ratelimit won't work for skb_build() as when a system under
> memory pressure, and the CPU is fast and IO is so slow, it could take a long
> time to swap and trigger OOM.

I really do not understand what does OOM and swapping have to do with
the ratelimiting here. The sole purpose of the ratelimit is to reduce
the amount of warnings to be printed. Slow IO might have an effect on
when the OOM killer is invoked but atomic allocations are not directly
dependent on IO.

> I suppose what happens is those skb_build() allocations are from softirq, and
> once one of them failed, it calls printk() which generates more interrupts.
> Hence, the infinite loop.

Please elaborate more.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Qian Cai
On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > If there is a risk of flooding the syslog, we should fix this generically
> > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> 
> We do already ratelimit in warn_alloc. If it isn't sufficient then we
> can think of a different parameters. Or maybe it is the ratelimiting
> which doesn't work here. Hard to tell and something to explore.

The time-based ratelimit won't work for skb_build() as when a system under
memory pressure, and the CPU is fast and IO is so slow, it could take a long
time to swap and trigger OOM.

I suppose what happens is those skb_build() allocations are from softirq, and
once one of them failed, it calls printk() which generates more interrupts.
Hence, the infinite loop.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-03 Thread Michal Hocko
On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> If there is a risk of flooding the syslog, we should fix this generically
> in mm layer, not adding hundred of __GFP_NOWARN all over the places.

We do already ratelimit in warn_alloc. If it isn't sufficient then we
can think of a different parameters. Or maybe it is the ratelimiting
which doesn't work here. Hard to tell and something to explore.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-02 Thread Vlastimil Babka
On 8/30/19 5:25 PM, Qian Cai wrote:
> On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
>>
>> On 8/30/19 4:57 PM, Qian Cai wrote:
>>> When running heavy memory pressure workloads, the system is throwing
>>> endless warnings below due to the allocation could fail from
>>> __build_skb(), and the volume of this call could be huge which may
>>> generate a lot of serial console output and cosumes all CPUs as
>>> warn_alloc() could be expensive by calling dump_stack() and then
>>> show_mem().
>>>
>>> Fix it by silencing the warning in this call site. Also, it seems
>>> unnecessary to even print a warning at all if the allocation failed in
>>> __build_skb(), as it may just retransmit the packet and retry.
>>>

Well, __GFP_NOWARN would save me from explaining this warning to users
many times. OTOH usually it's an indication that min_free_kbytes should
be raised to better cope with network traffic.

>>
>> Same patches are showing up there and there from time to time.
>>
>> Why is this particular spot interesting, against all others not adding
>> __GFP_NOWARN ?

This one is interesting that it's a GFP_ATOMIC allocation triggered by
incoming packets, and has a fallback mechanism. I don't recall other so
notoric ones.

>> Are we going to have hundred of patches adding __GFP_NOWARN at various 
>> points,
>> or should we get something generic to not flood the syslog in case of memory
>> pressure ?
>>
> 
> From my testing which uses LTP oom* tests. There are only 3 places need to be
> patched. The other two are in IOMMU code for both Intel and AMD. The place is
> particular interesting because it could cause the system with floating serial
> console output for days without making progress in OOM. I suppose it ends up 
> in
> a looping condition that warn_alloc() would end up generating more calls into
> __build_skb() via ksoftirqd.

Regardless of this particular allocation, if the reporting itself makes
the conditions so much worse, then at least some kind of general
ratelimit would make sense indeed.



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Qian Cai
On Fri, 2019-08-30 at 18:15 +0200, Eric Dumazet wrote:
> 
> On 8/30/19 5:25 PM, Qian Cai wrote:
> > On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> > > 
> > > On 8/30/19 4:57 PM, Qian Cai wrote:
> > > > When running heavy memory pressure workloads, the system is throwing
> > > > endless warnings below due to the allocation could fail from
> > > > __build_skb(), and the volume of this call could be huge which may
> > > > generate a lot of serial console output and cosumes all CPUs as
> > > > warn_alloc() could be expensive by calling dump_stack() and then
> > > > show_mem().
> > > > 
> > > > Fix it by silencing the warning in this call site. Also, it seems
> > > > unnecessary to even print a warning at all if the allocation failed in
> > > > __build_skb(), as it may just retransmit the packet and retry.
> > > > 
> > > 
> > > Same patches are showing up there and there from time to time.
> > > 
> > > Why is this particular spot interesting, against all others not adding
> > > __GFP_NOWARN ?
> > > 
> > > Are we going to have hundred of patches adding __GFP_NOWARN at various
> > > points,
> > > or should we get something generic to not flood the syslog in case of
> > > memory
> > > pressure ?
> > > 
> > 
> > From my testing which uses LTP oom* tests. There are only 3 places need to
> > be
> > patched. The other two are in IOMMU code for both Intel and AMD. The place
> > is
> > particular interesting because it could cause the system with floating
> > serial
> > console output for days without making progress in OOM. I suppose it ends up
> > in
> > a looping condition that warn_alloc() would end up generating more calls
> > into
> > __build_skb() via ksoftirqd.
> > 
> 
> Yes, but what about other tests done by other people ?

Sigh, I don't know what tests do you have in mind. I tried many memory pressure
tests including LTP, stress-ng, and mmtests etc running for years. I don't
recall see other places that could loop like this for days.

> 
> You do not really answer my last question, which was really the point I tried
> to make.
> 
> If there is a risk of flooding the syslog, we should fix this generically
> in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> 
> Maybe just make __GFP_NOWARN the default, I dunno.

I don't really see how it could end up with adding hundred of _GFP_NOWARN in the
kernel code. If there is really a hundred places could loop like this, it may
make more sense looking into a general solution.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Eric Dumazet



On 8/30/19 5:25 PM, Qian Cai wrote:
> On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
>>
>> On 8/30/19 4:57 PM, Qian Cai wrote:
>>> When running heavy memory pressure workloads, the system is throwing
>>> endless warnings below due to the allocation could fail from
>>> __build_skb(), and the volume of this call could be huge which may
>>> generate a lot of serial console output and cosumes all CPUs as
>>> warn_alloc() could be expensive by calling dump_stack() and then
>>> show_mem().
>>>
>>> Fix it by silencing the warning in this call site. Also, it seems
>>> unnecessary to even print a warning at all if the allocation failed in
>>> __build_skb(), as it may just retransmit the packet and retry.
>>>
>>
>> Same patches are showing up there and there from time to time.
>>
>> Why is this particular spot interesting, against all others not adding
>> __GFP_NOWARN ?
>>
>> Are we going to have hundred of patches adding __GFP_NOWARN at various 
>> points,
>> or should we get something generic to not flood the syslog in case of memory
>> pressure ?
>>
> 
> From my testing which uses LTP oom* tests. There are only 3 places need to be
> patched. The other two are in IOMMU code for both Intel and AMD. The place is
> particular interesting because it could cause the system with floating serial
> console output for days without making progress in OOM. I suppose it ends up 
> in
> a looping condition that warn_alloc() would end up generating more calls into
> __build_skb() via ksoftirqd.
> 

Yes, but what about other tests done by other people ?

You do not really answer my last question, which was really the point I tried
to make.

If there is a risk of flooding the syslog, we should fix this generically
in mm layer, not adding hundred of __GFP_NOWARN all over the places.

Maybe just make __GFP_NOWARN the default, I dunno.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Qian Cai
On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> 
> On 8/30/19 4:57 PM, Qian Cai wrote:
> > When running heavy memory pressure workloads, the system is throwing
> > endless warnings below due to the allocation could fail from
> > __build_skb(), and the volume of this call could be huge which may
> > generate a lot of serial console output and cosumes all CPUs as
> > warn_alloc() could be expensive by calling dump_stack() and then
> > show_mem().
> > 
> > Fix it by silencing the warning in this call site. Also, it seems
> > unnecessary to even print a warning at all if the allocation failed in
> > __build_skb(), as it may just retransmit the packet and retry.
> > 
> 
> Same patches are showing up there and there from time to time.
> 
> Why is this particular spot interesting, against all others not adding
> __GFP_NOWARN ?
> 
> Are we going to have hundred of patches adding __GFP_NOWARN at various points,
> or should we get something generic to not flood the syslog in case of memory
> pressure ?
> 

>From my testing which uses LTP oom* tests. There are only 3 places need to be
patched. The other two are in IOMMU code for both Intel and AMD. The place is
particular interesting because it could cause the system with floating serial
console output for days without making progress in OOM. I suppose it ends up in
a looping condition that warn_alloc() would end up generating more calls into
__build_skb() via ksoftirqd.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Eric Dumazet



On 8/30/19 4:57 PM, Qian Cai wrote:
> When running heavy memory pressure workloads, the system is throwing
> endless warnings below due to the allocation could fail from
> __build_skb(), and the volume of this call could be huge which may
> generate a lot of serial console output and cosumes all CPUs as
> warn_alloc() could be expensive by calling dump_stack() and then
> show_mem().
> 
> Fix it by silencing the warning in this call site. Also, it seems
> unnecessary to even print a warning at all if the allocation failed in
> __build_skb(), as it may just retransmit the packet and retry.
> 

Same patches are showing up there and there from time to time.

Why is this particular spot interesting, against all others not adding 
__GFP_NOWARN ?

Are we going to have hundred of patches adding __GFP_NOWARN at various points,
or should we get something generic to not flood the syslog in case of memory 
pressure ?