Re: Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-06 Thread SeongJae Park
On Wed, 6 May 2020 07:41:51 -0700 "Paul E. McKenney"  wrote:

> On Wed, May 06, 2020 at 02:59:26PM +0200, SeongJae Park wrote:
> > TL;DR: It was not the kernel's fault, but the benchmark program's.
> > 
> > So, the problem is reproducible only with lebench[1].  I carefully read
> > its code again.
> > 
> > Before running the problematic "poll big" sub-test, lebench executes the
> > "context switch" sub-test.  For that test, it sets its own cpu affinity[2]
> > and process priority[3] to '0' and '-20', respectively.  However, it
> > doesn't restore the original values even after the "context switch" test
> > finishes.  For that reason, the "select big" sub-test also runs bound to
> > CPU 0 with the lowest nice value.  Therefore, it can disturb the RCU
> > callback thread for CPU 0, which processes the deferred deallocations of
> > the sockets, and as a result it triggers the OOM.
> > 
> > We confirmed the problem disappears by offloading the RCU callbacks from
> > CPU 0 using the rcu_nocbs=0 boot parameter, or simply by restoring the
> > affinity and/or priority.
> > 
> > Someone _might_ still argue that this is a kernel problem because the
> > problem didn't occur on the old kernels prior to Al's patches.  However,
> > setting the affinity and priority was possible only because the program
> > had the permission to do so.  Therefore, it would be reasonable to blame
> > the system administrators rather than the kernel.
> > 
> > So, please ignore this patchset; apologies for the confusion.  If you
> > still have any doubts or need more tests, please let me know.
> > 
> > [1] https://github.com/LinuxPerfStudy/LEBench
> > [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> > [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
> 
> Thank you for chasing this down!
> 
> I have had this sort of thing on my list as a potential issue, but given
> that it is now really showing up, it sounds like it is time to bump
> up its priority a bit.  Of course there are limits, so if userspace is
> running at any of the real-time priorities, making sufficient CPU time
> available to RCU's kthreads becomes userspace's responsibility.  But if
> everything is running at SCHED_OTHER (which is the case here, correct?),

Correct.

> then it is reasonable for RCU to do some work to avoid this situation.

That would also be great!

> 
> But still, yes, the immediate job is fixing the benchmark.  ;-)

Totally agreed.

> 
>   Thanx, Paul
> 
> PS.  Why not just attack all potential issues on my list?  Because I
>  usually learn quite a bit from seeing the problem actually happen.
>  And sometimes other changes in RCU eliminate the potential issue
>  before it has a chance to happen.

Sounds interesting, I will try some of those in my spare time ;)


Thanks,
SeongJae Park


Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-06 Thread Paul E. McKenney
On Wed, May 06, 2020 at 02:59:26PM +0200, SeongJae Park wrote:
> TL;DR: It was not the kernel's fault, but the benchmark program's.
> 
> So, the problem is reproducible only with lebench[1].  I carefully read
> its code again.
> 
> Before running the problematic "poll big" sub-test, lebench executes the
> "context switch" sub-test.  For that test, it sets its own cpu affinity[2]
> and process priority[3] to '0' and '-20', respectively.  However, it
> doesn't restore the original values even after the "context switch" test
> finishes.  For that reason, the "select big" sub-test also runs bound to
> CPU 0 with the lowest nice value.  Therefore, it can disturb the RCU
> callback thread for CPU 0, which processes the deferred deallocations of
> the sockets, and as a result it triggers the OOM.
> 
> We confirmed the problem disappears by offloading the RCU callbacks from
> CPU 0 using the rcu_nocbs=0 boot parameter, or simply by restoring the
> affinity and/or priority.
> 
> Someone _might_ still argue that this is a kernel problem because the
> problem didn't occur on the old kernels prior to Al's patches.  However,
> setting the affinity and priority was possible only because the program
> had the permission to do so.  Therefore, it would be reasonable to blame
> the system administrators rather than the kernel.
> 
> So, please ignore this patchset; apologies for the confusion.  If you
> still have any doubts or need more tests, please let me know.
> 
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822

Thank you for chasing this down!

I have had this sort of thing on my list as a potential issue, but given
that it is now really showing up, it sounds like it is time to bump
up its priority a bit.  Of course there are limits, so if userspace is
running at any of the real-time priorities, making sufficient CPU time
available to RCU's kthreads becomes userspace's responsibility.  But if
everything is running at SCHED_OTHER (which is the case here, correct?),
then it is reasonable for RCU to do some work to avoid this situation.

But still, yes, the immediate job is fixing the benchmark.  ;-)

Thanx, Paul

PS.  Why not just attack all potential issues on my list?  Because I
 usually learn quite a bit from seeing the problem actually happen.
 And sometimes other changes in RCU eliminate the potential issue
 before it has a chance to happen.


Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-06 Thread Eric Dumazet
On Wed, May 6, 2020 at 5:59 AM SeongJae Park  wrote:
>
> TL;DR: It was not the kernel's fault, but the benchmark program's.
>
> So, the problem is reproducible only with lebench[1].  I carefully read
> its code again.
>
> Before running the problematic "poll big" sub-test, lebench executes the
> "context switch" sub-test.  For that test, it sets its own cpu affinity[2]
> and process priority[3] to '0' and '-20', respectively.  However, it
> doesn't restore the original values even after the "context switch" test
> finishes.  For that reason, the "select big" sub-test also runs bound to
> CPU 0 with the lowest nice value.  Therefore, it can disturb the RCU
> callback thread for CPU 0, which processes the deferred deallocations of
> the sockets, and as a result it triggers the OOM.
>
> We confirmed the problem disappears by offloading the RCU callbacks from
> CPU 0 using the rcu_nocbs=0 boot parameter, or simply by restoring the
> affinity and/or priority.
>
> Someone _might_ still argue that this is a kernel problem because the
> problem didn't occur on the old kernels prior to Al's patches.  However,
> setting the affinity and priority was possible only because the program
> had the permission to do so.  Therefore, it would be reasonable to blame
> the system administrators rather than the kernel.
>
> So, please ignore this patchset; apologies for the confusion.  If you
> still have any doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park

No harm done, thanks for running more tests and root-causing the issue !


Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-06 Thread SeongJae Park
TL;DR: It was not the kernel's fault, but the benchmark program's.

So, the problem is reproducible only with lebench[1].  I carefully read
its code again.

Before running the problematic "poll big" sub-test, lebench executes the
"context switch" sub-test.  For that test, it sets its own cpu affinity[2]
and process priority[3] to '0' and '-20', respectively.  However, it
doesn't restore the original values even after the "context switch" test
finishes.  For that reason, the "select big" sub-test also runs bound to
CPU 0 with the lowest nice value.  Therefore, it can disturb the RCU
callback thread for CPU 0, which processes the deferred deallocations of
the sockets, and as a result it triggers the OOM.

We confirmed the problem disappears by offloading the RCU callbacks from
CPU 0 using the rcu_nocbs=0 boot parameter, or simply by restoring the
affinity and/or priority.
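
To illustrate the benchmark-side fix, below is a minimal userspace sketch of
saving and restoring the affinity and nice value around a pinned sub-test.
The wrapper and sub-test names are hypothetical, not the actual LEBench code;
it only shows the missing "restore" step described above.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

/* Hypothetical wrapper: pin to CPU 0 at nice -20 for one sub-test, as the
 * "context switch" test does, then restore the original settings so later
 * sub-tests no longer compete with CPU 0's RCU callback processing. */
static int run_pinned_subtest(void (*subtest)(void))
{
	cpu_set_t old_mask, pinned;
	int old_nice;

	/* Save the current affinity and nice value. */
	if (sched_getaffinity(0, sizeof(old_mask), &old_mask) != 0)
		return -1;
	errno = 0;
	old_nice = getpriority(PRIO_PROCESS, 0);
	if (old_nice == -1 && errno != 0)
		return -1;

	/* What the benchmark does before the "context switch" sub-test. */
	CPU_ZERO(&pinned);
	CPU_SET(0, &pinned);
	sched_setaffinity(0, sizeof(pinned), &pinned);
	setpriority(PRIO_PROCESS, 0, -20);

	subtest();

	/* The missing step: undo the pinning and priority boost. */
	setpriority(PRIO_PROCESS, 0, old_nice);
	sched_setaffinity(0, sizeof(old_mask), &old_mask);
	return 0;
}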

Someone _might_ still argue that this is a kernel problem because the
problem didn't occur on the old kernels prior to Al's patches.  However,
setting the affinity and priority was possible only because the program
had the permission to do so.  Therefore, it would be reasonable to blame
the system administrators rather than the kernel.

So, please ignore this patchset; apologies for the confusion.  If you
still have any doubts or need more tests, please let me know.

[1] https://github.com/LinuxPerfStudy/LEBench
[2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
[3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822


Thanks,
SeongJae Park


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 05 May 2020 12:00:49 -0700 (PDT) David Miller wrote:

> From: David Miller 
> Date: Tue, 05 May 2020 11:48:25 -0700 (PDT)
> 
> > Series applied and queued up for -stable, thanks.
> 
> Nevermind, this doesn't even compile.
> 
> net/smc/af_smc.c: In function ‘smc_switch_to_fallback’:
> net/smc/af_smc.c:473:19: error: ‘smc->clcsock->wq’ is a pointer; did you mean to use ‘->’?
>   473 |   smc->clcsock->wq.fasync_list =
>   |   ^
>   |   ->
> net/smc/af_smc.c:474:25: error: ‘smc->sk.sk_socket->wq’ is a pointer; did you mean to use ‘->’?
>   474 |smc->sk.sk_socket->wq.fasync_list;
>   | ^
>   | ->
> 
> So I had to revert these changes.
> 
> When you make a change of this magnitude and scope you must do an
> allmodconfig build.

Definitely my fault.  I will fix this in the next spin.


Thanks,
SeongJae Park

> 
> Thank you.


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread David Miller
From: David Miller 
Date: Tue, 05 May 2020 11:48:25 -0700 (PDT)

> Series applied and queued up for -stable, thanks.

Nevermind, this doesn't even compile.

net/smc/af_smc.c: In function ‘smc_switch_to_fallback’:
net/smc/af_smc.c:473:19: error: ‘smc->clcsock->wq’ is a pointer; did you mean to use ‘->’?
  473 |   smc->clcsock->wq.fasync_list =
  |   ^
  |   ->
net/smc/af_smc.c:474:25: error: ‘smc->sk.sk_socket->wq’ is a pointer; did you mean to use ‘->’?
  474 |smc->sk.sk_socket->wq.fasync_list;
  | ^
  | ->

So I had to revert these changes.

When you make a change of this magnitude and scope you must do an
allmodconfig build.

Thank you.
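
(For reference, the diagnostics point at the member access: with the revert,
'wq' is a pointer again, so the assignments in smc_switch_to_fallback()
presumably need '->' instead of '.', along the lines of:

	smc->clcsock->wq->fasync_list = smc->sk.sk_socket->wq->fasync_list;

This is the obvious one-line adjustment, not a tested patch.)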


Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 08:34:02PM +0200, SeongJae Park wrote:
> On Tue, 5 May 2020 11:17:07 -0700 "Paul E. McKenney"  
> wrote:
> 
> > On Tue, May 05, 2020 at 07:56:05PM +0200, SeongJae Park wrote:
> > > On Tue, 5 May 2020 10:30:36 -0700 "Paul E. McKenney"  
> > > wrote:
> > > 
> > > > On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
> > > > > On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet 
> > > > >  wrote:
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > > > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet 
> > > > > > >>>  wrote:
> > > > > > >>>
> > > > > >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park 
> > > > > >   wrote:
> > > > > > >
> > > > > > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > > > > >  wrote:
> > > > > > >
> > > > > > >>
> > > > > > >>
> > > > > > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > > > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > > > > >>>  wrote:
> > > > > > >>>
> > > > > > >>
> > > > > [...]
> > > > > > >>
> > > > > > >> I would ask Paul's opinion on this issue, because we have many
> > > > > > >> objects being freed after RCU grace periods.
> > > > > > >>
> > > > > > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > > > > > >> also suffer.
> > > > > > >>
> > > > > > >> Sure, we can revert patches here and there trying to work around
> > > > > > >> the issue, but for objects allocated from process context, we
> > > > > > >> should not have these problems.
> > > > > > >>
> > > > > > > 
> > > > > > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 
> > > > > > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > > > > > >  100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > > > > > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > > > > > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > > > > > >  static bool rcu_kick_kthreads;
> > > > > > > -static int rcu_divisor = 7;
> > > > > > > +static int rcu_divisor = 6;
> > > > > > >  module_param(rcu_divisor, int, 0644);
> > > > > > >  
> > > > > > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > > > > > 
> > > > > > 
> > > > > > To be clear, you can adjust the value without building a new kernel.
> > > > > > 
> > > > > > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> > > > > 
> > > > > I tried values 6, 5, and 4, but none of those removed the problem.
> > > > 
> > > > Thank you for checking this!
> > > > 
> > > > Was your earlier discussion on long RCU readers speculation, or do you
> > > > have measurements?
> > > 
> > > It was just a guess without any measurement or dedicated investigation.
> > 
> > OK, another thing to check is the duration of the low-memory episode.
> > Does this duration exceed the RCU CPU stall warning time?  (21 seconds
> > in mainline, 60 in many distros, but check rcupdate.rcu_cpu_stall_timeout
> > to be sure.)
> 
> The benchmark takes about 36 seconds for 10,000 repeats of the test.
> 
> The value on the test machine is 60.
> 
> So the duration would not have exceeded the warning time, and therefore I
> haven't seen the warning message.
> 
> As mentioned in the other mail, I will also adjust this value to a shorter one.

Sounds good, thank you!

> > Also, any chance of a .config?  Or at least the RCU portions?  I am
> > guessing CONFIG_PREEMPT=n, for example.
> 
> I guess this would be ok.
> 
> # CONFIG_PREEMPT is not set
> 
> #
> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> CONFIG_RCU_EXPERT=y
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_FANOUT=64
> CONFIG_RCU_FANOUT_LEAF=16
> # CONFIG_RCU_FAST_NO_HZ is not set
> CONFIG_RCU_NOCB_CPU=y
> # end of RCU Subsystem

And thank you again!

Thanx, Paul


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread David Miller
From: SeongJae Park 
Date: Tue, 5 May 2020 10:10:33 +0200

> From: SeongJae Park 
> 
> The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> deallocation of 'socket_alloc' be done asynchronously using RCU, the same
> as 'sock.wq'.  And the following commit 333f7909a857 ("coallocate
> socket_wq with socket itself") made those share the same life cycle.
> 
> The changes made the code much simpler, but also made 'socket_alloc' live
> longer than before.  For that reason, user programs intensively repeating
> allocations and deallocations of sockets could cause memory pressure on
> recent kernels.
> 
> To avoid the problem, this commit reverts the changes.

Series applied and queued up for -stable, thanks.
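
For readers unfamiliar with the pattern the cover letter refers to, here is a
minimal, illustrative sketch of RCU-deferred freeing; the 'foo' names are
placeholders and this is not the actual sockfs code:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
	/* payload ... */
};

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_release(struct foo *f)
{
	/*
	 * Instead of kfree(f) right away, defer the free until after an RCU
	 * grace period.  Under heavy allocate/free churn every object stays
	 * around until its grace period ends, so a slow or starved RCU can
	 * show up as memory pressure, which is what the revert addresses.
	 */
	call_rcu(&f->rcu, foo_free_rcu);
}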


Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 08:40:07PM +0200, SeongJae Park wrote:
> On Tue, 5 May 2020 11:27:20 -0700 "Paul E. McKenney"  
> wrote:
> 
> > On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
> > > On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney"  
> > > wrote:
> > > 
> > > > On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> > > > > 
> > > > > 
> > > > > On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > > > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet 
> > > > > >  wrote:
> > > > > > 
> > > > > >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  
> > > > > >> wrote:
> > > > > >>>
> > > > > >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > > > >>>  wrote:
> > > > > >>>
> > > > > 
> > > > > 
> > > > >  On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > > > >  wrote:
> > > > > >
> > > > > 
> > > > > >> Why do we have 10,000,000 objects around ? Could this be 
> > > > > >> because of
> > > > > >> some RCU problem ?
> > > > > >
> > > > > > Mainly because of a long RCU grace period, as you guessed.  I have
> > > > > > no idea how the grace period became so long in this case.
> > > > > >
> > > > > > As my test machine was a virtual machine instance, I guess an RCU
> > > > > > readers preemption[1]-like problem might have affected this.
> > > > > >
> > > > > > [1] 
> > > > > > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > > > 
> > > > If this is the root cause of the problem, then it will be necessary to
> > > > provide a hint to the hypervisor.  Or, in the near term, avoid loading
> > > > the hypervisor to the point that vCPU preemption is so lengthy.
> > > > 
> > > > RCU could also provide some sort of pre-stall-warning notification that
> > > > some of the CPUs aren't passing through quiescent states, which might
> > > > allow the guest OS's userspace to take corrective action.
> > > > 
> > > > But first, what are you doing to either confirm or invalidate the
> > > > hypothesis that this might be due to vCPU preemption?
> > > 
> > > Nothing, I was just guessing.  Sorry if this made you confused.
> > > 
> > > > 
> > > > > >> Once Al patches reverted, do you have 10,000,000 sock_alloc 
> > > > > >> around ?
> > > > > >
> > > > > > Yes; neither the old kernel prior to Al's patches nor the recent
> > > > > > kernel reverting Al's patches reproduced the problem.
> > > > > >
> > > > > 
> > > > >  I repeat my question : Do you have 10,000,000 (smaller) objects 
> > > > >  kept in slab caches ?
> > > > > 
> > > > >  TCP sockets use the (very complex, error prone) 
> > > > >  SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
> > > > >  object that was allocated in sock_alloc_inode() before Al 
> > > > >  patches.
> > > > > 
> > > > >  These objects should be visible in kmalloc-64 kmem cache.
> > > > > >>>
> > > > > >>> Not exactly 10,000,000, as that is only the highest possible
> > > > > >>> number, but I was able to observe a clear exponential increase in
> > > > > >>> the number of objects using slabtop.  Before the start of the
> > > > > >>> problematic workload, the number of 'kmalloc-64' objects was 5760,
> > > > > >>> but I was able to observe it increase to 1,136,576.
> > > > > >>>
> > > > > >>>           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > > > > >>> before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > > > > >>> after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> > > > > >>>
> > > > > >>
> > > > > >> Great, thanks.
> > > > > >>
> > > > > >> How recent is the kernel you are running for your experiment ?
> > > > > > 
> > > > > > It's based on 5.4.35.
> > > > 
> > > > Is it possible to retest on v5.6?  I have been adding various mechanisms
> > > > to make RCU keep up better with heavy callback overload.
> > > 
> > > I will try soon!
> > > 
> > > > 
> > > > Also, could you please provide the .config?  If either NO_HZ_FULL or
> > > > RCU_NOCB_CPU, please also provide the kernel boot parameters.
> > > 
> > > NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
> > 
> > OK, this is important information.
> > 
> > > I think I should check whether it's ok to share the full config and boot
> > parameters.  Please wait for this.
> > 
> > I probably don't need the whole thing.  So, if it makes it easier to
> > gain approval...
> > 
> > The main thing I need are CONFIG_PREEMPT and the various Kconfig options
> > having "RCU" in their names.  For example, I have no need for any of the
> > options pertaining to device drivers.  (As far as I know at the moment,
> > anyway!)
> > 
> > For the boot parameters, I am very interested in rcu_nocbs=.  A

Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 11:27:20 -0700 "Paul E. McKenney"  wrote:

> On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
> > On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney"  
> > wrote:
> > 
> > > On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> > > > 
> > > > 
> > > > On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > > > > wrote:
> > > > > 
> > > > >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  
> > > > >> wrote:
> > > > >>>
> > > > >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > > >>>  wrote:
> > > > >>>
> > > > 
> > > > 
> > > >  On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > > >  wrote:
> > > > >
> > > > 
> > > > >> Why do we have 10,000,000 objects around ? Could this be because 
> > > > >> of
> > > > >> some RCU problem ?
> > > > >
> > > > > Mainly because of a long RCU grace period, as you guessed.  I have
> > > > > no idea how the grace period became so long in this case.
> > > > >
> > > > > As my test machine was a virtual machine instance, I guess an RCU
> > > > > readers preemption[1]-like problem might have affected this.
> > > > >
> > > > > [1] 
> > > > > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > > 
> > > If this is the root cause of the problem, then it will be necessary to
> > > provide a hint to the hypervisor.  Or, in the near term, avoid loading
> > > the hypervisor to the point that vCPU preemption is so lengthy.
> > > 
> > > RCU could also provide some sort of pre-stall-warning notification that
> > > some of the CPUs aren't passing through quiescent states, which might
> > > allow the guest OS's userspace to take corrective action.
> > > 
> > > But first, what are you doing to either confirm or invalidate the
> > > hypothesis that this might be due to vCPU preemption?
> > 
> > Nothing, I was just guessing.  Sorry if this made you confused.
> > 
> > > 
> > > > >> Once Al patches reverted, do you have 10,000,000 sock_alloc 
> > > > >> around ?
> > > > >
> > > > > Yes; neither the old kernel prior to Al's patches nor the recent
> > > > > kernel reverting Al's patches reproduced the problem.
> > > > >
> > > > 
> > > >  I repeat my question : Do you have 10,000,000 (smaller) objects 
> > > >  kept in slab caches ?
> > > > 
> > > >  TCP sockets use the (very complex, error prone) 
> > > >  SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
> > > >  object that was allocated in sock_alloc_inode() before Al patches.
> > > > 
> > > >  These objects should be visible in kmalloc-64 kmem cache.
> > > > >>>
> > > > >>> Not exactly 10,000,000, as that is only the highest possible
> > > > >>> number, but I was able to observe a clear exponential increase in
> > > > >>> the number of objects using slabtop.  Before the start of the
> > > > >>> problematic workload, the number of 'kmalloc-64' objects was 5760,
> > > > >>> but I was able to observe it increase to 1,136,576.
> > > > >>>
> > > > >>>           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > > > >>> before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > > > >>> after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> > > > >>>
> > > > >>
> > > > >> Great, thanks.
> > > > >>
> > > > >> How recent is the kernel you are running for your experiment ?
> > > > > 
> > > > > It's based on 5.4.35.
> > > 
> > > Is it possible to retest on v5.6?  I have been adding various mechanisms
> > > to make RCU keep up better with heavy callback overload.
> > 
> > I will try soon!
> > 
> > > 
> > > Also, could you please provide the .config?  If either NO_HZ_FULL or
> > > RCU_NOCB_CPU, please also provide the kernel boot parameters.
> > 
> > NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
> 
> OK, this is important information.
> 
> > I think I should check whether it's ok to share the full config and boot
> > parameters.  Please wait for this.
> 
> I probably don't need the whole thing.  So, if it makes it easier to
> gain approval...
> 
> The main thing I need are CONFIG_PREEMPT and the various Kconfig options
> having "RCU" in their names.  For example, I have no need for any of the
> options pertaining to device drivers.  (As far as I know at the moment,
> anyway!)
> 
> For the boot parameters, I am very interested in rcu_nocbs=.  Along with
> any other boot parameters whose names contain "rcu".

I guess this would be ok.

It uses no 'rcu_nocbs=' boot parameter.  

The configs you asked are as below:

# CONFIG_PREEMPT is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONF

Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 11:17:07 -0700 "Paul E. McKenney"  wrote:

> On Tue, May 05, 2020 at 07:56:05PM +0200, SeongJae Park wrote:
> > On Tue, 5 May 2020 10:30:36 -0700 "Paul E. McKenney"  
> > wrote:
> > 
> > > On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
> > > > On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet  
> > > > wrote:
> > > > 
> > > > > 
> > > > > 
> > > > > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > > > > 
> > > > > > 
> > > > > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > > > > >>
> > > > > >>
> > > > > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet 
> > > > > >>>  wrote:
> > > > > >>>
> > > > >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > > > >  wrote:
> > > > > >
> > > > > >>
> > > > > >>
> > > > > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > > > >>>  wrote:
> > > > > >>>
> > > > > >>
> > > > [...]
> > > > > >>
> > > > > >> I would ask Paul's opinion on this issue, because we have many
> > > > > >> objects being freed after RCU grace periods.
> > > > > >>
> > > > > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > > > > >> also suffer.
> > > > > >>
> > > > > >> Sure, we can revert patches here and there trying to work around
> > > > > >> the issue, but for objects allocated from process context, we
> > > > > >> should not have these problems.
> > > > > >>
> > > > > > 
> > > > > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 
> > > > > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > > > > >  100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > > > > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > > > > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > > > > >  static bool rcu_kick_kthreads;
> > > > > > -static int rcu_divisor = 7;
> > > > > > +static int rcu_divisor = 6;
> > > > > >  module_param(rcu_divisor, int, 0644);
> > > > > >  
> > > > > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > > > > 
> > > > > 
> > > > > To be clear, you can adjust the value without building a new kernel.
> > > > > 
> > > > > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> > > > 
> > > > I tried values 6, 5, and 4, but none of those removed the problem.
> > > 
> > > Thank you for checking this!
> > > 
> > > Was your earlier discussion on long RCU readers speculation, or do you
> > > have measurements?
> > 
> > It was just a guess without any measurement or dedicated investigation.
> 
> OK, another thing to check is the duration of the low-memory episode.
> Does this duration exceed the RCU CPU stall warning time?  (21 seconds
> in mainline, 60 in many distros, but check rcupdate.rcu_cpu_stall_timeout
> to be sure.)

The benchmark takes about 36 seconds for 10,000 repeats of the test.

The value on the test machine is 60.

So the duration would not have exceeded the warning time, and therefore I
haven't seen the warning message.

As mentioned in the other mail, I will also adjust this value to a shorter one.

> 
> Also, any chance of a .config?  Or at least the RCU portions?  I am
> guessing CONFIG_PREEMPT=n, for example.

I guess this would be ok.

# CONFIG_PREEMPT is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem


Thanks,
SeongJae Park


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
> On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney"  
> wrote:
> 
> > On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> > > 
> > > 
> > > On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > > > wrote:
> > > > 
> > > >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> > > >>>
> > > >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > >>>  wrote:
> > > >>>
> > > 
> > > 
> > >  On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > >  wrote:
> > > >
> > > 
> > > >> Why do we have 10,000,000 objects around ? Could this be because of
> > > >> some RCU problem ?
> > > >
> > > > Mainly because of a long RCU grace period, as you guessed.  I have
> > > > no idea how the grace period became so long in this case.
> > > >
> > > > As my test machine was a virtual machine instance, I guess an RCU
> > > > readers preemption[1]-like problem might have affected this.
> > > >
> > > > [1] 
> > > > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > 
> > If this is the root cause of the problem, then it will be necessary to
> > provide a hint to the hypervisor.  Or, in the near term, avoid loading
> > the hypervisor to the point that vCPU preemption is so lengthy.
> > 
> > RCU could also provide some sort of pre-stall-warning notification that
> > some of the CPUs aren't passing through quiescent states, which might
> > allow the guest OS's userspace to take corrective action.
> > 
> > But first, what are you doing to either confirm or invalidate the
> > hypothesis that this might be due to vCPU preemption?
> 
> Nothing, I was just guessing.  Sorry if this made you confused.
> 
> > 
> > > >> Once Al patches reverted, do you have 10,000,000 sock_alloc around 
> > > >> ?
> > > >
> > > > Yes; neither the old kernel prior to Al's patches nor the recent
> > > > kernel reverting Al's patches reproduced the problem.
> > > >
> > > 
> > >  I repeat my question : Do you have 10,000,000 (smaller) objects kept 
> > >  in slab caches ?
> > > 
> > >  TCP sockets use the (very complex, error prone) 
> > >  SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
> > >  object that was allocated in sock_alloc_inode() before Al patches.
> > > 
> > >  These objects should be visible in kmalloc-64 kmem cache.
> > > >>>
> > > >>> Not exactly 10,000,000, as that is only the highest possible
> > > >>> number, but I was able to observe a clear exponential increase in
> > > >>> the number of objects using slabtop.  Before the start of the
> > > >>> problematic workload, the number of 'kmalloc-64' objects was 5760,
> > > >>> but I was able to observe it increase to 1,136,576.
> > > >>>
> > > >>>           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > > >>> before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > > >>> after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> > > >>>
> > > >>
> > > >> Great, thanks.
> > > >>
> > > >> How recent is the kernel you are running for your experiment ?
> > > > 
> > > > It's based on 5.4.35.
> > 
> > Is it possible to retest on v5.6?  I have been adding various mechanisms
> > to make RCU keep up better with heavy callback overload.
> 
> I will try soon!
> 
> > 
> > Also, could you please provide the .config?  If either NO_HZ_FULL or
> > RCU_NOCB_CPU, please also provide the kernel boot parameters.
> 
> NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.

OK, this is important information.

> I think I should check whether it's ok to share the full config and boot
> parameters.  Please wait for this.

I probably don't need the whole thing.  So, if it makes it easier to
gain approval...

The main thing I need are CONFIG_PREEMPT and the various Kconfig options
having "RCU" in their names.  For example, I have no need for any of the
options pertaining to device drivers.  (As far as I know at the moment,
anyway!)

For the boot parameters, I am very interested in rcu_nocbs=.  Along with
any other boot parameters whose names contain "rcu".

If rcu_nocbs= does designate any CPUs, another thing to check
is where the rcuo kthreads are permitted to run.  The reason that this
is important is that any CPU listed in the rcu_nocbs= boot parameter
has its RCU callbacks invoked by one of the rcuo kthreads.  If you have
booted with (say) "rcu_nocbs=1,63" and then bound all of the resulting
rcuo kthreads to CPU 0, you just tied RCU's hands, making it unable to
keep up with any reasonable RCU callback load.

This sort of configuration is permitted, but it is intended for tightly
controlled real-time or HPC systems whose config

Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 07:56:05PM +0200, SeongJae Park wrote:
> On Tue, 5 May 2020 10:30:36 -0700 "Paul E. McKenney"  
> wrote:
> 
> > On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
> > > On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet  
> > > wrote:
> > > 
> > > > 
> > > > 
> > > > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > > > 
> > > > > 
> > > > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > > > >>
> > > > >>
> > > > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet 
> > > > >>>  wrote:
> > > > >>>
> > > >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  
> > > >  wrote:
> > > > >
> > > > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > > >  wrote:
> > > > >
> > > > >>
> > > > >>
> > > > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > > >>>  wrote:
> > > > >>>
> > > > >>
> > > [...]
> > > > >>
> > > > >> I would ask Paul's opinion on this issue, because we have many
> > > > >> objects being freed after RCU grace periods.
> > > > >>
> > > > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > > > >> also suffer.
> > > > >>
> > > > >> Sure, we can revert patches here and there trying to work around
> > > > >> the issue, but for objects allocated from process context, we
> > > > >> should not have these problems.
> > > > >>
> > > > > 
> > > > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 
> > > > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > > > >  100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > > > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > > > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > > > >  static bool rcu_kick_kthreads;
> > > > > -static int rcu_divisor = 7;
> > > > > +static int rcu_divisor = 6;
> > > > >  module_param(rcu_divisor, int, 0644);
> > > > >  
> > > > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > > > 
> > > > 
> > > > To be clear, you can adjust the value without building a new kernel.
> > > > 
> > > > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> > > 
> > > I tried values 6, 5, and 4, but none of those removed the problem.
> > 
> > Thank you for checking this!
> > 
> > Was your earlier discussion on long RCU readers speculation, or do you
> > have measurements?
> 
> It was just a guess without any measurement or dedicated investigation.

OK, another thing to check is the duration of the low-memory episode.
Does this duration exceed the RCU CPU stall warning time?  (21 seconds
in mainline, 60 in many distros, but check rcupdate.rcu_cpu_stall_timeout
to be sure.)
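
A quick way to check the current value, assuming the standard sysfs location
of the parameter, is simply to read it; a small sketch (the path is the usual
one, but verify it on the machine in question):

#include <stdio.h>

int main(void)
{
	/* Standard location of the parameter; the value is in seconds. */
	const char *p = "/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout";
	FILE *f = fopen(p, "r");
	int timeout;

	if (!f) {
		perror(p);
		return 1;
	}
	if (fscanf(f, "%d", &timeout) == 1)
		printf("RCU CPU stall warning timeout: %d seconds\n", timeout);
	fclose(f);
	return 0;
}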

Also, any chance of a .config?  Or at least the RCU portions?  I am
guessing CONFIG_PREEMPT=n, for example.

Thanx, Paul


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 10:28:50 -0700 "Paul E. McKenney"  wrote:

> On Tue, May 05, 2020 at 09:37:42AM -0700, Eric Dumazet wrote:
> > 
> > 
> > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > >>
> > >>
> > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > >>> wrote:
> > >>>
> >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> > >
> > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > >  wrote:
> > >
> > >>
> > >>
> > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > >>>  wrote:
> > >>>
> > >>
> >  Why do we have 10,000,000 objects around ? Could this be because of
> >  some RCU problem ?
> > >>>
> > >>> Mainly because of a long RCU grace period, as you guessed.  I have
> > >>> no idea how the grace period became so long in this case.
> > >>>
> > >>> As my test machine was a virtual machine instance, I guess an RCU
> > >>> readers preemption[1]-like problem might have affected this.
> > >>>
> > >>> [1] 
> > >>> https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > >>>
> > 
> >  Once Al patches reverted, do you have 10,000,000 sock_alloc around 
> >  ?
> > >>>
> > >>> Yes; neither the old kernel prior to Al's patches nor the recent
> > >>> kernel reverting Al's patches reproduced the problem.
> > >>>
> > >>
> > >> I repeat my question : Do you have 10,000,000 (smaller) objects kept 
> > >> in slab caches ?
> > >>
> > >> TCP sockets use the (very complex, error prone) 
> > >> SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
> > >> object that was allocated in sock_alloc_inode() before Al patches.
> > >>
> > >> These objects should be visible in kmalloc-64 kmem cache.
> > >
> > > Not exactly 10,000,000, as that is only the highest possible
> > > number, but I was able to observe a clear exponential increase in
> > > the number of objects using slabtop.  Before the start of the
> > > problematic workload, the number of 'kmalloc-64' objects was 5760,
> > > but I was able to observe it increase to 1,136,576.
> > >
> > >           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > > before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > > after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> > >
> > 
> >  Great, thanks.
> > 
> >  How recent is the kernel you are running for your experiment ?
> > >>>
> > >>> It's based on 5.4.35.
> > >>>
> > 
> >  Let's make sure the bug is not in RCU.
> > >>>
> > >>> One thing I can currently say is that the grace period does pass in the
> > >>> end.  I modified the benchmark to repeat the test not 10,000 times but
> > >>> only 5,000 times, to run it without OOM but with easily observable
> > >>> memory pressure.  As soon as the benchmark finishes, the memory was freed.
> > >>>
> > >>> If you need more tests, please let me know.
> > >>>
> > >>
> > >> I would ask Paul's opinion on this issue, because we have many
> > >> objects being freed after RCU grace periods.
> > >>
> > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > >> also suffer.
> > >>
> > >> Sure, we can revert patches here and there trying to work around
> > >> the issue, but for objects allocated from process context, we
> > >> should not have these problems.
> > >>
> > > 
> > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 
> > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > >  100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > >  static bool rcu_kick_kthreads;
> > > -static int rcu_divisor = 7;
> > > +static int rcu_divisor = 6;
> > >  module_param(rcu_divisor, int, 0644);
> > >  
> > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > 
> > 
> > To be clear, you can adjust the value without building a new kernel.
> > 
> > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> 
> Worth a try!  If that helps significantly, I have some ideas for updating
> that heuristic, such as checking for sudden increases in the number of
> pending callbacks.
> 
> But I would really also like to know whether there are long readers and
> whether v5.6 fares better.

I will share the results as soon as possible :)


Thanks,
SeongJae Park

> 
>   

Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 10:30:36 -0700 "Paul E. McKenney"  wrote:

> On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
> > On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet  
> > wrote:
> > 
> > > 
> > > 
> > > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > > 
> > > > 
> > > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > > >>
> > > >>
> > > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > > >>> wrote:
> > > >>>
> > >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  
> > >  wrote:
> > > >
> > > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > > >  wrote:
> > > >
> > > >>
> > > >>
> > > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > > >>>  wrote:
> > > >>>
> > > >>
> > [...]
> > > >>
> > > >> I would ask Paul's opinion on this issue, because we have many
> > > >> objects being freed after RCU grace periods.
> > > >>
> > > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > > >> also suffer.
> > > >>
> > > >> Sure, we can revert patches here and there trying to work around
> > > >> the issue, but for objects allocated from process context, we
> > > >> should not have these problems.
> > > >>
> > > > 
> > > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 
> > > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > > >  100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > > >  static bool rcu_kick_kthreads;
> > > > -static int rcu_divisor = 7;
> > > > +static int rcu_divisor = 6;
> > > >  module_param(rcu_divisor, int, 0644);
> > > >  
> > > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > > 
> > > 
> > > To be clear, you can adjust the value without building a new kernel.
> > > 
> > > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> > 
> > I tried values 6, 5, and 4, but none of those removed the problem.
> 
> Thank you for checking this!
> 
> Was your earlier discussion on long RCU readers speculation, or do you
> have measurements?

It was just a guess without any measurement or dedicated investigation.


Thanks,
SeongJae Park

> 
>   Thanx, Paul


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney"  wrote:

> On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> > 
> > 
> > On 5/5/20 9:13 AM, SeongJae Park wrote:
> > > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > > wrote:
> > > 
> > >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> > >>>
> > >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> > >>> wrote:
> > >>>
> > 
> > 
> >  On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> > > wrote:
> > >
> > 
> > >> Why do we have 10,000,000 objects around ? Could this be because of
> > >> some RCU problem ?
> > >
> > > Mainly because of a long RCU grace period, as you guessed.  I have
> > > no idea how the grace period became so long in this case.
> > >
> > > As my test machine was a virtual machine instance, I guess an RCU
> > > readers preemption[1]-like problem might have affected this.
> > >
> > > [1] 
> > > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> 
> If this is the root cause of the problem, then it will be necessary to
> provide a hint to the hypervisor.  Or, in the near term, avoid loading
> the hypervisor to the point that vCPU preemption is so lengthy.
> 
> RCU could also provide some sort of pre-stall-warning notification that
> some of the CPUs aren't passing through quiescent states, which might
> allow the guest OS's userspace to take corrective action.
> 
> But first, what are you doing to either confirm or invalidate the
> hypothesis that this might be due to vCPU preemption?

Nothing, I was just guessing.  Sorry if this made you confused.

> 
> > >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> > >
> > > Yes; neither the old kernel prior to Al's patches nor the recent
> > > kernel reverting Al's patches reproduced the problem.
> > >
> > 
> >  I repeat my question : Do you have 10,000,000 (smaller) objects kept 
> >  in slab caches ?
> > 
> >  TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, 
> >  but not the struct socket_wq
> >  object that was allocated in sock_alloc_inode() before Al patches.
> > 
> >  These objects should be visible in kmalloc-64 kmem cache.
> > >>>
> > >>> Not exactly 10,000,000, as that is only the highest possible
> > >>> number, but I was able to observe a clear exponential increase in
> > >>> the number of objects using slabtop.  Before the start of the
> > >>> problematic workload, the number of 'kmalloc-64' objects was 5760,
> > >>> but I was able to observe it increase to 1,136,576.
> > >>>
> > >>>           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > >>> before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > >>> after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> > >>>
> > >>
> > >> Great, thanks.
> > >>
> > >> How recent is the kernel you are running for your experiment ?
> > > 
> > > It's based on 5.4.35.
> 
> Is it possible to retest on v5.6?  I have been adding various mechanisms
> to make RCU keep up better with heavy callback overload.

I will try soon!

> 
> Also, could you please provide the .config?  If either NO_HZ_FULL or
> RCU_NOCB_CPU, please also provide the kernel boot parameters.

NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.

I think I should check whether it's ok to share the full config and boot
parameters.  Please wait for this.

> 
> > >> Let's make sure the bug is not in RCU.
> > > 
> > > One thing I can currently say is that the grace period does pass in the
> > > end.  I modified the benchmark to repeat the test not 10,000 times but
> > > only 5,000 times, to run it without OOM but with easily observable
> > > memory pressure.  As soon as the benchmark finishes, the memory was freed.
> > > 
> > > If you need more tests, please let me know.
> > 
> > I would ask Paul's opinion on this issue, because we have many objects
> > being freed after RCU grace periods.
> 
> As always, "It depends."
> 
> o If the problem is a too-long RCU reader, RCU is prohibited from
>   ending the grace period.  The reader duration must be shortened,
>   and until it is shortened, there is nothing RCU can do.
> 
> o In some special cases of the above, RCU can and does help, for
>   example, by enlisting the aid of cond_resched().  So perhaps
>   there is a long in-kernel loop that needs a cond_resched() (see the
>   sketch just after this list).
> 
>   And perhaps RCU can help for some types of vCPU preemption.
> 
> o As Al suggested offline and as has been discussed in the past,
>   it would not be hard to cause RCU to burn CPU to attain faster
>   grace periods during OOM events.  This could be helpful, but only
>   given that RCU readers are completing in reasonable timeframes.
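
To make the cond_resched() bullet concrete, here is a sketch under the
assumption of a CONFIG_PREEMPT=n kernel (the configuration reported in this
thread); the item structure and handler are purely illustrative:

#include <linux/list.h>
#include <linux/sched.h>

struct item {
	struct list_head node;
	/* payload ... */
};

static void handle_one_item(struct item *it)
{
	/* hypothetical per-item work */
}

static void process_all_items(struct list_head *items)
{
	struct item *it;

	list_for_each_entry(it, items, node) {
		handle_one_item(it);
		/*
		 * Without this, a huge list keeps the CPU out of the scheduler
		 * and, on CONFIG_PREEMPT=n, out of any quiescent state, so the
		 * current RCU grace period (and every deferred free behind it)
		 * must wait for the whole loop.
		 */
		cond_resched();
	}
}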

Totally

Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
> On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet  wrote:
> 
> > 
> > 
> > On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> > >>
> > >>
> > >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> > >>> wrote:
> > >>>
> >  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> > >
> > > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet 
> > >  wrote:
> > >
> > >>
> > >>
> > >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet 
> > >>>  wrote:
> > >>>
> > >>
> [...]
> > >>
> > >> I would ask Paul's opinion on this issue, because we have many
> > >> objects being freed after RCU grace periods.
> > >>
> > >> If the RCU subsystem cannot keep up, I guess other workloads will
> > >> also suffer.
> > >>
> > >> Sure, we can revert patches here and there trying to work around
> > >> the issue, but for objects allocated from process context, we
> > >> should not have these problems.
> > >>
> > > 
> > > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 
> > > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> > >  100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> > >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> > >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> > >  static bool rcu_kick_kthreads;
> > > -static int rcu_divisor = 7;
> > > +static int rcu_divisor = 6;
> > >  module_param(rcu_divisor, int, 0644);
> > >  
> > >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > > 
> > 
> > To be clear, you can adjust the value without building a new kernel.
> > 
> > echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> 
> I tried values 6, 5, and 4, but none of those removed the problem.

Thank you for checking this!

Was your earlier discussion on long RCU readers speculation, or do you
have measurements?

Thanx, Paul


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 09:37:42AM -0700, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > 
> > 
> > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> >>
> >>
> >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> >>> wrote:
> >>>
>  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> >
> > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> > wrote:
> >
> >>
> >>
> >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> >>> wrote:
> >>>
> >>
>  Why do we have 10,000,000 objects around ? Could this be because of
>  some RCU problem ?
> >>>
> >>> Mainly because of a long RCU grace period, as you guessed.  I have
> >>> no idea how the grace period became so long in this case.
> >>>
> >>> As my test machine was a virtual machine instance, I guess an RCU
> >>> readers preemption[1]-like problem might have affected this.
> >>>
> >>> [1] 
> >>> https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> >>>
> 
>  Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> >>>
> >>> Yes; neither the old kernel prior to Al's patches nor the recent
> >>> kernel reverting Al's patches reproduced the problem.
> >>>
> >>
> >> I repeat my question : Do you have 10,000,000 (smaller) objects kept 
> >> in slab caches ?
> >>
> >> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, 
> >> but not the struct socket_wq
> >> object that was allocated in sock_alloc_inode() before Al patches.
> >>
> >> These objects should be visible in kmalloc-64 kmem cache.
> >
> > Not exactly 10,000,000, as that is only the highest possible
> > number, but I was able to observe a clear exponential increase in
> > the number of objects using slabtop.  Before the start of the
> > problematic workload, the number of 'kmalloc-64' objects was 5760,
> > but I was able to observe it increase to 1,136,576.
> >
> >           OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > before:   5760    5088  88%    0.06K     90       64      360K  kmalloc-64
> > after:  1136576 1136576 100%    0.06K  17759       64    71036K  kmalloc-64
> >
> 
>  Great, thanks.
> 
>  How recent is the kernel you are running for your experiment ?
> >>>
> >>> It's based on 5.4.35.
> >>>
> 
>  Let's make sure the bug is not in RCU.
> >>>
> >>> One thing I can currently say is that the grace period does pass in the
> >>> end.  I modified the benchmark to repeat the test not 10,000 times but
> >>> only 5,000 times, to run it without OOM but with easily observable
> >>> memory pressure.  As soon as the benchmark finishes, the memory was freed.
> >>>
> >>> If you need more tests, please let me know.
> >>>
> >>
> >> I would ask Paul's opinion on this issue, because we have many
> >> objects being freed after RCU grace periods.
> >>
> >> If the RCU subsystem cannot keep up, I guess other workloads will also
> >> suffer.
> >>
> >> Sure, we can revert patches here and there trying to work around
> >> the issue, but for objects allocated from process context, we
> >> should not have these problems.
> >>
> > 
> > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 
> > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> >  100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> >  static bool rcu_kick_kthreads;
> > -static int rcu_divisor = 7;
> > +static int rcu_divisor = 6;
> >  module_param(rcu_divisor, int, 0644);
> >  
> >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > 
> 
> To be clear, you can adjust the value without building a new kernel.
> 
> echo 6 >/sys/module/rcutree/parameters/rcu_divisor

Worth a try!  If that helps significantly, I have some ideas for updating
that heuristic, such as checking for sudden increases in the number of
pending callbacks.

But I would really also like to know whether there are long readers and
whether v5.6 fares better.

Thanx, Paul


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Paul E. McKenney
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:13 AM, SeongJae Park wrote:
> > On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
> > 
> >> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> >>>
> >>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> >>> wrote:
> >>>
> 
> 
>  On 5/5/20 8:07 AM, SeongJae Park wrote:
> > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> > wrote:
> >
> 
> >> Why do we have 10,000,000 objects around ? Could this be because of
> >> some RCU problem ?
> >
> > Mainly because of a long RCU grace period, as you guess.  I have no 
> > idea how
> > the grace period became so long in this case.
> >
> > As my test machine was a virtual machine instance, I guess RCU readers
> > preemption[1] like problem might affected this.
> >
> > [1] 
> > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf

If this is the root cause of the problem, then it will be necessary to
provide a hint to the hypervisor.  Or, in the near term, avoid loading
the hypervisor to the point that vCPU preemption is so lengthy.

RCU could also provide some sort of pre-stall-warning notification that
some of the CPUs aren't passing through quiescent states, which might
allow the guest OS's userspace to take corrective action.

But first, what are you doing to either confirm or invalidate the
hypothesis that this might be due to vCPU preemption?
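
(For reference, one way to check this hypothesis from inside the guest is to
watch the "steal" time the hypervisor reports.  Below is a minimal userspace
sketch, not from this thread; it assumes the usual /proc/stat field order of
user nice system idle iowait irq softirq steal.)

/*
 * Sample the "steal" field of the aggregate "cpu" line in /proc/stat twice
 * and report the delta.  A steadily growing value suggests the vCPU is being
 * preempted by the host while it is runnable.
 */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_steal_ticks(void)
{
	unsigned long long f[8] = { 0 };
	FILE *fp = fopen("/proc/stat", "r");

	if (!fp)
		return 0;
	if (fscanf(fp, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]) != 8)
		f[7] = 0;
	fclose(fp);
	return f[7];	/* steal time, in USER_HZ ticks */
}

int main(void)
{
	unsigned long long before = read_steal_ticks();

	sleep(10);
	printf("steal ticks over 10 seconds: %llu\n",
	       read_steal_ticks() - before);
	return 0;
}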

> >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> >
> > Yes, both the old kernel that prior to Al's patches and the recent 
> > kernel
> > reverting the Al's patches didn't reproduce the problem.
> >
> 
>  I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
>  slab caches ?
> 
>  TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, 
>  but not the struct socket_wq
>  object that was allocated in sock_alloc_inode() before Al patches.
> 
>  These objects should be visible in kmalloc-64 kmem cache.
> >>>
> >>> Not exactly the 10,000,000, as it is only the possible highest number, 
> >>> but I
> >>> was able to observe clear exponential increase of the number of the 
> >>> objects
> >>> using slabtop.  Before the start of the problematic workload, the number 
> >>> of
> >>> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> >>> increase
> >>> to 1,136,576.
> >>>
> >>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>> before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
> >>> after:  1136576 1136576 100%0.06K  17759   64 71036K 
> >>> kmalloc-64
> >>>
> >>
> >> Great, thanks.
> >>
> >> How recent is the kernel you are running for your experiment ?
> > 
> > It's based on 5.4.35.

Is it possible to retest on v5.6?  I have been adding various mechanisms
to make RCU keep up better with heavy callback overload.

Also, could you please provide the .config?  If either NO_HZ_FULL or
RCU_NOCB_CPU is set, please also provide the kernel boot parameters.

> >> Let's make sure the bug is not in RCU.
> > 
> > One thing I can currently say is that the grace period passes at last.  I
> > modified the benchmark to repeat not 10,000 times but only 5,000 times to 
> > run
> > the test without OOM but easily observable memory pressure.  As soon as the
> > benchmark finishes, the memory were freed.
> > 
> > If you need more tests, please let me know.
> 
> I would ask Paul opinion on this issue, because we have many objects
> being freed after RCU grace periods.

As always, "It depends."

o   If the problem is a too-long RCU reader, RCU is prohibited from
ending the grace period.  The reader duration must be shortened,
and until it is shortened, there is nothing RCU can do.

o   In some special cases of the above, RCU can and does help, for
example, by enlisting the aid of cond_resched().  So perhaps
there is a long in-kernel loop that needs a cond_resched() (a
minimal sketch of that pattern follows after this list).

And perhaps RCU can help for some types of vCPU preemption.

o   As Al suggested offline and as has been discussed in the past,
it would not be hard to cause RCU to burn CPU to attain faster
grace periods during OOM events.  This could be helpful, but only
given that RCU readers are completing in reasonable timeframes.
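
(To make the cond_resched() bullet above concrete, here is an illustrative
kernel-style sketch, not code from this thread; 'struct item', 'process_one',
and 'process_all' are made-up names.  On a non-preemptible kernel, a long
in-kernel loop like this can hold up grace periods, and the periodic
cond_resched() gives the scheduler, and thus RCU, a chance to make progress.)

#include <linux/sched.h>

struct item {
	int payload;
};

static void process_one(struct item *it)
{
	it->payload++;
}

static void process_all(struct item *items, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		process_one(&items[i]);
		if ((i & 1023) == 0)
			cond_resched();
	}
}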

> If RCU subsystem can not keep-up, I guess other workloads will also suffer.

If readers are not excessively long, RCU should be able to keep up.
(In the absence of misconfigurations, for example, both NO_HZ_FULL and
then binding all the rcuo kthreads to a single CPU on a 100-CPU system
or some such.)

> Sure, we can revert patches there and there trying to work around the issue,
> but for objects allocated from process context, we should not have these 
> problems.

Agreed, let's get more info on what is happening to RCU.

One app

Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet  wrote:

> 
> 
> On 5/5/20 9:31 AM, Eric Dumazet wrote:
> > 
> > 
> > On 5/5/20 9:25 AM, Eric Dumazet wrote:
> >>
> >>
> >> On 5/5/20 9:13 AM, SeongJae Park wrote:
> >>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  
> >>> wrote:
> >>>
>  On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> >
> > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> > wrote:
> >
> >>
> >>
> >> On 5/5/20 8:07 AM, SeongJae Park wrote:
> >>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> >>> wrote:
> >>>
> >>
[...]
> >>
> >> I would ask Paul opinion on this issue, because we have many objects
> >> being freed after RCU grace periods.
> >>
> >> If RCU subsystem can not keep-up, I guess other workloads will also suffer.
> >>
> >> Sure, we can revert patches there and there trying to work around the 
> >> issue,
> >> but for objects allocated from process context, we should not have these 
> >> problems.
> >>
> > 
> > I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 
> > d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
> >  100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
> >  static ulong jiffies_till_first_fqs = ULONG_MAX;
> >  static ulong jiffies_till_next_fqs = ULONG_MAX;
> >  static bool rcu_kick_kthreads;
> > -static int rcu_divisor = 7;
> > +static int rcu_divisor = 6;
> >  module_param(rcu_divisor, int, 0644);
> >  
> >  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> > 
> 
> To be clear, you can adjust the value without building a new kernel.
> 
> echo 6 >/sys/module/rcutree/parameters/rcu_divisor
> 

I tried values 6, 5, and 4, but none of those removed the problem.


Thanks,
SeongJae Park


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:31 AM, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:25 AM, Eric Dumazet wrote:
>>
>>
>> On 5/5/20 9:13 AM, SeongJae Park wrote:
>>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
>>>
 On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>
> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> wrote:
>
>>
>>
>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
>>> wrote:
>>>
>>
 Why do we have 10,000,000 objects around ? Could this be because of
 some RCU problem ?
>>>
>>> Mainly because of a long RCU grace period, as you guess.  I have no 
>>> idea how
>>> the grace period became so long in this case.
>>>
>>> As my test machine was a virtual machine instance, I guess RCU readers
>>> preemption[1] like problem might affected this.
>>>
>>> [1] 
>>> https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>>

 Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>>
>>> Yes, both the old kernel that prior to Al's patches and the recent 
>>> kernel
>>> reverting the Al's patches didn't reproduce the problem.
>>>
>>
>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
>> slab caches ?
>>
>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, 
>> but not the struct socket_wq
>> object that was allocated in sock_alloc_inode() before Al patches.
>>
>> These objects should be visible in kmalloc-64 kmem cache.
>
> Not exactly the 10,000,000, as it is only the possible highest number, 
> but I
> was able to observe clear exponential increase of the number of the 
> objects
> using slabtop.  Before the start of the problematic workload, the number 
> of
> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> increase
> to 1,136,576.
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
> after:  1136576 1136576 100%0.06K  17759   64 71036K 
> kmalloc-64
>

 Great, thanks.

 How recent is the kernel you are running for your experiment ?
>>>
>>> It's based on 5.4.35.
>>>

 Let's make sure the bug is not in RCU.
>>>
>>> One thing I can currently say is that the grace period passes at last.  I
>>> modified the benchmark to repeat not 10,000 times but only 5,000 times to 
>>> run
>>> the test without OOM but easily observable memory pressure.  As soon as the
>>> benchmark finishes, the memory were freed.
>>>
>>> If you need more tests, please let me know.
>>>
>>
>> I would ask Paul opinion on this issue, because we have many objects
>> being freed after RCU grace periods.
>>
>> If RCU subsystem can not keep-up, I guess other workloads will also suffer.
>>
>> Sure, we can revert patches there and there trying to work around the issue,
>> but for objects allocated from process context, we should not have these 
>> problems.
>>
> 
> I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 
> d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
>  100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
>  static ulong jiffies_till_first_fqs = ULONG_MAX;
>  static ulong jiffies_till_next_fqs = ULONG_MAX;
>  static bool rcu_kick_kthreads;
> -static int rcu_divisor = 7;
> +static int rcu_divisor = 6;
>  module_param(rcu_divisor, int, 0644);
>  
>  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> 

To be clear, you can adjust the value without building a new kernel.

echo 6 >/sys/module/rcutree/parameters/rcu_divisor




Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:25 AM, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:13 AM, SeongJae Park wrote:
>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
>>
>>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:

 On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
 wrote:

>
>
> On 5/5/20 8:07 AM, SeongJae Park wrote:
>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
>> wrote:
>>
>
>>> Why do we have 10,000,000 objects around ? Could this be because of
>>> some RCU problem ?
>>
>> Mainly because of a long RCU grace period, as you guess.  I have no idea 
>> how
>> the grace period became so long in this case.
>>
>> As my test machine was a virtual machine instance, I guess RCU readers
>> preemption[1] like problem might affected this.
>>
>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>
>>>
>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>
>> Yes, both the old kernel that prior to Al's patches and the recent kernel
>> reverting the Al's patches didn't reproduce the problem.
>>
>
> I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
> slab caches ?
>
> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
> not the struct socket_wq
> object that was allocated in sock_alloc_inode() before Al patches.
>
> These objects should be visible in kmalloc-64 kmem cache.

 Not exactly the 10,000,000, as it is only the possible highest number, but 
 I
 was able to observe clear exponential increase of the number of the objects
 using slabtop.  Before the start of the problematic workload, the number of
 objects of 'kmalloc-64' was 5760, but I was able to observe the number 
 increase
 to 1,136,576.

   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
 after:  1136576 1136576 100%0.06K  17759   64 71036K kmalloc-64

>>>
>>> Great, thanks.
>>>
>>> How recent is the kernel you are running for your experiment ?
>>
>> It's based on 5.4.35.
>>
>>>
>>> Let's make sure the bug is not in RCU.
>>
>> One thing I can currently say is that the grace period passes at last.  I
>> modified the benchmark to repeat not 10,000 times but only 5,000 times to run
>> the test without OOM but easily observable memory pressure.  As soon as the
>> benchmark finishes, the memory were freed.
>>
>> If you need more tests, please let me know.
>>
> 
> I would ask Paul opinion on this issue, because we have many objects
> being freed after RCU grace periods.
> 
> If RCU subsystem can not keep-up, I guess other workloads will also suffer.
> 
> Sure, we can revert patches there and there trying to work around the issue,
> but for objects allocated from process context, we should not have these 
> problems.
> 

I wonder if simply adjusting rcu_divisor to 6 or 5 would help 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
 static ulong jiffies_till_first_fqs = ULONG_MAX;
 static ulong jiffies_till_next_fqs = ULONG_MAX;
 static bool rcu_kick_kthreads;
-static int rcu_divisor = 7;
+static int rcu_divisor = 6;
 module_param(rcu_divisor, int, 0644);
 
 /* Force an exit from rcu_do_batch() after 3 milliseconds. */



Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Al Viro
On Tue, May 05, 2020 at 09:00:44AM -0700, Eric Dumazet wrote:

> > Not exactly the 10,000,000, as it is only the possible highest number, but I
> > was able to observe clear exponential increase of the number of the objects
> > using slabtop.  Before the start of the problematic workload, the number of
> > objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> > increase
> > to 1,136,576.
> >
> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
> > after:  1136576 1136576 100%0.06K  17759   64 71036K kmalloc-64
> >
> 
> Great, thanks.
> 
> How recent is the kernel you are running for your experiment ?
> 
> Let's make sure the bug is not in RCU.
> 
> After Al changes, RCU got slightly better under stress.

The thing that worries me here is that this is far from being the only
source of RCU-delayed freeing of objects.  If we really see bogus OOM
kills due to that (IRL, not in an artificial microbenchmark), we'd
better do something that would help with all those sources, not just
paper over the contributions from one of those.  Because there's no
chance in hell to get rid of RCU-delayed freeing in general...

Does the problem extend to kfree_rcu()?  And there are a lot of RCU
callbacks that boil down to kmem_cache_free(); those really look like
they should have exactly the same issue - sock_free_inode() is one of those,
after all.
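
(For readers following the two patterns Al mentions, here is an illustrative
kernel-style sketch, not from this thread, with made-up names: a call_rcu()
callback that boils down to kmem_cache_free(), the same shape as
sock_free_inode(), and a kfree_rcu() for a kmalloc()'ed object.)

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>

/* Made-up slab-cache-backed object, for illustration only. */
struct foo {
	int data;
	struct rcu_head rcu;
};

static struct kmem_cache *foo_cachep;	/* created elsewhere via kmem_cache_create() */

/* Pattern 1: an RCU callback that just does kmem_cache_free(). */
static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	kmem_cache_free(foo_cachep, f);
}

static void foo_release(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);	/* freed only after a grace period */
}

/* Pattern 2: kfree_rcu() for a kmalloc()'ed object. */
struct bar {
	int data;
	struct rcu_head rcu;
};

static void bar_release(struct bar *b)
{
	kfree_rcu(b, rcu);			/* also deferred to a grace period */
}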


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:13 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
> 
>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>>>
>>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
>>> wrote:
>>>


 On 5/5/20 8:07 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> wrote:
>

>> Why do we have 10,000,000 objects around ? Could this be because of
>> some RCU problem ?
>
> Mainly because of a long RCU grace period, as you guess.  I have no idea 
> how
> the grace period became so long in this case.
>
> As my test machine was a virtual machine instance, I guess RCU readers
> preemption[1] like problem might affected this.
>
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>
>>
>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>
> Yes, both the old kernel that prior to Al's patches and the recent kernel
> reverting the Al's patches didn't reproduce the problem.
>

 I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
 slab caches ?

 TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
 not the struct socket_wq
 object that was allocated in sock_alloc_inode() before Al patches.

 These objects should be visible in kmalloc-64 kmem cache.
>>>
>>> Not exactly the 10,000,000, as it is only the possible highest number, but I
>>> was able to observe clear exponential increase of the number of the objects
>>> using slabtop.  Before the start of the problematic workload, the number of
>>> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
>>> increase
>>> to 1,136,576.
>>>
>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
>>> after:  1136576 1136576 100%0.06K  17759   64 71036K kmalloc-64
>>>
>>
>> Great, thanks.
>>
>> How recent is the kernel you are running for your experiment ?
> 
> It's based on 5.4.35.
> 
>>
>> Let's make sure the bug is not in RCU.
> 
> One thing I can currently say is that the grace period passes at last.  I
> modified the benchmark to repeat not 10,000 times but only 5,000 times to run
> the test without OOM but easily observable memory pressure.  As soon as the
> benchmark finishes, the memory were freed.
> 
> If you need more tests, please let me know.
> 

I would ask Paul's opinion on this issue, because we have many objects
being freed after RCU grace periods.

If the RCU subsystem cannot keep up, I guess other workloads will also suffer.

Sure, we can revert patches here and there trying to work around the issue,
but for objects allocated from process context, we should not have these
problems.





Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:

> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
> >
> > On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
> > wrote:
> >
> > >
> > >
> > > On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> > > > wrote:
> > > >
> > >
> > > >> Why do we have 10,000,000 objects around ? Could this be because of
> > > >> some RCU problem ?
> > > >
> > > > Mainly because of a long RCU grace period, as you guess.  I have no 
> > > > idea how
> > > > the grace period became so long in this case.
> > > >
> > > > As my test machine was a virtual machine instance, I guess RCU readers
> > > > preemption[1] like problem might affected this.
> > > >
> > > > [1] 
> > > > https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > > >
> > > >>
> > > >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> > > >
> > > > Yes, both the old kernel that prior to Al's patches and the recent 
> > > > kernel
> > > > reverting the Al's patches didn't reproduce the problem.
> > > >
> > >
> > > I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
> > > slab caches ?
> > >
> > > TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
> > > not the struct socket_wq
> > > object that was allocated in sock_alloc_inode() before Al patches.
> > >
> > > These objects should be visible in kmalloc-64 kmem cache.
> >
> > Not exactly the 10,000,000, as it is only the possible highest number, but I
> > was able to observe clear exponential increase of the number of the objects
> > using slabtop.  Before the start of the problematic workload, the number of
> > objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> > increase
> > to 1,136,576.
> >
> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
> > after:  1136576 1136576 100%0.06K  17759   64 71036K kmalloc-64
> >
> 
> Great, thanks.
> 
> How recent is the kernel you are running for your experiment ?

It's based on 5.4.35.

> 
> Let's make sure the bug is not in RCU.

One thing I can say for now is that the grace period does eventually pass.  I
modified the benchmark to repeat only 5,000 times instead of 10,000, so the test
runs without OOM but still shows easily observable memory pressure.  As soon as
the benchmark finished, the memory was freed.

If you need more tests, please let me know.


Thanks,
SeongJae Park

> 
> After Al changes, RCU got slightly better under stress.
> 


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet
On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>
> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  wrote:
>
> >
> >
> > On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> > > wrote:
> > >
> >
> > >> Why do we have 10,000,000 objects around ? Could this be because of
> > >> some RCU problem ?
> > >
> > > Mainly because of a long RCU grace period, as you guess.  I have no idea 
> > > how
> > > the grace period became so long in this case.
> > >
> > > As my test machine was a virtual machine instance, I guess RCU readers
> > > preemption[1] like problem might affected this.
> > >
> > > [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > >
> > >>
> > >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> > >
> > > Yes, both the old kernel that prior to Al's patches and the recent kernel
> > > reverting the Al's patches didn't reproduce the problem.
> > >
> >
> > I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
> > slab caches ?
> >
> > TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
> > not the struct socket_wq
> > object that was allocated in sock_alloc_inode() before Al patches.
> >
> > These objects should be visible in kmalloc-64 kmem cache.
>
> Not exactly the 10,000,000, as it is only the possible highest number, but I
> was able to observe clear exponential increase of the number of the objects
> using slabtop.  Before the start of the problematic workload, the number of
> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> increase
> to 1,136,576.
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> before:   5760   5088  88%0.06K 90   64   360K kmalloc-64
> after:  1136576 1136576 100%0.06K  17759   64 71036K kmalloc-64
>

Great, thanks.

How recent is the kernel you are running for your experiment ?

Let's make sure the bug is not in RCU.

After Al's changes, RCU got slightly better under stress.


Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  wrote:

> 
> 
> On 5/5/20 8:07 AM, SeongJae Park wrote:
> > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  wrote:
> > 
> 
> >> Why do we have 10,000,000 objects around ? Could this be because of
> >> some RCU problem ?
> > 
> > Mainly because of a long RCU grace period, as you guess.  I have no idea how
> > the grace period became so long in this case.
> > 
> > As my test machine was a virtual machine instance, I guess RCU readers
> > preemption[1] like problem might affected this.
> > 
> > [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > 
> >>
> >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> > 
> > Yes, both the old kernel that prior to Al's patches and the recent kernel
> > reverting the Al's patches didn't reproduce the problem.
> >
> 
> I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab 
> caches ?
> 
> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not 
> the struct socket_wq
> object that was allocated in sock_alloc_inode() before Al patches.
> 
> These objects should be visible in kmalloc-64 kmem cache.

Not exactly 10,000,000, as that is only the theoretical maximum, but I was able
to observe a clear, steep increase in the number of objects using slabtop.
Before the start of the problematic workload, the number of 'kmalloc-64' objects
was 5,760, and I observed it grow to 1,136,576.

           OBJS  ACTIVE   USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
before:    5760    5088   88%     0.06K     90        64        360K  kmalloc-64
after:  1136576 1136576  100%     0.06K  17759        64      71036K  kmalloc-64


Thanks,
SeongJae Park



Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 8:07 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  wrote:
> 

>> Why do we have 10,000,000 objects around ? Could this be because of
>> some RCU problem ?
> 
> Mainly because of a long RCU grace period, as you guess.  I have no idea how
> the grace period became so long in this case.
> 
> As my test machine was a virtual machine instance, I guess RCU readers
> preemption[1] like problem might affected this.
> 
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> 
>>
>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> 
> Yes, both the old kernel that prior to Al's patches and the recent kernel
> reverting the Al's patches didn't reproduce the problem.
>

I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab 
caches ?

TCP sockets use the (very complex, error-prone) SLAB_TYPESAFE_BY_RCU, but not
the struct socket_wq object that was allocated in sock_alloc_inode() before
Al's patches.

These objects should be visible in the kmalloc-64 kmem cache.



Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  wrote:

> On Tue, May 5, 2020 at 4:54 AM SeongJae Park  wrote:
> >
> > CC-ing sta...@vger.kernel.org and adding some more explanations.
> >
> > On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park  wrote:
> >
> > > From: SeongJae Park 
> > >
> > > The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> > > deallocation of 'socket_alloc' to be done asynchronously using RCU, as
> > > same to 'sock.wq'.  And the following commit 333f7909a857 ("coallocate
> > > socket_sq with socket itself") made those to have same life cycle.
> > >
> > > The changes made the code much more simple, but also made 'socket_alloc'
> > > live longer than before.  For the reason, user programs intensively
> > > repeating allocations and deallocations of sockets could cause memory
> > > pressure on recent kernels.
> >
> > I found this problem on a production virtual machine utilizing 4GB memory 
> > while
> > running lebench[1].  The 'poll big' test of lebench opens 1000 sockets, 
> > polls
> > and closes those.  This test is repeated 10,000 times.  Therefore it should
> > consume only 1000 'socket_alloc' objects at once.  As size of socket_alloc 
> > is
> > about 800 Bytes, it's only 800 KiB.  However, on the recent kernels, it 
> > could
> > consume up to 10,000,000 objects (about 8 GiB).  On the test machine, I
> > confirmed it consuming about 4GB of the system memory and results in OOM.
> >
> > [1] https://github.com/LinuxPerfStudy/LEBench
> 
> To be fair, I have not backported Al patches to Google production
> kernels, nor I have tried this benchmark.
> 
> Why do we have 10,000,000 objects around ? Could this be because of
> some RCU problem ?

Mainly because of a long RCU grace period, as you guess.  I have no idea how
the grace period became so long in this case.

As my test machine was a virtual machine instance, I guess an RCU reader
preemption problem[1] might have affected this.

[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf

> 
> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?

To confirm: neither the old kernel from before Al's patches nor the recent
kernel with Al's patches reverted reproduced the problem.


Thanks,
SeongJae Park

> 
> Thanks.
> 
> >
> > >
> > > To avoid the problem, this commit reverts the changes.
> >
> > I also tried to make fixup rather than reverts, but I couldn't easily find
> > simple fixup.  As the commits 6d7855c54e1e and 333f7909a857 were for code
> > refactoring rather than performance optimization, I thought introducing 
> > complex
> > fixup for this problem would make no sense.  Meanwhile, the memory pressure
> > regression could affect real machines.  To this end, I decided to quickly
> > revert the commits first and consider better refactoring later.
> >
> >
> > Thanks,
> > SeongJae Park
> >
> > >
> > > SeongJae Park (2):
> > >   Revert "coallocate socket_wq with socket itself"
> > >   Revert "sockfs: switch to ->free_inode()"
> > >
> > >  drivers/net/tap.c  |  5 +++--
> > >  drivers/net/tun.c  |  8 +---
> > >  include/linux/if_tap.h |  1 +
> > >  include/linux/net.h|  4 ++--
> > >  include/net/sock.h |  4 ++--
> > >  net/core/sock.c|  2 +-
> > >  net/socket.c   | 23 ---
> > >  7 files changed, 30 insertions(+), 17 deletions(-)
> > >
> > > --
> > > 2.17.1


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet
On Tue, May 5, 2020 at 4:54 AM SeongJae Park  wrote:
>
> CC-ing sta...@vger.kernel.org and adding some more explanations.
>
> On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park  wrote:
>
> > From: SeongJae Park 
> >
> > The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> > deallocation of 'socket_alloc' to be done asynchronously using RCU, as
> > same to 'sock.wq'.  And the following commit 333f7909a857 ("coallocate
> > socket_sq with socket itself") made those to have same life cycle.
> >
> > The changes made the code much more simple, but also made 'socket_alloc'
> > live longer than before.  For the reason, user programs intensively
> > repeating allocations and deallocations of sockets could cause memory
> > pressure on recent kernels.
>
> I found this problem on a production virtual machine utilizing 4GB memory 
> while
> running lebench[1].  The 'poll big' test of lebench opens 1000 sockets, polls
> and closes those.  This test is repeated 10,000 times.  Therefore it should
> consume only 1000 'socket_alloc' objects at once.  As size of socket_alloc is
> about 800 Bytes, it's only 800 KiB.  However, on the recent kernels, it could
> consume up to 10,000,000 objects (about 8 GiB).  On the test machine, I
> confirmed it consuming about 4GB of the system memory and results in OOM.
>
> [1] https://github.com/LinuxPerfStudy/LEBench

To be fair, I have not backported Al's patches to Google production
kernels, nor have I tried this benchmark.

Why do we have 10,000,000 objects around ? Could this be because of
some RCU problem ?

Once Al patches reverted, do you have 10,000,000 sock_alloc around ?

Thanks.

>
> >
> > To avoid the problem, this commit reverts the changes.
>
> I also tried to make fixup rather than reverts, but I couldn't easily find
> simple fixup.  As the commits 6d7855c54e1e and 333f7909a857 were for code
> refactoring rather than performance optimization, I thought introducing 
> complex
> fixup for this problem would make no sense.  Meanwhile, the memory pressure
> regression could affect real machines.  To this end, I decided to quickly
> revert the commits first and consider better refactoring later.
>
>
> Thanks,
> SeongJae Park
>
> >
> > SeongJae Park (2):
> >   Revert "coallocate socket_wq with socket itself"
> >   Revert "sockfs: switch to ->free_inode()"
> >
> >  drivers/net/tap.c  |  5 +++--
> >  drivers/net/tun.c  |  8 +---
> >  include/linux/if_tap.h |  1 +
> >  include/linux/net.h|  4 ++--
> >  include/net/sock.h |  4 ++--
> >  net/core/sock.c|  2 +-
> >  net/socket.c   | 23 ---
> >  7 files changed, 30 insertions(+), 17 deletions(-)
> >
> > --
> > 2.17.1


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Nuernberger, Stefan
On Tue, 2020-05-05 at 13:54 +0200, SeongJae Park wrote:
> CC-ing sta...@vger.kernel.org and adding some more explanations.
> 
> On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park 
> wrote:
> 
> > 
> > From: SeongJae Park 
> > 
> > The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made
> > the
> > deallocation of 'socket_alloc' to be done asynchronously using RCU,
> > as
> > same to 'sock.wq'.  And the following commit 333f7909a857
> > ("coallocate
> > socket_sq with socket itself") made those to have same life cycle.
> > 
> > The changes made the code much more simple, but also made
> > 'socket_alloc'
> > live longer than before.  For the reason, user programs intensively
> > repeating allocations and deallocations of sockets could cause
> > memory
> > pressure on recent kernels.
> I found this problem on a production virtual machine utilizing 4GB
> memory while
> running lebench[1].  The 'poll big' test of lebench opens 1000
> sockets, polls
> and closes those.  This test is repeated 10,000 times.  Therefore it
> should
> consume only 1000 'socket_alloc' objects at once.  As size of
> socket_alloc is
> about 800 Bytes, it's only 800 KiB.  However, on the recent kernels,
> it could
> consume up to 10,000,000 objects (about 8 GiB).  On the test machine,
> I
> confirmed it consuming about 4GB of the system memory and results in
> OOM.
> 
> [1] https://github.com/LinuxPerfStudy/LEBench
> 
> > 
> > 
> > To avoid the problem, this commit reverts the changes.
> I also tried to make fixup rather than reverts, but I couldn't easily
> find
> simple fixup.  As the commits 6d7855c54e1e and 333f7909a857 were for
> code
> refactoring rather than performance optimization, I thought
> introducing complex
> fixup for this problem would make no sense.  Meanwhile, the memory
> pressure
> regression could affect real machines.  To this end, I decided to
> quickly
> revert the commits first and consider better refactoring later.
> 

While lebench might be exercising a rather pathological case, the
increase in memory pressure is real. I am concerned that the OOM killer
is actually engaging and killing off processes when there are lots of
resources already marked for release. This might be true for other
lazy/delayed resource deallocation, too. The deallocation has simply become
too lazy at the moment.

So for both reverts:

Reviewed-by: Stefan Nuernberger 

> 
> Thanks,
> SeongJae Park
> 
> > 
> > 
> > SeongJae Park (2):
> >   Revert "coallocate socket_wq with socket itself"
> >   Revert "sockfs: switch to ->free_inode()"
> > 
> >  drivers/net/tap.c  |  5 +++--
> >  drivers/net/tun.c  |  8 +---
> >  include/linux/if_tap.h |  1 +
> >  include/linux/net.h|  4 ++--
> >  include/net/sock.h |  4 ++--
> >  net/core/sock.c|  2 +-
> >  net/socket.c   | 23 ---
> >  7 files changed, 30 insertions(+), 17 deletions(-)
> > 







Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
CC-ing sta...@vger.kernel.org and adding some more explanations.

On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park  wrote:

> From: SeongJae Park 
> 
> The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> deallocation of 'socket_alloc' to be done asynchronously using RCU, as
> same to 'sock.wq'.  And the following commit 333f7909a857 ("coallocate
> socket_sq with socket itself") made those to have same life cycle.
> 
> The changes made the code much more simple, but also made 'socket_alloc'
> live longer than before.  For the reason, user programs intensively
> repeating allocations and deallocations of sockets could cause memory
> pressure on recent kernels.

I found this problem on a production virtual machine with 4 GB of memory while
running lebench[1].  The 'poll big' test of lebench opens 1000 sockets, polls
them, and closes them.  This test is repeated 10,000 times.  Therefore it should
consume only 1000 'socket_alloc' objects at once.  As the size of socket_alloc is
about 800 bytes, that is only about 800 KiB.  However, on recent kernels, it can
accumulate up to 10,000,000 objects (about 8 GiB).  On the test machine, I
confirmed it consuming about 4 GB of system memory and resulting in OOM.

[1] https://github.com/LinuxPerfStudy/LEBench
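
(As an illustration of the access pattern described above, here is a minimal
userspace sketch; it is not the actual LEBench code.  Each iteration opens
1000 TCP sockets, polls them once, and closes them, so every iteration
retires 1000 socket inodes whose freeing is RCU-deferred on the affected
kernels.)

#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSOCKS	1000
#define NITERS	10000

int main(void)
{
	static struct pollfd fds[NSOCKS];
	int it, i;

	for (it = 0; it < NITERS; it++) {
		for (i = 0; i < NSOCKS; i++) {
			fds[i].fd = socket(AF_INET, SOCK_STREAM, 0);
			fds[i].events = POLLIN;
			if (fds[i].fd < 0) {
				perror("socket");
				exit(1);
			}
		}
		/* timeout 0: just exercise the poll path once */
		poll(fds, NSOCKS, 0);
		for (i = 0; i < NSOCKS; i++)
			close(fds[i].fd);	/* deferred 'socket_alloc' frees pile up here */
	}
	return 0;
}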

> 
> To avoid the problem, this commit reverts the changes.

I also tried to write a fixup rather than reverts, but I couldn't easily find a
simple one.  As the commits 6d7855c54e1e and 333f7909a857 were for code
refactoring rather than performance optimization, I thought introducing a complex
fixup for this problem would make no sense.  Meanwhile, the memory pressure
regression could affect real machines.  To this end, I decided to quickly
revert the commits first and consider better refactoring later.


Thanks,
SeongJae Park

> 
> SeongJae Park (2):
>   Revert "coallocate socket_wq with socket itself"
>   Revert "sockfs: switch to ->free_inode()"
> 
>  drivers/net/tap.c  |  5 +++--
>  drivers/net/tun.c  |  8 +---
>  include/linux/if_tap.h |  1 +
>  include/linux/net.h|  4 ++--
>  include/net/sock.h |  4 ++--
>  net/core/sock.c|  2 +-
>  net/socket.c   | 23 ---
>  7 files changed, 30 insertions(+), 17 deletions(-)
> 
> -- 
> 2.17.1


[PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread SeongJae Park
From: SeongJae Park 

The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
deallocation of 'socket_alloc' be done asynchronously using RCU, the same
as for 'sock.wq'.  The following commit 333f7909a857 ("coallocate
socket_wq with socket itself") then gave the two the same life cycle.

The changes made the code much simpler, but also made 'socket_alloc'
live longer than before.  For this reason, user programs that intensively
repeat allocation and deallocation of sockets can cause memory
pressure on recent kernels.

To avoid the problem, this commit reverts the changes.

SeongJae Park (2):
  Revert "coallocate socket_wq with socket itself"
  Revert "sockfs: switch to ->free_inode()"

 drivers/net/tap.c  |  5 +++--
 drivers/net/tun.c  |  8 +---
 include/linux/if_tap.h |  1 +
 include/linux/net.h|  4 ++--
 include/net/sock.h |  4 ++--
 net/core/sock.c|  2 +-
 net/socket.c   | 23 ---
 7 files changed, 30 insertions(+), 17 deletions(-)

-- 
2.17.1