Re: [PATCH v2] fs: Do not check if there is a fsnotify watcher on pseudo inodes

2020-06-29 Thread Eric Dumazet



On 6/16/20 12:47 AM, Jan Kara wrote:
> On Mon 15-06-20 19:26:38, Amir Goldstein wrote:
>>> This patch changes alloc_file_pseudo() to always opt out of fsnotify by
>>> setting FMODE_NONOTIFY flag so that no check is made for fsnotify watchers
>>> on pseudo files. This should be safe as the underlying helper for the
>>> dentry is d_alloc_pseudo which explicitly states that no lookups are ever
>>> performed meaning that fanotify should have nothing useful to attach to.
>>>
>>> The test motivating this was "perf bench sched messaging --pipe". On
>>> a single-socket machine using threads the difference of the patch was
>>> as follows.
>>>
>>>                     5.7.0             5.7.0
>>>                   vanilla   nofsnotify-v1r1
>>> Amean 1   1.3837 (   0.00%)  1.3547 (   2.10%)
>>> Amean 3   3.7360 (   0.00%)  3.6543 (   2.19%)
>>> Amean 5   5.8130 (   0.00%)  5.7233 *   1.54%*
>>> Amean 7   8.1490 (   0.00%)  7.9730 *   2.16%*
>>> Amean 12 14.6843 (   0.00%) 14.1820 (   3.42%)
>>> Amean 18 21.8840 (   0.00%) 21.7460 (   0.63%)
>>> Amean 24 28.8697 (   0.00%) 29.1680 (  -1.03%)
>>> Amean 30 36.0787 (   0.00%) 35.2640 *   2.26%*
>>> Amean 32 38.0527 (   0.00%) 38.1223 (  -0.18%)
>>>
>>> The difference is small, but in some cases it's outside the noise, so
>>> while marginal there is still some small benefit to ignoring fsnotify
>>> for files allocated via alloc_file_pseudo().
>>>
>>> Signed-off-by: Mel Gorman 
>>
>> Reviewed-by: Amir Goldstein 
> 
> Thanks for the patch Mel and for review Amir! I've added the patch to my
> tree with small amendments to the changelog.
> 
>   Honza
> 

Note this patch broke some user programs (one instance being packetdrill).

A typical legacy packetdrill script has:

// Create a socket and set it to non-blocking.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0


But after this change, fcntl(3, F_GETFL) returns 0x402.

FMODE_NONOTIFY was never meant to be visible to user space (otherwise
there would be an O_NONOTIFY), right?
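For illustration, a minimal userspace check in the spirit of the
packetdrill script above (a hypothetical program, not from the report)
that exposes the leaked bit:

/* On a fresh socket, F_GETFL should report only the access mode
 * (O_RDWR == 0x2); with this regression it reported 0x402.
 */
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	int flags = fcntl(fd, F_GETFL);

	if (flags != O_RDWR)
		fprintf(stderr, "unexpected F_GETFL value: %#x\n", flags);
	return 0;
}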








Re: [PATCH] [net/sched] Fix null pointer deref skb in tc_ctl_action

2020-06-17 Thread Eric Dumazet



On 6/17/20 6:43 PM, Gaurav Singh wrote:
> Add null check for skb
> 

Bad choice really.

You have to really understand code intent before trying to fix it.

> Signed-off-by: Gaurav Singh 
> ---
>  net/sched/act_api.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 8ac7eb0a8309..fd584821d75a 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1473,9 +1473,12 @@ static const struct nla_policy tcaa_policy[TCA_ROOT_MAX + 1] = {
>  static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
>struct netlink_ext_ack *extack)
>  {
> + if (!skb)
> + return 0;


We do not allow this; it triggers:

warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]

> +
>   struct net *net = sock_net(skb->sk);
>   struct nlattr *tca[TCA_ROOT_MAX + 1];
> - u32 portid = skb ? NETLINK_CB(skb).portid : 0;
> + u32 portid = NETLINK_CB(skb).portid;
>   int ret = 0, ovr = 0;
>  
>   if ((n->nlmsg_type != RTM_GETACTION) &&
> 

Please compile your patches; do not expect us to do it for you.
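For reference, a minimal (hypothetical) illustration of the rule behind
that warning; the kernel builds with C90 declaration placement:

/* Rejected: a declaration follows a statement. */
int bad(const int *p)
{
	if (!p)
		return 0;
	int v = *p;	/* -Wdeclaration-after-statement fires here */
	return v;
}

/* Accepted: all declarations precede the first statement. */
int good(const int *p)
{
	int v;

	if (!p)
		return 0;
	v = *p;
	return v;
}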



Re: [PATCH] [net/sched]: Remove redundant condition in qdisc_graft

2020-06-17 Thread Eric Dumazet



On 6/17/20 6:23 PM, Gaurav Singh wrote:
> Signed-off-by: Gaurav Singh 

Two Signed-off-by tags?

> 
> Fix build errors

Your patch does not fix a build error.

> 
> Signed-off-by: Gaurav Singh 
> ---
>  net/sched/sch_api.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 9a3449b56bd6..be93ebcdb18d 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1094,7 +1094,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  
>   /* Only support running class lockless if parent is lockless */
>   if (new && (new->flags & TCQ_F_NOLOCK) &&
> - parent && !(parent->flags & TCQ_F_NOLOCK))
> + !(parent->flags & TCQ_F_NOLOCK))
>   qdisc_clear_nolock(new);
>  
>   if (!cops || !cops->graft)
> 


Re: [PATCH] e1000e: add ifdef to avoid dead code

2020-06-16 Thread Eric Dumazet



On 6/13/20 11:11 PM, Greg Thelen wrote:
> Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME
> systems") added e1000e_check_me() but it's only called from
> CONFIG_PM_SLEEP protected code.  Thus builds without CONFIG_PM_SLEEP
> see:
>   drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' defined but not used [-Wunused-function]
> 
> Add CONFIG_PM_SLEEP ifdef guard to avoid dead code.
> 
> Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems")
> Signed-off-by: Greg Thelen 
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Eric Dumazet 

Thanks Greg
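For reference, the shape of the fix (a sketch; the exact
e1000e_check_me() signature is assumed from the commit description and
its body is elided):

#include <linux/types.h>

#ifdef CONFIG_PM_SLEEP
/* Only called from the suspend/resume paths, so guard it the same way
 * to avoid -Wunused-function on !CONFIG_PM_SLEEP builds.
 */
static bool e1000e_check_me(u16 device_id)
{
	/* ME detection logic elided in this sketch */
	return false;
}
#endif /* CONFIG_PM_SLEEP */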


Re: net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in function 'fq_dump'

2020-06-15 Thread Eric Dumazet
On Mon, Jun 15, 2020 at 11:07 AM Nick Desaulniers
 wrote:
>
> On Mon, Jun 15, 2020 at 10:59 AM 'Eric Dumazet' via Clang Built Linux
>  wrote:
> >
> > On Mon, Jun 15, 2020 at 10:54 AM Eric Dumazet  wrote:
> > >
> > > On Mon, Jun 15, 2020 at 10:44 AM Nick Desaulniers
> > >  wrote:
> > > >
> > > > On Mon, Jun 15, 2020 at 9:17 AM 'Eric Dumazet' via Clang Built Linux
> > > >  wrote:
> > > > >
> > > > > On Sun, Jun 14, 2020 at 11:26 PM kernel test robot  
> > > > > wrote:
> > > > > >
> > > > > > tree:   
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> > > > > > master
> > > > > > head:   96144c58abe7ff767e754b5b80995f7b8846d49b
> > > > > > commit: 39d010504e6b4485d7ceee167743620dd33f4417 net_sched: sch_fq: 
> > > > > > add horizon attribute
> > > > > > date:   6 weeks ago
> > > > > > :: branch date: 3 hours ago
> > > > > > :: commit date: 6 weeks ago
> > > > > > config: arm-randconfig-r006-20200614 (attached as .config)
> > > > > > compiler: clang version 11.0.0 
> > > > > > (https://github.com/llvm/llvm-project 
> > > > > > c669a1ed6386d57a75a602b53266466dae1e1d84)
> > > > > > reproduce (this is a W=1 build):
> > > > > > wget 
> > > > > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross
> > > > > >  -O ~/bin/make.cross
> > > > > > chmod +x ~/bin/make.cross
> > > > > > # install arm cross compiling tool for clang build
> > > > > > # apt-get install binutils-arm-linux-gnueabi
> > > > > > git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> > > > > > # save the attached .config to linux build tree
> > > > > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> > > > > > ARCH=arm
> > > > > >
> > > > > > If you fix the issue, kindly add following tag as appropriate
> > > > > > Reported-by: kernel test robot 
> > > > > >
> > > > > > All warnings (new ones prefixed by >>, old ones prefixed by <<):
> > > > > >
> > > > > > >> net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 
> > > > > > >> bytes in function 'fq_dump' [-Wframe-larger-than=]
> > > > > > static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
> > > > > > ^
> > > > >
> > > > >
> > > > > This looks like a bug in CLANG on ARM, there is no way fq_dump() could
> > > > > consume so much stack space.
> > > >
> > > > You can use
> > > > https://github.com/ClangBuiltLinux/frame-larger-than
> > > > to help debug these.  You might be surprised who's doing larger stack
> > > > allocations than expected.
> > >
> > > What is wrong with scripts/checkstack.pl  ?
> >
> > After using CLANG on x86 I get :
> >
> > $ objdump -d net/sched/sch_fq.o | scripts/checkstack.pl
> > 0x1136 fq_change [sch_fq.o]: 192
> > 0x0016 __direct_call_Qdisc_enqueue1 [sch_fq.o]: 112
> > 0x19e0 fq_dump_stats [sch_fq.o]: 112
> > 0x1b43 fq_dump_stats [sch_fq.o]: 112
> >
> > So CLANG on x86 seems fine.
> >
> > I said: "this looks like a bug in CLANG on ARM".
>
> This is a randconfig, so some option is turning on something that's
> causing excessive stack usage with your patch on ARM.  Building who
> knows what config with x86 and claiming it works doesn't really cut
> it.  You looked at fq_dump and claimed it should be fine.  Did you
> verify each called function that was inlined?

Nick, my patch added exactly _one_ u64 variable on the stack.
That's a whopping 8 bytes, right?

We should be able to use as many nla_put_u32() calls as we want in a
function without changing the stack size.

If CLANG cannot do that, then this is a CLANG problem. I have no
intention of fixing it; I hope this is very clear ;)
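To make that concrete, a sketch of the pattern in question (the cfg
struct and attribute numbers are hypothetical): each nla_put_u32()
receives its value as an argument, so chaining many of them should not
cost one long-lived stack slot per call.

#include <asm/div64.h>
#include <linux/time64.h>
#include <net/netlink.h>

struct cfg {
	u32 a, b;
	u64 horizon;
};

static int dump_sketch(struct sk_buff *skb, const struct cfg *c)
{
	u64 horizon = c->horizon;	/* the single 8-byte local at issue */

	do_div(horizon, NSEC_PER_USEC);
	/* A sane compiler keeps this frame near sizeof(locals),
	 * nowhere near 1400 bytes.
	 */
	if (nla_put_u32(skb, 1, c->a) ||
	    nla_put_u32(skb, 2, c->b) ||
	    nla_put_u32(skb, 3, (u32)horizon))
		return -EMSGSIZE;
	return 0;
}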

>
>
> And the issue with checkstack is it tells you the depth, but not *what
> variables* and from which inlined function may be the cause.


OK. Thank you.


>
>
> >
> >
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > 

Re: net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in function 'fq_dump'

2020-06-15 Thread Eric Dumazet
On Mon, Jun 15, 2020 at 10:54 AM Eric Dumazet  wrote:
>
> On Mon, Jun 15, 2020 at 10:44 AM Nick Desaulniers
>  wrote:
> >
> > On Mon, Jun 15, 2020 at 9:17 AM 'Eric Dumazet' via Clang Built Linux
> >  wrote:
> > >
> > > On Sun, Jun 14, 2020 at 11:26 PM kernel test robot  wrote:
> > > >
> > > > tree:   
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> > > > master
> > > > head:   96144c58abe7ff767e754b5b80995f7b8846d49b
> > > > commit: 39d010504e6b4485d7ceee167743620dd33f4417 net_sched: sch_fq: add 
> > > > horizon attribute
> > > > date:   6 weeks ago
> > > > :: branch date: 3 hours ago
> > > > :: commit date: 6 weeks ago
> > > > config: arm-randconfig-r006-20200614 (attached as .config)
> > > > compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> > > > c669a1ed6386d57a75a602b53266466dae1e1d84)
> > > > reproduce (this is a W=1 build):
> > > > wget 
> > > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross
> > > >  -O ~/bin/make.cross
> > > > chmod +x ~/bin/make.cross
> > > > # install arm cross compiling tool for clang build
> > > > # apt-get install binutils-arm-linux-gnueabi
> > > > git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> > > > # save the attached .config to linux build tree
> > > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> > > > ARCH=arm
> > > >
> > > > If you fix the issue, kindly add following tag as appropriate
> > > > Reported-by: kernel test robot 
> > > >
> > > > All warnings (new ones prefixed by >>, old ones prefixed by <<):
> > > >
> > > > >> net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes 
> > > > >> in function 'fq_dump' [-Wframe-larger-than=]
> > > > static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
> > > > ^
> > >
> > >
> > > This looks like a bug in CLANG on ARM, there is no way fq_dump() could
> > > consume so much stack space.
> >
> > You can use
> > https://github.com/ClangBuiltLinux/frame-larger-than
> > to help debug these.  You might be surprised who's doing larger stack
> > allocations than expected.
>
> What is wrong with scripts/checkstack.pl  ?

After using CLANG on x86 I get :

$ objdump -d net/sched/sch_fq.o | scripts/checkstack.pl
0x1136 fq_change [sch_fq.o]: 192
0x0016 __direct_call_Qdisc_enqueue1 [sch_fq.o]: 112
0x19e0 fq_dump_stats [sch_fq.o]: 112
0x1b43 fq_dump_stats [sch_fq.o]: 112

So CLANG on x86 seems fine.

I said: "this looks like a bug in CLANG on ARM".



>
> >
> > >
> > >
> > >
> > > > 1 warning generated.
> > > >
> > > > # 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39d010504e6b4485d7ceee167743620dd33f4417
> > > > git remote add linus 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > > git remote update linus
> > > > git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> > > > vim +/fq_dump +966 net/sched/sch_fq.c
> > > >
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   965
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29  @966  static int 
> > > > fq_dump(struct Qdisc *sch, struct sk_buff *skb)
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   967  {
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   968  struct 
> > > > fq_sched_data *q = qdisc_priv(sch);
> > > > 48872c11b77271 Eric Dumazet   2018-11-11   969  u64 
> > > > ce_threshold = q->ce_threshold;
> > > > 39d010504e6b44 Eric Dumazet   2020-05-01   970  u64 horizon = 
> > > > q->horizon;
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   971  struct nlattr 
> > > > *opts;
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   972
> > > > ae0be8de9a53cd Michal Kubecek 2019-04-26   973  opts = 
> > > > nla_nest_start_noflag(skb, TCA_OPTIONS);
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   974  if (opts == 
> > > > NULL)
> > > > afe4fd062416b1 Eric Dumazet   2013-08-29   975  goto 
> > > > nla_put_

Re: net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in function 'fq_dump'

2020-06-15 Thread Eric Dumazet
On Mon, Jun 15, 2020 at 10:44 AM Nick Desaulniers
 wrote:
>
> On Mon, Jun 15, 2020 at 9:17 AM 'Eric Dumazet' via Clang Built Linux
>  wrote:
> >
> > On Sun, Jun 14, 2020 at 11:26 PM kernel test robot  wrote:
> > >
> > > tree:   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > > head:   96144c58abe7ff767e754b5b80995f7b8846d49b
> > > commit: 39d010504e6b4485d7ceee167743620dd33f4417 net_sched: sch_fq: add 
> > > horizon attribute
> > > date:   6 weeks ago
> > > :: branch date: 3 hours ago
> > > :: commit date: 6 weeks ago
> > > config: arm-randconfig-r006-20200614 (attached as .config)
> > > compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> > > c669a1ed6386d57a75a602b53266466dae1e1d84)
> > > reproduce (this is a W=1 build):
> > > wget 
> > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross 
> > > -O ~/bin/make.cross
> > > chmod +x ~/bin/make.cross
> > > # install arm cross compiling tool for clang build
> > > # apt-get install binutils-arm-linux-gnueabi
> > > git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> > > # save the attached .config to linux build tree
> > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross 
> > > ARCH=arm
> > >
> > > If you fix the issue, kindly add following tag as appropriate
> > > Reported-by: kernel test robot 
> > >
> > > All warnings (new ones prefixed by >>, old ones prefixed by <<):
> > >
> > > >> net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in 
> > > >> function 'fq_dump' [-Wframe-larger-than=]
> > > static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
> > > ^
> >
> >
> > This looks like a bug in CLANG on ARM, there is no way fq_dump() could
> > consume so much stack space.
>
> You can use
> https://github.com/ClangBuiltLinux/frame-larger-than
> to help debug these.  You might be surprised who's doing larger stack
> allocations than expected.

What is wrong with scripts/checkstack.pl  ?

>
> >
> >
> >
> > > 1 warning generated.
> > >
> > > # 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39d010504e6b4485d7ceee167743620dd33f4417
> > > git remote add linus 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > > git remote update linus
> > > git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> > > vim +/fq_dump +966 net/sched/sch_fq.c
> > >
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   965
> > > afe4fd062416b1 Eric Dumazet   2013-08-29  @966  static int fq_dump(struct 
> > > Qdisc *sch, struct sk_buff *skb)
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   967  {
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   968  struct 
> > > fq_sched_data *q = qdisc_priv(sch);
> > > 48872c11b77271 Eric Dumazet   2018-11-11   969  u64 ce_threshold 
> > > = q->ce_threshold;
> > > 39d010504e6b44 Eric Dumazet   2020-05-01   970  u64 horizon = 
> > > q->horizon;
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   971  struct nlattr 
> > > *opts;
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   972
> > > ae0be8de9a53cd Michal Kubecek 2019-04-26   973  opts = 
> > > nla_nest_start_noflag(skb, TCA_OPTIONS);
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   974  if (opts == NULL)
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   975  goto 
> > > nla_put_failure;
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   976
> > > 65c5189a2b57b9 Eric Dumazet   2013-11-15   977  /* 
> > > TCA_FQ_FLOW_DEFAULT_RATE is not used anymore */
> > > 65c5189a2b57b9 Eric Dumazet   2013-11-15   978
> > > 48872c11b77271 Eric Dumazet   2018-11-11   979  
> > > do_div(ce_threshold, NSEC_PER_USEC);
> > > 39d010504e6b44 Eric Dumazet   2020-05-01   980  do_div(horizon, 
> > > NSEC_PER_USEC);
> > > 48872c11b77271 Eric Dumazet   2018-11-11   981
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   982  if 
> > > (nla_put_u32(skb, TCA_FQ_PLIMIT, sch->limit) ||
> > > afe4fd062416b1 Eric Dumazet   2013-08-29   983  
> > > nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
> > > af

Re: net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in function 'fq_dump'

2020-06-15 Thread Eric Dumazet
On Sun, Jun 14, 2020 at 11:26 PM kernel test robot  wrote:
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   96144c58abe7ff767e754b5b80995f7b8846d49b
> commit: 39d010504e6b4485d7ceee167743620dd33f4417 net_sched: sch_fq: add 
> horizon attribute
> date:   6 weeks ago
> :: branch date: 3 hours ago
> :: commit date: 6 weeks ago
> config: arm-randconfig-r006-20200614 (attached as .config)
> compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
> c669a1ed6386d57a75a602b53266466dae1e1d84)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install arm cross compiling tool for clang build
> # apt-get install binutils-arm-linux-gnueabi
> git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> >> net/sched/sch_fq.c:966:12: warning: stack frame size of 1400 bytes in 
> >> function 'fq_dump' [-Wframe-larger-than=]
> static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
> ^


This looks like a bug in CLANG on ARM, there is no way fq_dump() could
consume so much stack space.



> 1 warning generated.
>
> # 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39d010504e6b4485d7ceee167743620dd33f4417
> git remote add linus 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git remote update linus
> git checkout 39d010504e6b4485d7ceee167743620dd33f4417
> vim +/fq_dump +966 net/sched/sch_fq.c
>
> afe4fd062416b1 Eric Dumazet   2013-08-29   965
> afe4fd062416b1 Eric Dumazet   2013-08-29  @966  static int fq_dump(struct 
> Qdisc *sch, struct sk_buff *skb)
> afe4fd062416b1 Eric Dumazet   2013-08-29   967  {
> afe4fd062416b1 Eric Dumazet   2013-08-29   968  struct fq_sched_data 
> *q = qdisc_priv(sch);
> 48872c11b77271 Eric Dumazet   2018-11-11   969  u64 ce_threshold = 
> q->ce_threshold;
> 39d010504e6b44 Eric Dumazet   2020-05-01   970  u64 horizon = 
> q->horizon;
> afe4fd062416b1 Eric Dumazet   2013-08-29   971  struct nlattr *opts;
> afe4fd062416b1 Eric Dumazet   2013-08-29   972
> ae0be8de9a53cd Michal Kubecek 2019-04-26   973  opts = 
> nla_nest_start_noflag(skb, TCA_OPTIONS);
> afe4fd062416b1 Eric Dumazet   2013-08-29   974  if (opts == NULL)
> afe4fd062416b1 Eric Dumazet   2013-08-29   975  goto 
> nla_put_failure;
> afe4fd062416b1 Eric Dumazet   2013-08-29   976
> 65c5189a2b57b9 Eric Dumazet   2013-11-15   977  /* 
> TCA_FQ_FLOW_DEFAULT_RATE is not used anymore */
> 65c5189a2b57b9 Eric Dumazet   2013-11-15   978
> 48872c11b77271 Eric Dumazet   2018-11-11   979  do_div(ce_threshold, 
> NSEC_PER_USEC);
> 39d010504e6b44 Eric Dumazet   2020-05-01   980      do_div(horizon, 
> NSEC_PER_USEC);
> 48872c11b77271 Eric Dumazet   2018-11-11   981
> afe4fd062416b1 Eric Dumazet   2013-08-29   982  if (nla_put_u32(skb, 
> TCA_FQ_PLIMIT, sch->limit) ||
> afe4fd062416b1 Eric Dumazet   2013-08-29   983  nla_put_u32(skb, 
> TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
> afe4fd062416b1 Eric Dumazet   2013-08-29   984  nla_put_u32(skb, 
> TCA_FQ_QUANTUM, q->quantum) ||
> afe4fd062416b1 Eric Dumazet   2013-08-29   985  nla_put_u32(skb, 
> TCA_FQ_INITIAL_QUANTUM, q->initial_quantum) ||
> afe4fd062416b1 Eric Dumazet   2013-08-29   986  nla_put_u32(skb, 
> TCA_FQ_RATE_ENABLE, q->rate_enable) ||
> 76a9ebe811fb3d Eric Dumazet   2018-10-15   987  nla_put_u32(skb, 
> TCA_FQ_FLOW_MAX_RATE,
> 76a9ebe811fb3d Eric Dumazet   2018-10-15   988  
> min_t(unsigned long, q->flow_max_rate, ~0U)) ||
> f52ed89971adbe Eric Dumazet   2013-11-15   989  nla_put_u32(skb, 
> TCA_FQ_FLOW_REFILL_DELAY,
> f52ed89971adbe Eric Dumazet   2013-11-15   990  
> jiffies_to_usecs(q->flow_refill_delay)) ||
> 06eb395fa9856b Eric Dumazet   2015-02-04   991  nla_put_u32(skb, 
> TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
> 77879147a3481b Eric Dumazet   2016-09-19   992  nla_put_u32(skb, 
> TCA_FQ_LOW_RATE_THRESHOLD,
> 77879147a3481b Eric Dumazet   2016-09-19   993  
> q->low_rate_threshold) ||
> 48872c11b77271 Eric Dumaz

Re: [PATCH RFC] cred: Add WARN to detect wrong use of get/put_cred

2020-06-12 Thread Eric Dumazet
On Fri, Jun 12, 2020 at 3:28 AM Xiaoming Ni  wrote:
>
> Cred release and usage check code flow:
> 1. put_cred()
> if (atomic_dec_and_test(&(cred)->usage))
> __put_cred(cred);
>
> 2. __put_cred()
> BUG_ON(atomic_read(&cred->usage) != 0);
> call_rcu(&cred->rcu, put_cred_rcu);
>
> 3. put_cred_rcu()
> if (atomic_read(&cred->usage) != 0)
> panic("CRED: put_cred_rcu() sees %p with usage %d\n",
>   cred, atomic_read(&cred->usage));
> kmem_cache_free(cred_jar, cred);
>
> If the panic triggers in put_cred_rcu(), there are two possibilities:
> 1. get_cred() was called after __put_cred(): usage > 0
> 2. put_cred() was called after __put_cred(): usage < 0
> Since put_cred_rcu() runs asynchronously, by the time of the panic we are
> no longer at the first scene, and the panic call stack holds no
> information about the murderer...
>
> So, add WARN() in get_cred()/put_cred(), and pray to catch the murderer
> at the first scene.
>
> Signed-off-by: Xiaoming Ni 
> ---


It seems you reinvented refcount_t?
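For context, a minimal sketch of what refcount_t already provides
(hypothetical object type): it WARNs at the first bad operation, which
is exactly what the proposed WARNs try to reconstruct after the fact.

#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
	refcount_t usage;
};

static void obj_get(struct obj *o)
{
	refcount_inc(&o->usage);	/* WARNs if usage was already 0 */
}

static void obj_put(struct obj *o)
{
	if (refcount_dec_and_test(&o->usage))	/* WARNs on underflow */
		kfree(o);
}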


Re: [PATCH] net/mlx5: Use kfree(ft->g) in arfs_create_groups()

2020-06-05 Thread Eric Dumazet



On 6/5/20 12:22 PM, Denis Efremov wrote:
> Use kfree() instead of kvfree() on ft->g in arfs_create_groups() because
> the memory is allocated with kcalloc().
> 
> Signed-off-by: Denis Efremov 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
> index 014639ea06e3..c4c9d6cda7e6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
> @@ -220,7 +220,7 @@ static int arfs_create_groups(struct mlx5e_flow_table *ft,
>   sizeof(*ft->g), GFP_KERNEL);
>   in = kvzalloc(inlen, GFP_KERNEL);
>   if  (!in || !ft->g) {
> - kvfree(ft->g);
> + kfree(ft->g);
>   kvfree(in);
>   return -ENOMEM;
>   }
> 

This is a slow path; kvfree() is perfectly able to free memory that was
kmalloc()ed.

net-next is closed; can we avoid these patches during the merge window?
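For reference, a sketch of why the original code is already correct
(hypothetical helper): kvfree() accepts memory from either allocator,
so a single error path can cover a kcalloc()ed and a kvzalloc()ed
buffer.

#include <linux/mm.h>
#include <linux/slab.h>

static int alloc_pair(size_t nents, size_t inlen, u32 **gp, void **inp)
{
	u32 *g = kcalloc(nents, sizeof(*g), GFP_KERNEL);
	void *in = kvzalloc(inlen, GFP_KERNEL);

	if (!g || !in) {
		kvfree(g);	/* fine even though g came from kcalloc() */
		kvfree(in);
		return -ENOMEM;
	}
	*gp = g;
	*inp = in;
	return 0;
}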


Re: [net] a6211caa63: dmesg.UBSAN:signed-integer-overflow_in_arch/x86/include/asm/atomic.h

2020-06-05 Thread Eric Dumazet
On Fri, Jun 5, 2020 at 1:10 AM kernel test robot  wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-4.9):
>
> commit: a6211caa634da39d861a47437ffcda8b38ef421b ("net: revert "net: get rid 
> of an signed integer overflow in ip_idents_reserve()"")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: boot
>
> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
>
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
>
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
>

There you go.

We decided this was a bogus report, and that UBSAN requires sane compilers.

Please read the fine comment that was added in this commit and update
your compiler or do not mess with compiler flags.

+   /* If UBSAN reports an error there, please make sure your compiler
+* supports -fno-strict-overflow before reporting it that was a bug
+* in UBSAN, and it has been fixed in GCC-8.
+*/
+   return atomic_add_return(segs + delta, p_id) - segs;

Thanks.


>
> [   35.019158] 
> 
> [   35.019995] UBSAN: signed-integer-overflow in 
> arch/x86/include/asm/atomic.h:167:2
> [   35.020884] -1045826149 + -1341282523 cannot be represented in type 'int'
> [   35.021544] CPU: 0 PID: 350 Comm: systemd-timesyn Tainted: G S 
>5.7.0-rc5-00221-ga6211caa634da #1
> [   35.022550] Call Trace:
> [   35.022812]  dump_stack+0x16/0x26
> [   35.023151]  ubsan_epilogue+0x8/0x40
> [   35.023526]  handle_overflow+0x80/0xa0
> [   35.023925]  ? __ip_append_data+0x8ca/0xdd0
> [   35.024408]  __ubsan_handle_add_overflow+0xa/0x10
> [   35.024872]  ip_idents_reserve+0x79/0x90
> [   35.025263]  __ip_select_ident+0x48/0x70
> [   35.025659]  __ip_make_skb+0x32f/0x410
> [   35.026039]  ip_make_skb+0xa6/0xe0
> [   35.026383]  ? ip_reply_glue_bits+0x50/0x50
> [   35.026770]  ? ip_route_output_key_hash+0xb6/0xe0
> [   35.027221]  udp_sendmsg+0x577/0xba0
> [   35.027551]  ? ip_reply_glue_bits+0x50/0x50
> [   35.027960]  ? lock_release+0x9d/0x260
> [   35.028328]  inet_sendmsg+0x2e/0x50
> [   35.028819]  __sys_sendto+0xe2/0x130
> [   35.029178]  ? lock_acquire+0x92/0x310
> [   35.029552]  ? __might_fault+0x41/0x80
> [   35.029903]  ? find_held_lock+0x2d/0xd0
> [   35.030262]  ? lock_release+0x9d/0x260
> [   35.030620]  __ia32_sys_socketcall+0x141/0x240
> [   35.031064]  do_int80_syscall_32+0x46/0x3d0
> [   35.031470]  entry_INT80_32+0x113/0x113
> [   35.031854] EIP: 0xb7f54a02
> [   35.032133] Code: 95 01 00 05 25 36 02 00 83 ec 14 8d 80 e8 99 ff ff 50 6a 
> 02 e8 1f ff 00 00 c7 04 24 7f 00 00 00 e8 7e 87 01 00 66 90 90 cd 80  8d 
> b6 00 00 00 00 8d bc 27 00 00 00 00 8b 1c 24 c3 8d b6 00 00
> [   35.033938] EAX: ffda EBX: 000b ECX: bfecd7c8 EDX: 
> [   35.034562] ESI: b7cd3000 EDI:  EBP:  ESP: bfecd7bc
> [   35.035199] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0293
> [   35.035865] 
> 
> [  OK  ] Started OpenBSD Secure Shell server.
> [  OK  ] Started LSB: Start and stop bmc-watchdog.
> [  OK  ] Started LSB: Execute the kexec -e command to reboot system.
> [  OK  ] Started Login Service.
>  Starting Preprocess NFS configuration...
> [  OK  ] Reached target Host and Network Name Lookups.
>  Starting LSB: Load kernel image with kexec...
> [  OK  ] Reached target Login Prompts.
> [  OK  ] Started Preprocess NFS configuration.
>  Starting Notify NFS peers of a restart...
>  Starting NFS status monitor for NFSv2/3 locking
> [  OK  ] Started Notify NFS peers of a restart.
> [  OK  ] Started LSB: Load kernel image with kexec.
> [  OK  ] Started NFS status monitor for NFSv2/3 locking..
> [   48.881188] sysrq: Emergency Sync
> [   48.881750] sysrq: Resetting
>
>
> To reproduce:
>
> # build kernel
> cd linux
> cp config-5.7.0-rc5-00221-ga6211caa634da .config
> make HOSTCC=gcc-4.9 CC=gcc-4.9 ARCH=i386 olddefconfig prepare 
> modules_prepare bzImage
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k  job-script # job-script is attached in this 
> email
>
>
>
> Thanks,
> Rong Chen
>


Re: [PATCH v2 4.19] tcp: fix TCP socks unreleased in BBR mode

2020-06-04 Thread Eric Dumazet
On Thu, Jun 4, 2020 at 2:01 AM  wrote:
>
> From: Jason Xing 
>
> When using BBR mode, too many tcp socks cannot be released because of
> duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
> when RTO happens. Therefore, this situation madly increases the slab
> memory and then constantly triggers the OOM until crash.
>
> Besides BBR, any other mode that applies the pacing function could
> trigger what we've discussed above.
>
> Reproduce procedure:
> 0) cat /proc/slabinfo | grep TCP
> 1) switch net.ipv4.tcp_congestion_control to bbr
> 2) using a tool such as wrk to send packets
> 3) using tc to increase the delay and loss to simulate the RTO case.
> 4) cat /proc/slabinfo | grep TCP
> 5) kill the wrk command and observe the number of objects and slabs in
> TCP.
> 6) at last, you could notice that the number would not decrease.
>
> v2: extend the timer which could cover all those related potential risks
> (suggested by Eric Dumazet and Neal Cardwell)
>
> Signed-off-by: Jason Xing 
> Signed-off-by: liweishi 
> Signed-off-by: Shujin Li 

That is not how things work really.

I will submit this properly so that stable teams do not have to guess
how to backport this to various kernels.

The changelog is misleading: this has nothing to do with BBR, and we need to be precise.

Thank you.


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-03 Thread Eric Dumazet
On Wed, Jun 3, 2020 at 5:02 AM Neal Cardwell  wrote:
>
> On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet  wrote:
> >
> > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing  
> > wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm still trying to understand what you're saying before. Would this
> > > be better as following:
> > > 1) discard the tcp_internal_pacing() function.
> > > 2) remove where the tcp_internal_pacing() is called in the
> > > __tcp_transmit_skb() function.
> > >
> > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > should we introduce the tcp_wstamp_ns socket field as commit
> > > (864e5c090749) does?
> > >
> >
> > Please do not top-post on netdev mailing list.
> >
> >
> > I basically suggested double-checking which point in TCP could end up
> > calling tcp_internal_pacing()
> > while the timer was already armed.
> >
> > I guess this is mtu probing.
>
> Perhaps this could also happen from some of the retransmission code
> paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> tcp_retransmit_timer() (RTO) and  tcp_send_loss_probe() TLP? It seems
> they could indirectly cause a call to __tcp_transmit_skb() and thus
> tcp_internal_pacing() without first checking if the pacing timer was
> already armed?

I feared this (see recent commits about very low pacing rates) :/

I am not sure we need to properly fix all these points for old
kernels, since EDT model got rid of these problems.

Maybe we can try to extend the timer.

Something like :


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42052c21b206850594db6751810d8fc72b4..626b9f4f500f7e5270d8d59e6eb16dbfa3efbc7c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)

 static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 {
+   struct tcp_sock *tp = tcp_sk(sk);
+   ktime_t expire, now;
u64 len_ns;
u32 rate;

@@ -977,12 +979,29 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)

len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
-   hrtimer_start(&tcp_sk(sk)->pacing_timer,
- ktime_add_ns(ktime_get(), len_ns),
+
+   now = ktime_get();
+   /* If hrtimer is already armed, then our caller has not
+* used tcp_pacing_check().
+*/
+   if (unlikely(hrtimer_is_queued(&tp->pacing_timer))) {
+   expire = hrtimer_get_softexpires(&tp->pacing_timer);
+   if (ktime_after(expire, now))
+   now = expire;
+   if (hrtimer_try_to_cancel(&tp->pacing_timer) == 1)
+   __sock_put(sk);
+   }
+   hrtimer_start(&tp->pacing_timer, ktime_add_ns(now, len_ns),
  HRTIMER_MODE_ABS_PINNED_SOFT);
sock_hold(sk);
 }

+static bool tcp_pacing_check(const struct sock *sk)
+{
+   return tcp_needs_internal_pacing(sk) &&
+  hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+}
+
 static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
 {
skb->skb_mstamp = tp->tcp_mstamp;
@@ -2117,6 +2136,9 @@ static int tcp_mtu_probe(struct sock *sk)
if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
return -1;

+   if (tcp_pacing_check(sk))
+   return -1;
+
/* We're allowed to probe.  Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2190,11 +2212,6 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
 }

-static bool tcp_pacing_check(const struct sock *sk)
-{
-   return tcp_needs_internal_pacing(sk) &&
-  hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
-}

 /* TCP Small Queues :
  * Control number of packets in qdisc/devices to two packets / or ~1 ms.



>
> neal


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-02 Thread Eric Dumazet
On Tue, Jun 2, 2020 at 10:05 PM Jason Xing  wrote:
>
> Hi Eric,
>
> I'm still trying to understand what you're saying before. Would this
> be better as following:
> 1) discard the tcp_internal_pacing() function.
> 2) remove where the tcp_internal_pacing() is called in the
> __tcp_transmit_skb() function.
>
> If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> should we introduce the tcp_wstamp_ns socket field as commit
> (864e5c090749) does?
>

Please do not top-post on netdev mailing list.


I basically suggested double-checking which point in TCP could end up
calling tcp_internal_pacing()
while the timer was already armed.

I guess this is mtu probing.

Please try the following patch : If we still have another bug, a
WARNING should give us a stack trace.


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42052c21b206850594db6751810d8fc72b4..8f4081b228486305222767d4d118b9b6ed0ffda3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -977,12 +977,26 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)

len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
+
+   /* If hrtimer is already armed, then our caller has not properly
+* used tcp_pacing_check().
+*/
+   if (unlikely(hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))) {
+   WARN_ON_ONCE(1);
+   return;
+   }
hrtimer_start(&tcp_sk(sk)->pacing_timer,
  ktime_add_ns(ktime_get(), len_ns),
  HRTIMER_MODE_ABS_PINNED_SOFT);
sock_hold(sk);
 }

+static bool tcp_pacing_check(const struct sock *sk)
+{
+   return tcp_needs_internal_pacing(sk) &&
+  hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+}
+
 static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
 {
skb->skb_mstamp = tp->tcp_mstamp;
@@ -2117,6 +2131,9 @@ static int tcp_mtu_probe(struct sock *sk)
if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
return -1;

+   if (tcp_pacing_check(sk))
+   return -1;
+
/* We're allowed to probe.  Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2190,11 +2207,6 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
 }

-static bool tcp_pacing_check(const struct sock *sk)
-{
-   return tcp_needs_internal_pacing(sk) &&
-  hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
-}

 /* TCP Small Queues :
  * Control number of packets in qdisc/devices to two packets / or ~1 ms.



> Thanks,
> Jason
>
> On Wed, Jun 3, 2020 at 10:44 AM Eric Dumazet  wrote:
> >
> > On Tue, Jun 2, 2020 at 7:42 PM Jason Xing  wrote:
> > >
> > > I agree with you. The upstream has already dropped and optimized this
> > > part (commit 864e5c090749), so it would not happen like that. However
> > > the old kernels like LTS still have the problem which causes
> > > large-scale crashes on our thousands of machines after running for a
> > > long while. I will send the fix to the correct tree soon :)
> >
> > If you run BBR at scale (thousands of machines), you probably should
> > use sch_fq instead of internal pacing,
> > just saying ;)
> >
> >
> > >
> > > Thanks again,
> > > Jason
> > >
> > > On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet  wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing  
> > > > wrote:
> > > > >
> > > > > Hi Eric,
> > > > >
> > > > > I'm sorry that I didn't write enough clearly. We're running the
> > > > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > > > haunted by such an issue. This patch is high-important, I think. So
> > > > > I'm going to resend this email with the [patch 4.19] on the headline
> > > > > and cc Greg.
> > > >
> > > > Yes, please always give for which tree a patch is meant for.
> > > >
> > > > Problem is that your patch is not correct.
> > > > In these old kernels, tcp_internal_pacing() is called _after_ the
> > > > packet has been sent.
> > > > It is too late to 'give up pacing'
> > > >
> > > > The packet should not have been sent if the pacing timer is queued
> > > > (otherwise this means we do not respect pacing)
> > > >
> > > > So the bug should be caught earlier. check where tcp_pacing_check()
> > > > calls are missing.
> > > >
> > > >
> > > >
> > > > >
> >

Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-02 Thread Eric Dumazet
On Tue, Jun 2, 2020 at 7:42 PM Jason Xing  wrote:
>
> I agree with you. The upstream has already dropped and optimized this
> part (commit 864e5c090749), so it would not happen like that. However
> the old kernels like LTS still have the problem which causes
> large-scale crashes on our thousands of machines after running for a
> long while. I will send the fix to the correct tree soon :)

If you run BBR at scale (thousands of machines), you probably should
use sch_fq instead of internal pacing,
just saying ;)
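For reference, that is typically a one-liner per egress device, e.g.
"tc qdisc replace dev eth0 root fq" (the interface name here is only an
example).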


>
> Thanks again,
> Jason
>
> On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet  wrote:
> >
> > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing  wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm sorry that I didn't write enough clearly. We're running the
> > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > haunted by such an issue. This patch is high-important, I think. So
> > > I'm going to resend this email with the [patch 4.19] on the headline
> > > and cc Greg.
> >
> > Yes, please always give for which tree a patch is meant for.
> >
> > Problem is that your patch is not correct.
> > In these old kernels, tcp_internal_pacing() is called _after_ the
> > packet has been sent.
> > It is too late to 'give up pacing'
> >
> > The packet should not have been sent if the pacing timer is queued
> > (otherwise this means we do not respect pacing)
> >
> > So the bug should be caught earlier. check where tcp_pacing_check()
> > calls are missing.
> >
> >
> >
> > >
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet  wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 1:05 AM  wrote:
> > > > >
> > > > > From: Jason Xing 
> > > > >
> > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > Therefore, this situation could increase the slab memory and then
> > > > > trigger the OOM if the machine has been running for a long time.
> > > > > This issue, however, can happen on some machines running for only
> > > > > a few days.
> > > > >
> > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > pacing_timer is enqueued.
> > > > >
> > > > > Reproduce procedure:
> > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > 2) using a tool such as wrk to send packets
> > > > > 3) using tc to increase the delay in the dev to simulate the busy 
> > > > > case.
> > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > 5) kill the wrk command and observe the number of objects and slabs 
> > > > > in TCP.
> > > > > 6) at last, you could notice that the number would not decrease.
> > > > >
> > > > > Signed-off-by: Jason Xing 
> > > > > Signed-off-by: liweishi 
> > > > > Signed-off-by: Shujin Li 
> > > > > ---
> > > > >  net/ipv4/tcp_output.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > index cc4ba42..5cf63d9 100644
> > > > > --- a/net/ipv4/tcp_output.c
> > > > > +++ b/net/ipv4/tcp_output.c
> > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > u64 len_ns;
> > > > > u32 rate;
> > > > >
> > > > > -   if (!tcp_needs_internal_pacing(sk))
> > > > > +   if (!tcp_needs_internal_pacing(sk) ||
> > > > > +   hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > return;
> > > > > rate = sk->sk_pacing_rate;
> > > > > if (!rate || rate == ~0U)
> > > > > --
> > > > > 1.8.3.1
> > > > >
> > > >
> > > > Hi Jason.
> > > >
> > > > Please do not send patches that do not apply to current upstream trees.
> > > >
> > > > Instead, backport to your kernels the needed fixes.
> > > >
> > > > I suspect that you are not using a pristine linux kernel, but some
> > > > heavily modified one and something went wrong in your backports.
> > > > Do not ask us to spend time finding what went wrong.
> > > >
> > > > Thank you.


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-02 Thread Eric Dumazet
On Tue, Jun 2, 2020 at 6:53 PM Jason Xing  wrote:
>
> Hi Eric,
>
> I'm sorry that I didn't write enough clearly. We're running the
> pristine 4.19.125 linux kernel (the latest LTS version) and have been
> haunted by such an issue. This patch is high-important, I think. So
> I'm going to resend this email with the [patch 4.19] on the headline
> and cc Greg.

Yes, please always give for which tree a patch is meant for.

Problem is that your patch is not correct.
In these old kernels, tcp_internal_pacing() is called _after_ the
packet has been sent.
It is too late to 'give up pacing'

The packet should not have been sent if the pacing timer is queued
(otherwise this means we do not respect pacing)

So the bug should be caught earlier. check where tcp_pacing_check()
calls are missing.



>
>
> Thanks,
> Jason
>
> On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet  wrote:
> >
> > On Tue, Jun 2, 2020 at 1:05 AM  wrote:
> > >
> > > From: Jason Xing 
> > >
> > > TCP socks cannot be released because of the sock_hold() increasing the
> > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > Therefore, this situation could increase the slab memory and then trigger
> > > the OOM if the machine has been running for a long time. This issue,
> > > however, can happen on some machines running for only a few days.
> > >
> > > We add one exception case to avoid unneeded use of sock_hold if the
> > > pacing_timer is enqueued.
> > >
> > > Reproduce procedure:
> > > 0) cat /proc/slabinfo | grep TCP
> > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > 2) using a tool such as wrk to send packets
> > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > 4) cat /proc/slabinfo | grep TCP
> > > 5) kill the wrk command and observe the number of objects and slabs in 
> > > TCP.
> > > 6) at last, you could notice that the number would not decrease.
> > >
> > > Signed-off-by: Jason Xing 
> > > Signed-off-by: liweishi 
> > > Signed-off-by: Shujin Li 
> > > ---
> > >  net/ipv4/tcp_output.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index cc4ba42..5cf63d9 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > u64 len_ns;
> > > u32 rate;
> > >
> > > -   if (!tcp_needs_internal_pacing(sk))
> > > +   if (!tcp_needs_internal_pacing(sk) ||
> > > +   hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > return;
> > > rate = sk->sk_pacing_rate;
> > > if (!rate || rate == ~0U)
> > > --
> > > 1.8.3.1
> > >
> >
> > Hi Jason.
> >
> > Please do not send patches that do not apply to current upstream trees.
> >
> > Instead, backport to your kernels the needed fixes.
> >
> > I suspect that you are not using a pristine linux kernel, but some
> > heavily modified one and something went wrong in your backports.
> > Do not ask us to spend time finding what went wrong.
> >
> > Thank you.


Re: [PATCH] tcp: fix TCP socks unreleased in BBR mode

2020-06-02 Thread Eric Dumazet
On Tue, Jun 2, 2020 at 1:05 AM  wrote:
>
> From: Jason Xing 
>
> TCP socks cannot be released because of the sock_hold() increasing the
> sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> Therefore, this situation could increase the slab memory and then trigger
> the OOM if the machine has been running for a long time. This issue,
> however, can happen on some machines running for only a few days.
>
> We add one exception case to avoid unneeded use of sock_hold if the
> pacing_timer is enqueued.
>
> Reproduce procedure:
> 0) cat /proc/slabinfo | grep TCP
> 1) switch net.ipv4.tcp_congestion_control to bbr
> 2) using a tool such as wrk to send packets
> 3) using tc to increase the delay in the dev to simulate the busy case.
> 4) cat /proc/slabinfo | grep TCP
> 5) kill the wrk command and observe the number of objects and slabs in TCP.
> 6) at last, you could notice that the number would not decrease.
>
> Signed-off-by: Jason Xing 
> Signed-off-by: liweishi 
> Signed-off-by: Shujin Li 
> ---
>  net/ipv4/tcp_output.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cc4ba42..5cf63d9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> u64 len_ns;
> u32 rate;
>
> -   if (!tcp_needs_internal_pacing(sk))
> +   if (!tcp_needs_internal_pacing(sk) ||
> +   hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> return;
> rate = sk->sk_pacing_rate;
> if (!rate || rate == ~0U)
> --
> 1.8.3.1
>

Hi Jason.

Please do not send patches that do not apply to current upstream trees.

Instead, backport to your kernels the needed fixes.

I suspect that you are not using a pristine linux kernel, but some
heavily modified one and something went wrong in your backports.
Do not ask us to spend time finding what went wrong.

Thank you.


Re: general protection fault in inet_unhash

2020-05-29 Thread Eric Dumazet



On 5/29/20 10:32 AM, Eric Dumazet wrote:

> L2TP seems to use sk->sk_node to insert sockets into l2tp_ip_table, _and_ 
> uses l2tp_ip_prot.unhash == inet_unhash
> 
> So if/when BPF_CGROUP_RUN_PROG_INET_SOCK(sk) returns an error and 
> inet_create() calls sk_common_release()
> bad things happen, because inet_unhash() expects a valid hashinfo pointer.
> 
> I guess the following patch should fix this.
> 
> Bug has been there forever, but only BPF_CGROUP_RUN_PROG_INET_SOCK(sk) could 
> trigger it.
>

Official submission : 
https://patchwork.ozlabs.org/project/netdev/patch/20200529180838.107255-1-eduma...@google.com/



Re: general protection fault in inet_unhash

2020-05-29 Thread Eric Dumazet



On 5/28/20 11:32 PM, Andrii Nakryiko wrote:
> On 5/28/20 11:23 PM, Dmitry Vyukov wrote:
>> On Thu, May 28, 2020 at 11:01 PM 'Andrii Nakryiko' via syzkaller-bugs
>>  wrote:
>>>
>>> On 5/28/20 9:44 AM, syzbot wrote:
 Hello,

 syzbot found the following crash on:

 HEAD commit:    dc0f3ed1 net: phy: at803x: add cable diagnostics support f..
 git tree:   net-next
 console output: https://syzkaller.appspot.com/x/log.txt?x=17289cd210
 kernel config:  https://syzkaller.appspot.com/x/.config?x=7e1bc97341edbea6
 dashboard link: https://syzkaller.appspot.com/bug?extid=3610d489778b57cc8031
 compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
 syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f237aa10
 C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1553834a10

 The bug was bisected to:

 commit af6eea57437a830293eab56246b6025cc7d46ee7
 Author: Andrii Nakryiko 
 Date:   Mon Mar 30 02:59:58 2020 +

   bpf: Implement bpf_link-based cgroup BPF program attachment

 bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1173cd7e10
 final crash:    https://syzkaller.appspot.com/x/report.txt?x=1373cd7e10
 console output: https://syzkaller.appspot.com/x/log.txt?x=1573cd7e10

 IMPORTANT: if you fix the bug, please add the following tag to the commit:
 Reported-by: syzbot+3610d489778b57cc8...@syzkaller.appspotmail.com
 Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program 
 attachment")

 general protection fault, probably for non-canonical address 
 0xdc01:  [#1] PREEMPT SMP KASAN
 KASAN: null-ptr-deref in range [0x0008-0x000f]
 CPU: 0 PID: 7063 Comm: syz-executor654 Not tainted 5.7.0-rc6-syzkaller #0
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
 Google 01/01/2011
 RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
>>>
>>> No idea why it was bisected to bpf_link change. It seems completely
>>> struct sock-related. Seems like
>>
>> Hi Andrii,
>>
>> You can always find a detailed explanation of syzbot bisections under
>> the "bisection log" link.
> 
> Right. Sorry, I didn't mean that bisect went wrong or anything like that. I 
> just don't see how my change has anything to do with invalid socket state. As 
> I just replied in another email, this particular repro is using 
> bpf_link_create() for cgroup attachment, which was added in my patch. So 
> running repro before my patch would always fail to attach BPF program, and 
> thus won't be able to repro the issue (because the bug is somewhere in the 
> interaction between BPF program attachment and socket itself). So it will 
> always bisect to my patch :)

L2TP seems to use sk->sk_node to insert sockets into l2tp_ip_table, _and_ uses 
l2tp_ip_prot.unhash == inet_unhash

So if/when BPF_CGROUP_RUN_PROG_INET_SOCK(sk) returns an error and inet_create() 
calls sk_common_release()
bad things happen, because inet_unhash() expects a valid hashinfo pointer.

I guess the following patch should fix this.

Bug has been there forever, but only BPF_CGROUP_RUN_PROG_INET_SOCK(sk) could 
trigger it.

diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 
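As a sketch of the idea only (lock and list primitives assumed, not
necessarily the final patch): l2tp can provide its own .unhash that
removes the socket from l2tp's private table instead of reaching
inet_unhash(), which dereferences a hashinfo pointer L2TP sockets do
not have.

#include <net/sock.h>

extern rwlock_t l2tp_ip_lock;	/* assumed: protects l2tp's socket list */

static void l2tp_ip_unhash(struct sock *sk)
{
	if (sk_unhashed(sk))
		return;
	write_lock_bh(&l2tp_ip_lock);
	sk_del_node_init(sk);
	write_unlock_bh(&l2tp_ip_lock);
}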

Re: general protection fault in inet_unhash

2020-05-28 Thread Eric Dumazet



On 5/28/20 2:01 PM, Andrii Nakryiko wrote:
> On 5/28/20 9:44 AM, syzbot wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    dc0f3ed1 net: phy: at803x: add cable diagnostics support f..
>> git tree:   net-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=17289cd210
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=7e1bc97341edbea6
>> dashboard link: https://syzkaller.appspot.com/bug?extid=3610d489778b57cc8031
>> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f237aa10
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1553834a10
>>
>> The bug was bisected to:
>>
>> commit af6eea57437a830293eab56246b6025cc7d46ee7
>> Author: Andrii Nakryiko 
>> Date:   Mon Mar 30 02:59:58 2020 +
>>
>>  bpf: Implement bpf_link-based cgroup BPF program attachment
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1173cd7e10
>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=1373cd7e10
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1573cd7e10
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+3610d489778b57cc8...@syzkaller.appspotmail.com
>> Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program 
>> attachment")
>>
>> general protection fault, probably for non-canonical address 
>> 0xdc01:  [#1] PREEMPT SMP KASAN
>> KASAN: null-ptr-deref in range [0x0008-0x000f]
>> CPU: 0 PID: 7063 Comm: syz-executor654 Not tainted 5.7.0-rc6-syzkaller #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
> 
> No idea why it was bisected to bpf_link change. It seems completely struct 
> sock-related. Seems like
> 
> struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
> 
> ends up being NULL.
> 
> Can some more networking-savvy people help with investigating this, please?

Well, the repro definitely uses BPF.

On the following run, my kernel does not have L2TP, so it does not crash.

[pid 817013] bpf(BPF_TASK_FD_QUERY, {task_fd_query={pid=0, fd=-1, flags=0, 
buf_len=7, buf="cgroup", prog_id=0, fd_type=BPF_FD_TYPE_RAW_TRACEPOINT, 
probe_offset=0, probe_addr=0}}, 48) = -1 ENOENT (No such file or directory)
[pid 817013] openat(AT_FDCWD, "cgroup", O_RDWR|O_PATH) = 3
[pid 817013] bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SOCK, 
insn_cnt=4, insns=0x2000, license="GPL", log_level=0, log_size=0, 
log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", 
prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=-1, 
func_info_rec_size=8, func_info=NULL, func_info_cnt=0, line_info_rec_size=16, 
line_info=NULL, line_info_cnt=0, attach_btf_id=0}, 112) = -1 EPERM (Operation 
not permitted)
[pid 817013] bpf(BPF_LINK_CREATE, {link_create={prog_fd=-1, target_fd=3, 
attach_type=BPF_CGROUP_INET_SOCK_CREATE, flags=0}}, 16) = -1 EBADF (Bad file 
descriptor)
[pid 817013] socket(AF_INET, SOCK_DGRAM, IPPROTO_L2TP 
[pid 816180] <... nanosleep resumed>NULL) = 0
[pid 816180] wait4(-1, 0x7fffa59867cc, WNOHANG|__WALL, NULL) = 0
[pid 816180] nanosleep({tv_sec=0, tv_nsec=100},  
[pid 817013] <... socket resumed>)  = -1 EPROTONOSUPPORT (Protocol not supported)
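
For context, the crash above happens because inet_unhash() begins by
dereferencing sk->sk_prot->h.hashinfo, which is NULL for this L2TP socket.
A minimal sketch of the faulting path (the guard below is hypothetical and
only marks where the NULL pointer bites; the real fix may live elsewhere):

void inet_unhash(struct sock *sk)
{
	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;

	/* hashinfo == NULL here explains the GPF in the 0x08-0x0f range */
	if (WARN_ON_ONCE(!hashinfo))
		return;

	if (sk_unhashed(sk))
		return;
	/* locked removal from the established/listening hash follows */
}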

Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-20 Thread Eric Dumazet



On 5/19/20 11:42 PM, Ahmed S. Darwish wrote:
> Hello Eric,
> 
> On Tue, May 19, 2020 at 07:01:38PM -0700, Eric Dumazet wrote:
>>
>> On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
>>> Sequence counters write paths are critical sections that must never be
>>> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
>>>
>>> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
>>> netdev name retrieval.") handled a deadlock, observed with
>>> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
>>> infinitely spinning: it got scheduled after the seqcount write side
>>> blocked inside its own critical section.
>>>
>>> To fix that deadlock, among other issues, the commit added a
>>> cond_resched() inside the read side section. While this will get the
>>> non-preemptible kernel eventually unstuck, the seqcount reader is fully
>>> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
>>>
>>> The fix is also still broken: if the seqcount reader belongs to a
>>> real-time scheduling policy, it can spin forever and the kernel will
>>> livelock.
>>>
>>> Disabling preemption over the seqcount write side critical section will
>>> not work: inside it are a number of GFP_KERNEL allocations and mutex
>>> locking through the drivers/base/ :: device_rename() call chain.
>>>
>>> From all the above, replace the seqcount with a rwsem.
>>>
>>> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and 
>>> netdev name retrieval.)
>>> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
>>> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to 
>>> return an interface name)
>>> Cc: 
>>> Signed-off-by: Ahmed S. Darwish 
>>> Reviewed-by: Sebastian Andrzej Siewior 
>>> ---
>>>  net/core/dev.c | 30 --
>>>  1 file changed, 12 insertions(+), 18 deletions(-)
>>>
>>
>> Seems fine to me, assuming rwsem prevent starvation of the writer.
>>
> 
> Thanks for the review.
> 
> AFAIK, due to 5cfd92e12e13 ("locking/rwsem: Adaptive disabling of reader
> optimistic spinning"), using a rwsem shouldn't lead to writer starvation
> in the contended case.

Hmm this was in linux-5.3, so very recent stuff.

Has this patch been backported to stable releases ?

With all the Fixes: tags you added, stable teams will backport this networking 
patch to
all stable versions.

Do we have a way to tune a dedicated rwsem to "give preference to the (unique
in this case) writer" over a myriad of potential readers ?

Thanks.



Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Eric Dumazet



On 5/19/20 7:57 PM, David Miller wrote:
> From: Thomas Gleixner 
> Date: Wed, 20 May 2020 01:42:30 +0200
> 
>> Stephen Hemminger  writes:
>>> On Wed, 20 May 2020 00:23:48 +0200
>>> Thomas Gleixner  wrote:
 No. We did not. -ENOTESTCASE
>>>
>>> Please try, it isn't that hard..
>>>
>>> # time for ((i=0;i<1000;i++)); do ip li add dev dummy$i type dummy; done
>>>
>>> real    0m17.002s
>>> user    0m1.064s
>>> sys     0m0.375s
>>
>> And that solves the incorrectness of the current code in which way?
> 
> You mentioned that there wasn't a test case, he gave you one to try.
> 

I do not think this would ever use device rename, nor netdev_get_name()

None of this stuff is fast path really.

# time for ((i=1;i<1000;i++)); do ip li add dev dummy$i type dummy; done

real    0m1.127s
user    0m0.270s
sys     0m1.039s


Re: [PATCH v1 01/25] net: core: device_rename: Use rwsem instead of a seqcount

2020-05-19 Thread Eric Dumazet



On 5/19/20 2:45 PM, Ahmed S. Darwish wrote:
> Sequence counters write paths are critical sections that must never be
> preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
> 
> Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
> netdev name retrieval.") handled a deadlock, observed with
> CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
> infinitely spinning: it got scheduled after the seqcount write side
> blocked inside its own critical section.
> 
> To fix that deadlock, among other issues, the commit added a
> cond_resched() inside the read side section. While this will get the
> non-preemptible kernel eventually unstuck, the seqcount reader is fully
> exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
> 
> The fix is also still broken: if the seqcount reader belongs to a
> real-time scheduling policy, it can spin forever and the kernel will
> livelock.
> 
> Disabling preemption over the seqcount write side critical section will
> not work: inside it are a number of GFP_KERNEL allocations and mutex
> locking through the drivers/base/ :: device_rename() call chain.
> 
> From all the above, replace the seqcount with a rwsem.
> 
> Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and 
> netdev name retrieval.)
> Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
> Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to 
> return an interface name)
> Cc: 
> Signed-off-by: Ahmed S. Darwish 
> Reviewed-by: Sebastian Andrzej Siewior 
> ---
>  net/core/dev.c | 30 --
>  1 file changed, 12 insertions(+), 18 deletions(-)
>

Seems fine to me, assuming rwsem prevent starvation of the writer.

(Presumably this could be a per ndevice rwsem, or per netns, to provide some 
isolation)

Alternative would be to convert ndev->name from char array to a pointer (rcu 
protected),
but this looks quite invasive change, certainly not for stable branches.
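
For illustration, a minimal sketch of the rwsem-based read side under
discussion, assuming a single global devnet_rename_sem (simplified from the
proposed net/core/dev.c change):

static DECLARE_RWSEM(devnet_rename_sem);

int netdev_get_name(struct net *net, char *name, int ifindex)
{
	struct net_device *dev;
	int ret;

	down_read(&devnet_rename_sem);
	rcu_read_lock();

	dev = dev_get_by_index_rcu(net, ifindex);
	if (!dev) {
		ret = -ENODEV;
		goto out;
	}

	strcpy(name, dev->name);
	ret = 0;
out:
	rcu_read_unlock();
	up_read(&devnet_rename_sem);
	return ret;
}

dev_change_name() would then wrap the rename in down_write()/up_write()
instead of bumping a seqcount.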

Reviewed-by: Eric Dumazet 




Re: [PATCH] net/packet: simply allocations in alloc_one_pg_vec_page

2020-05-16 Thread Eric Dumazet
On Sat, May 16, 2020 at 3:35 PM Shakeel Butt  wrote:
>
> On Sat, May 16, 2020 at 1:40 PM David Miller  wrote:
> >
> > From: Shakeel Butt 
> > Date: Fri, 15 May 2020 19:17:36 -0700
> >
> > > and thus there is no need to have any fallback after vzalloc.
> >
> > This statement is false.
> >
> > The virtual mapping allocation or the page table allocations can fail.
> >
> > A fallback is therefore indeed necessary.
>
> I am assuming that you at least agree that vzalloc should only be
> called for non-zero order allocations. So, my argument is if non-zero
> order vzalloc has failed (allocations internal to vzalloc, including
> virtual mapping allocation and page table allocations, are order 0 and
> use GFP_KERNEL i.e. triggering reclaim and oom-killer) then the next
> non-zero order page allocation has very low chance of succeeding.


32bit kernels might have exhausted their vmalloc space, yet they can
still allocate order-0 pages.
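
For reference, the allocation helper being discussed uses a three-stage
fallback; a simplified sketch (not the exact net/packet code):

static char *alloc_one_pg_vec_page(unsigned long order)
{
	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_ZERO |
		    __GFP_NOWARN | __GFP_NORETRY;
	char *buffer;

	buffer = (char *)__get_free_pages(gfp, order);
	if (buffer)
		return buffer;

	/* Fall back to vmalloc; this can fail too, e.g. on a 32-bit
	 * kernel whose vmalloc address space is exhausted. */
	buffer = vzalloc((1UL << order) * PAGE_SIZE);
	if (buffer)
		return buffer;

	/* Last resort: retry the page allocator, now allowing reclaim. */
	gfp &= ~__GFP_NORETRY;
	return (char *)__get_free_pages(gfp, order);
}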


Re: [PATCH v2] Implement close-on-fork

2020-05-15 Thread Eric Dumazet
On Fri, May 15, 2020 at 8:23 AM Nate Karstens  wrote:
>
>
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
>
>  * accept(4)
>  * dup3(2)
>  * fcntl(2)
>  * open(2)
>  * socket(2)
>  * socketpair(2)
>  * unshare(2)
>
> Addresses underlying issue in that there is no way to prevent
> a fork() from duplicating a file descriptor. The existing
> close-on-exec flag partially-addresses this by allowing the
> parent process to mark a file descriptor as exclusive to itself,
> but there is still a period of time the failure can occur
> because the auto-close only occurs during the exec().
>
> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first
> calls a fork() and then an exec().
>
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
>
> ---
>
> This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
> for the original work.
>
> Thanks to everyone who provided comments on the first series of
> patches. Here are replies to specific comments:
>
> > I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> > together, so that we do not have to dirty two separate cache lines.
>
> I could be mistaken, but I don't think this would improve efficiency.
> The close-on-fork and close-on-exec flags are read at different
> times. If you assume separate syscalls for fork and exec then
> there are several switches between when the two flags are read.
> In addition, the close-on-fork flags in the new process must be
> cleared, which will be much harder if the flags are interleaved.

:/

Fast path in big and performance sensitive applications is not fork()
and/or exec().

This is open()/close() and others (socket(), accept(), ...)

We do not want them to access extra cache lines for this new feature.

Sorry, I will say no to these patches in their current form.
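
For context, the "group the two bits" suggestion refers to the per-process fd
bitmaps; a hypothetical layout placing the new bitmap next to the existing one
(the close_on_fork field name is assumed, this is not merged code):

struct fdtable {
	unsigned int max_fds;
	struct file __rcu **fd;		/* current fd array */
	unsigned long *close_on_exec;
	unsigned long *close_on_fork;	/* proposed, hypothetical */
	unsigned long *open_fds;
	unsigned long *full_fds_bits;
	struct rcu_head rcu;
};

The fast-path concern is that socket()/accept()/open()/close() would have to
touch a cache line for close_on_fork state that they do not touch today.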


Re: [regression] TC_MD5SIG on established sockets

2020-05-13 Thread Eric Dumazet
On Wed, May 13, 2020 at 12:49 PM Eric Dumazet  wrote:
>
> I do not think we want to transition sockets in the middle. since
> packets can be re-ordered in the network.
>
> MD5 is about security (and a loose form of it), so better make sure
> all packets have it from the beginning of the flow.
>
> A flow with TCP TS on can not suddenly be sending packets without TCP TS.
>
> Clearly, trying to support this operation is a can of worms, I do not
> want to maintain such atrocity.
>
> RFC can state whatever it wants, sometimes reality forces us to have
> sane operations.
>
> Thanks.


Also the RFC states :

"This password never appears in the connection stream, and the actual
form of the password is up to the application. It could even change
during the lifetime of a particular connection so long as this change
was synchronized on both ends"

It means the key can be changed, but this does not imply the option
can be turned on/off dynamically.
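
For illustration, updating the key itself (as opposed to toggling the option)
uses the same setsockopt on a socket in CLOSE or LISTEN state; a minimal
sketch, assuming listen_fd is a listening TCP socket and peer holds the peer
address:

struct tcp_md5sig md5 = { };

memcpy(&md5.tcpm_addr, &peer, sizeof(peer));
md5.tcpm_keylen = strlen("new-secret");
memcpy(md5.tcpm_key, "new-secret", md5.tcpm_keylen);

if (setsockopt(listen_fd, IPPROTO_TCP, TCP_MD5SIG, &md5, sizeof(md5)) < 0)
	perror("TCP_MD5SIG");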



>
> On Wed, May 13, 2020 at 12:38 PM Mathieu Desnoyers
>  wrote:
> >
> > Hi,
> >
> > I am reporting a regression with respect to use of TCP_MD5SIG/TCP_MD5SIG_EXT
> > on established sockets. It is observed by a customer.
> >
> > This issue is introduced by this commit:
> >
> > commit 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on 
> > established sockets"
> >
> > The intent of this commit appears to be to fix a use of uninitialized value 
> > in
> > tcp_parse_options(). The change introduced by this commit is to disallow 
> > setting
> > the TCP_MD5SIG{,_EXT} socket options on an established socket.
> >
> > The justification for this change appears in the commit message:
> >
> >"I believe this was caused by a TCP_MD5SIG being set on live
> > flow.
> >
> > This is highly unexpected, since TCP option space is limited.
> >
> > For instance, presence of TCP MD5 option automatically disables
> > TCP TimeStamp option at SYN/SYNACK time, which we can not do
> > once flow has been established.
> >
> > Really, adding/deleting an MD5 key only makes sense on sockets
> > in CLOSE or LISTEN state."
> >
> > However, reading through RFC2385 [1], this justification does not appear
> > correct. Quoting to the RFC:
> >
> >"This password never appears in the connection stream, and the actual
> > form of the password is up to the application. It could even change
> > during the lifetime of a particular connection so long as this change
> > was synchronized on both ends"
> >
> > The paragraph above clearly underlines that changing the MD5 signature of
> > a live TCP socket is allowed.
> >
> > I also do not understand why it would be invalid to transition an 
> > established
> > TCP socket from no-MD5 to MD5, or transition from MD5 to no-MD5. Quoting the
> > RFC:
> >
> >   "The total header size is also an issue.  The TCP header specifies
> >where segment data starts with a 4-bit field which gives the total
> >size of the header (including options) in 32-byte words.  This means
> >that the total size of the header plus option must be less than or
> >equal to 60 bytes -- this leaves 40 bytes for options."
> >
> > The paragraph above seems to be the only indication that some TCP options
> > cannot be combined on a given TCP socket: if the resulting header size does
> > not fit. However, I do not see anything in the specification preventing any
> > of the following use-cases on an established TCP socket:
> >
> > - Transition from no-MD5 to MD5,
> > - Transition from MD5 to no-MD5,
> > - Changing the MD5 key associated with a socket.
> >
> > As long as the resulting combination of options does not exceed the 
> > available
> > header space.
> >
> > Can we please fix this KASAN report in a way that does not break user-space
> > applications expectations about Linux' implementation of RFC2385 ?
> >
> > Thanks,
> >
> > Mathieu
> >
> > [1] RFC2385: https://tools.ietf.org/html/rfc2385
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com


Re: [regression] TC_MD5SIG on established sockets

2020-05-13 Thread Eric Dumazet
I do not think we want to transition sockets in the middle. since
packets can be re-ordered in the network.

MD5 is about security (and a loose form of it), so better make sure
all packets have it from the beginning of the flow.

A flow with TCP TS on can not suddenly be sending packets without TCP TS.

Clearly, trying to support this operation is a can of worms, I do not
want to maintain such atrocity.

RFC can state whatever it wants, sometimes reality forces us to have
sane operations.

Thanks.

On Wed, May 13, 2020 at 12:38 PM Mathieu Desnoyers
 wrote:
>
> Hi,
>
> I am reporting a regression with respect to use of TCP_MD5SIG/TCP_MD5SIG_EXT
> on established sockets. It is observed by a customer.
>
> This issue is introduced by this commit:
>
> commit 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on 
> established sockets"
>
> The intent of this commit appears to be to fix a use of uninitialized value in
> tcp_parse_options(). The change introduced by this commit is to disallow 
> setting
> the TCP_MD5SIG{,_EXT} socket options on an established socket.
>
> The justification for this change appears in the commit message:
>
>"I believe this was caused by a TCP_MD5SIG being set on live
> flow.
>
> This is highly unexpected, since TCP option space is limited.
>
> For instance, presence of TCP MD5 option automatically disables
> TCP TimeStamp option at SYN/SYNACK time, which we can not do
> once flow has been established.
>
> Really, adding/deleting an MD5 key only makes sense on sockets
> in CLOSE or LISTEN state."
>
> However, reading through RFC2385 [1], this justification does not appear
> correct. Quoting to the RFC:
>
>"This password never appears in the connection stream, and the actual
> form of the password is up to the application. It could even change
> during the lifetime of a particular connection so long as this change
> was synchronized on both ends"
>
> The paragraph above clearly underlines that changing the MD5 signature of
> a live TCP socket is allowed.
>
> I also do not understand why it would be invalid to transition an established
> TCP socket from no-MD5 to MD5, or transition from MD5 to no-MD5. Quoting the
> RFC:
>
>   "The total header size is also an issue.  The TCP header specifies
>where segment data starts with a 4-bit field which gives the total
>size of the header (including options) in 32-byte words.  This means
>that the total size of the header plus option must be less than or
>equal to 60 bytes -- this leaves 40 bytes for options."
>
> The paragraph above seems to be the only indication that some TCP options
> cannot be combined on a given TCP socket: if the resulting header size does
> not fit. However, I do not see anything in the specification preventing any
> of the following use-cases on an established TCP socket:
>
> - Transition from no-MD5 to MD5,
> - Transition from MD5 to no-MD5,
> - Changing the MD5 key associated with a socket.
>
> As long as the resulting combination of options does not exceed the available
> header space.
>
> Can we please fix this KASAN report in a way that does not break user-space
> applications expectations about Linux' implementation of RFC2385 ?
>
> Thanks,
>
> Mathieu
>
> [1] RFC2385: https://tools.ietf.org/html/rfc2385
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [PATCH 3/3] net: cleanly handle kernel vs user buffers for ->msg_control

2020-05-13 Thread Eric Dumazet



On 5/13/20 9:09 AM, Christoph Hellwig wrote:
> On Wed, May 13, 2020 at 08:41:57AM -0700, Eric Dumazet wrote:
>>> +* recv* side when msg_control_is_user is set, msg_control is the kernel
>>> +* buffer used for all other cases.
>>> +*/
>>> +   union {
>>> +   void*msg_control;
>>> +   void __user *msg_control_user;
>>> +   };
>>> +   boolmsg_control_is_user : 1;
>>
>> Adding a field in this structure seems dangerous.
>>
>> Some users of 'struct msghdr '  define their own struct on the stack,
>> and are unaware of this new mandatory field.
>>
>> This bit contains garbage, crashes are likely to happen ?
>>
>> Look at IPV6_2292PKTOPTIONS for example.
> 
> I though of that, an that is why the field is structured as-is.  The idea
> is that the field only matters if:
> 
>  (1) we are in the recvmsg and friends path, and
>  (2) msg_control is non-zero
> 
> I went through the places that initialize msg_control to find any spot
> that would need an annotation.  The IPV6_2292PKTOPTIONS sockopt doesn't
> need one as it is using the msghdr in sendmsg-like context.
> 
> That being said while I did the audit I'd appreciate another look from
> people that know the networking code better than me of course.
> 

Please try the following syzbot repro, since it crashes after your patch.

// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

uint64_t r[1] = {0xffffffffffffffff};

int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  intptr_t res = 0;

  // socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
  res = syscall(__NR_socket, 0xaul, 1ul, 0);
  if (res != -1)
    r[0] = res;

  *(uint32_t*)0x20000080 = 7;
  // setsockopt(3, SOL_IPV6, IPV6_2292HOPLIMIT, [7], 4) = 0
  syscall(__NR_setsockopt, r[0], 0x29, 8, 0x20000080ul, 4ul);

  *(uint32_t*)0x20000040 = 0x18ff8;
  // getsockopt(3, SOL_IPV6, IPV6_2292PKTOPTIONS,
  // "\24\0\0\0\0\0\0\0)\0\0\0\10\0\0\0\1\0\0\0\0\0\0\0", [102392->24]) = 0
  syscall(__NR_getsockopt, r[0], 0x29, 6, 0x20004040ul, 0x20000040ul);

  return 0;
}




Re: [PATCH 3/3] net: cleanly handle kernel vs user buffers for ->msg_control

2020-05-13 Thread Eric Dumazet



On 5/11/20 4:59 AM, Christoph Hellwig wrote:
> The msg_control field in struct msghdr can either contain a user
> pointer when used with the recvmsg system call, or a kernel pointer
> when used with sendmsg.  To complicate things further kernel_recvmsg
> can stuff a kernel pointer in and then use set_fs to make the uaccess
> helpers accept it.
> 
> Replace it with a union of a kernel pointer msg_control field, and
> a user pointer msg_control_user one, and allow kernel_recvmsg operate
> on a proper kernel pointer using a bitfield to override the normal
> choice of a user pointer for recvmsg.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/socket.h | 12 ++-
>  net/compat.c   |  5 +++--
>  net/core/scm.c | 49 --
>  net/ipv4/ip_sockglue.c |  3 ++-
>  net/socket.c   | 22 ++-
>  5 files changed, 50 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 4cc64d611cf49..04d2bc97f497d 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -50,7 +50,17 @@ struct msghdr {
>   void*msg_name;  /* ptr to socket address structure */
>   int msg_namelen;/* size of socket address structure */
>   struct iov_iter msg_iter;   /* data */
> - void*msg_control;   /* ancillary data */
> +
> + /*
> +  * Ancillary data. msg_control_user is the user buffer used for the
> +  * recv* side when msg_control_is_user is set, msg_control is the kernel
> +  * buffer used for all other cases.
> +  */
> + union {
> + void*msg_control;
> + void __user *msg_control_user;
> + };
> + boolmsg_control_is_user : 1;

Adding a field in this structure seems dangerous.

Some users of 'struct msghdr '  define their own struct on the stack,
and are unaware of this new mandatory field.

This bit contains garbage, crashes are likely to happen ?

Look at IPV6_2292PKTOPTIONS for example.
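
To make the concern concrete, a sketch of the risky pattern (hypothetical
kernel-side caller):

void fragile_pattern(void *cbuf, size_t clen)
{
	struct msghdr msg;		/* uninitialized stack memory */

	msg.msg_control = cbuf;		/* writes one side of the union */
	msg.msg_controllen = clen;
	/* msg.msg_control_is_user was never written: it still holds
	 * stack garbage, so a recv-side helper may treat cbuf as the
	 * wrong kind of pointer. */
}

A designated initializer such as "struct msghdr msg = { .msg_control = cbuf };"
would zero the bitfield implicitly; the danger is the partial field-by-field
initialization shown above.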




Re: [PATCH] net: optimize cmpxchg in ip_idents_reserve

2020-05-07 Thread Eric Dumazet
On Thu, May 7, 2020 at 2:12 AM Shaokun Zhang  wrote:
>
> Hi Peter/Eric,
>
> Shall we use atomic_add_return() unconditionally and add some comments? Or I 
> missed
> something.
>


Yes. A big fat comment, because I do not want yet another bogus
complaint from someone playing with a buggy UBSAN.
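
A sketch of the direction discussed, with the kind of comment requested (the
signature is simplified; the real function picks p from a hashed array and
also mixes in a time-based random delta, omitted here):

static u32 ip_idents_reserve(atomic_t *p, int segs)
{
	/* IP idents are free to wrap: the overflow in atomic_add_return()
	 * is intentional and well defined on the atomic type, so no
	 * cmpxchg loop is needed and UBSAN has nothing to flag here. */
	return (u32)atomic_add_return(segs, p) - segs;
}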


Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-06 Thread Eric Dumazet
On Wed, May 6, 2020 at 5:59 AM SeongJae Park  wrote:
>
> TL; DR: It was not kernel's fault, but the benchmark program.
>
> So, the problem is reproducible using the lebench[1] only.  I carefully read
> it's code again.
>
> Before running the problem occurred "poll big" sub test, lebench executes
> "context switch" sub test.  For the test, it sets the cpu affinity[2] and
> process priority[3] of itself to '0' and '-20', respectively.  However, it
> doesn't restore the values to original value even after the "context switch" 
> is
> finished.  For the reason, "select big" sub test also run binded on CPU 0 and
> has lowest nice value.  Therefore, it can disturb the RCU callback thread for
> the CPU 0, which processes the deferred deallocations of the sockets, and as a
> result it triggers the OOM.
>
> We confirmed the problem disappears by offloading the RCU callbacks from the
> CPU 0 using rcu_nocbs=0 boot parameter or simply restoring the affinity and/or
> priority.
>
> Someone _might_ still argue that this is kernel problem because the problem
> didn't occur on the old kernels prior to the Al's patches.  However, setting
> the affinity and priority was available because the program received the
> permission.  Therefore, it would be reasonable to blame the system
> administrators rather than the kernel.
>
> So, please ignore this patchset, apology for making confuse.  If you still has
> some doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] 
> https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] 
> https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park

No harm done, thanks for running more tests and root-causing the issue !


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:31 AM, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:25 AM, Eric Dumazet wrote:
>>
>>
>> On 5/5/20 9:13 AM, SeongJae Park wrote:
>>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
>>>
>>>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>>>>>
>>>>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>>>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>>>> Why do we have 10,000,000 objects around ? Could this be because of
>>>>>>>> some RCU problem ?
>>>>>>>
>>>>>>> Mainly because of a long RCU grace period, as you guess.  I have no 
>>>>>>> idea how
>>>>>>> the grace period became so long in this case.
>>>>>>>
>>>>>>> As my test machine was a virtual machine instance, I guess RCU readers
>>>>>>> preemption[1] like problem might affected this.
>>>>>>>
>>>>>>> [1] 
>>>>>>> https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>>>>>>
>>>>>>>>
>>>>>>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>>>>>>
>>>>>>> Yes, both the old kernel that prior to Al's patches and the recent 
>>>>>>> kernel
>>>>>>> reverting the Al's patches didn't reproduce the problem.
>>>>>>>
>>>>>>
>>>>>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
>>>>>> slab caches ?
>>>>>>
>>>>>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, 
>>>>>> but not the struct socket_wq
>>>>>> object that was allocated in sock_alloc_inode() before Al patches.
>>>>>>
>>>>>> These objects should be visible in kmalloc-64 kmem cache.
>>>>>
>>>>> Not exactly the 10,000,000, as it is only the possible highest number, 
>>>>> but I
>>>>> was able to observe clear exponential increase of the number of the 
>>>>> objects
>>>>> using slabtop.  Before the start of the problematic workload, the number 
>>>>> of
>>>>> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
>>>>> increase
>>>>> to 1,136,576.
>>>>>
>>>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>>> before:   5760   5088  88%    0.06K 90   64   360K kmalloc-64
>>>>> after:  1136576 1136576 100%    0.06K  17759   64 71036K kmalloc-64
>>>>>
>>>>
>>>> Great, thanks.
>>>>
>>>> How recent is the kernel you are running for your experiment ?
>>>
>>> It's based on 5.4.35.
>>>
>>>>
>>>> Let's make sure the bug is not in RCU.
>>>
>>> One thing I can currently say is that the grace period passes at last.  I
>>> modified the benchmark to repeat not 10,000 times but only 5,000 times to 
>>> run
>>> the test without OOM but easily observable memory pressure.  As soon as the
>>> benchmark finishes, the memory were freed.
>>>
>>> If you need more tests, please let me know.
>>>
>>
>> I would ask Paul opinion on this issue, because we have many objects
>> being freed after RCU grace periods.
>>
>> If RCU subsystem can not keep-up, I guess other workloads will also suffer.
>>
>> Sure, we can revert patches there and there trying to work around the issue,
>> but for objects allocated from process context, we should not have these 
>> problems.
>>
> 
> I wonder if simply adjusting rcu_divisor to 6 or 5 would help 
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 
> d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
>  100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
>  static ulong jiffies_till_first_fqs = ULONG_MAX;
>  static ulong jiffies_till_next_fqs = ULONG_MAX;
>  static bool rcu_kick_kthreads;
> -static int rcu_divisor = 7;
> +static int rcu_divisor = 6;
>  module_param(rcu_divisor, int, 0644);
>  
>  /* Force an exit from rcu_do_batch() after 3 milliseconds. */
> 

To be clear, you can adjust the value without building a new kernel.

echo 6 >/sys/module/rcutree/parameters/rcu_divisor




Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:25 AM, Eric Dumazet wrote:
> 
> 
> On 5/5/20 9:13 AM, SeongJae Park wrote:
>> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
>>
>>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>>>>
>>>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
>>>>>> wrote:
>>>>>>
>>>>>
>>>>>>> Why do we have 10,000,000 objects around ? Could this be because of
>>>>>>> some RCU problem ?
>>>>>>
>>>>>> Mainly because of a long RCU grace period, as you guess.  I have no idea 
>>>>>> how
>>>>>> the grace period became so long in this case.
>>>>>>
>>>>>> As my test machine was a virtual machine instance, I guess RCU readers
>>>>>> preemption[1] like problem might affected this.
>>>>>>
>>>>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>>>>>
>>>>>>>
>>>>>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>>>>>
>>>>>> Yes, both the old kernel that prior to Al's patches and the recent kernel
>>>>>> reverting the Al's patches didn't reproduce the problem.
>>>>>>
>>>>>
>>>>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
>>>>> slab caches ?
>>>>>
>>>>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
>>>>> not the struct socket_wq
>>>>> object that was allocated in sock_alloc_inode() before Al patches.
>>>>>
>>>>> These objects should be visible in kmalloc-64 kmem cache.
>>>>
>>>> Not exactly the 10,000,000, as it is only the possible highest number, but 
>>>> I
>>>> was able to observe clear exponential increase of the number of the objects
>>>> using slabtop.  Before the start of the problematic workload, the number of
>>>> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
>>>> increase
>>>> to 1,136,576.
>>>>
>>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> before:   5760   5088  88%    0.06K 90   64   360K kmalloc-64
>>>> after:  1136576 1136576 100%    0.06K  17759   64 71036K kmalloc-64
>>>>
>>>
>>> Great, thanks.
>>>
>>> How recent is the kernel you are running for your experiment ?
>>
>> It's based on 5.4.35.
>>
>>>
>>> Let's make sure the bug is not in RCU.
>>
>> One thing I can currently say is that the grace period passes at last.  I
>> modified the benchmark to repeat not 10,000 times but only 5,000 times to run
>> the test without OOM but easily observable memory pressure.  As soon as the
>> benchmark finishes, the memory were freed.
>>
>> If you need more tests, please let me know.
>>
> 
> I would ask Paul opinion on this issue, because we have many objects
> being freed after RCU grace periods.
> 
> If RCU subsystem can not keep-up, I guess other workloads will also suffer.
> 
> Sure, we can revert patches there and there trying to work around the issue,
> but for objects allocated from process context, we should not have these 
> problems.
> 

I wonder if simply adjusting rcu_divisor to 6 or 5 would help 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 
d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a
 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
 static ulong jiffies_till_first_fqs = ULONG_MAX;
 static ulong jiffies_till_next_fqs = ULONG_MAX;
 static bool rcu_kick_kthreads;
-static int rcu_divisor = 7;
+static int rcu_divisor = 6;
 module_param(rcu_divisor, int, 0644);
 
 /* Force an exit from rcu_do_batch() after 3 milliseconds. */



Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 9:13 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet  wrote:
> 
>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>>>
>>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  
>>> wrote:
>>>
>>>>
>>>>
>>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
>>>>> wrote:
>>>>>
>>>>
>>>>>> Why do we have 10,000,000 objects around ? Could this be because of
>>>>>> some RCU problem ?
>>>>>
>>>>> Mainly because of a long RCU grace period, as you guess.  I have no idea 
>>>>> how
>>>>> the grace period became so long in this case.
>>>>>
>>>>> As my test machine was a virtual machine instance, I guess RCU readers
>>>>> preemption[1] like problem might affected this.
>>>>>
>>>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>>>>
>>>>>>
>>>>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>>>>
>>>>> Yes, both the old kernel that prior to Al's patches and the recent kernel
>>>>> reverting the Al's patches didn't reproduce the problem.
>>>>>
>>>>
>>>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
>>>> slab caches ?
>>>>
>>>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
>>>> not the struct socket_wq
>>>> object that was allocated in sock_alloc_inode() before Al patches.
>>>>
>>>> These objects should be visible in kmalloc-64 kmem cache.
>>>
>>> Not exactly the 10,000,000, as it is only the possible highest number, but I
>>> was able to observe clear exponential increase of the number of the objects
>>> using slabtop.  Before the start of the problematic workload, the number of
>>> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
>>> increase
>>> to 1,136,576.
>>>
>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> before:   5760   5088  88%    0.06K 90   64   360K kmalloc-64
>>> after:  1136576 1136576 100%    0.06K  17759   64 71036K kmalloc-64
>>>
>>
>> Great, thanks.
>>
>> How recent is the kernel you are running for your experiment ?
> 
> It's based on 5.4.35.
> 
>>
>> Let's make sure the bug is not in RCU.
> 
> One thing I can currently say is that the grace period passes at last.  I
> modified the benchmark to repeat not 10,000 times but only 5,000 times to run
> the test without OOM but easily observable memory pressure.  As soon as the
> benchmark finishes, the memory were freed.
> 
> If you need more tests, please let me know.
> 

I would ask Paul opinion on this issue, because we have many objects
being freed after RCU grace periods.

If RCU subsystem can not keep-up, I guess other workloads will also suffer.

Sure, we can revert patches there and there trying to work around the issue,
but for objects allocated from process context, we should not have these 
problems.





Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet
On Tue, May 5, 2020 at 8:47 AM SeongJae Park  wrote:
>
> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet  wrote:
>
> >
> >
> > On 5/5/20 8:07 AM, SeongJae Park wrote:
> > > On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  
> > > wrote:
> > >
> >
> > >> Why do we have 10,000,000 objects around ? Could this be because of
> > >> some RCU problem ?
> > >
> > > Mainly because of a long RCU grace period, as you guess.  I have no idea 
> > > how
> > > the grace period became so long in this case.
> > >
> > > As my test machine was a virtual machine instance, I guess RCU readers
> > > preemption[1] like problem might affected this.
> > >
> > > [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> > >
> > >>
> > >> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> > >
> > > Yes, both the old kernel that prior to Al's patches and the recent kernel
> > > reverting the Al's patches didn't reproduce the problem.
> > >
> >
> > I repeat my question : Do you have 10,000,000 (smaller) objects kept in 
> > slab caches ?
> >
> > TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but 
> > not the struct socket_wq
> > object that was allocated in sock_alloc_inode() before Al patches.
> >
> > These objects should be visible in kmalloc-64 kmem cache.
>
> Not exactly the 10,000,000, as it is only the possible highest number, but I
> was able to observe clear exponential increase of the number of the objects
> using slabtop.  Before the start of the problematic workload, the number of
> objects of 'kmalloc-64' was 5760, but I was able to observe the number 
> increase
> to 1,136,576.
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> before:   5760   5088  88%    0.06K 90   64   360K kmalloc-64
> after:  1136576 1136576 100%    0.06K  17759   64 71036K kmalloc-64
>

Great, thanks.

How recent is the kernel you are running for your experiment ?

Let's make sure the bug is not in RCU.

After Al changes, RCU got slightly better under stress.


Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet



On 5/5/20 8:07 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet  wrote:
> 

>> Why do we have 10,000,000 objects around ? Could this be because of
>> some RCU problem ?
> 
> Mainly because of a long RCU grace period, as you guess.  I have no idea how
> the grace period became so long in this case.
> 
> As my test machine was a virtual machine instance, I guess RCU readers
> preemption[1] like problem might affected this.
> 
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
> 
>>
>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> 
> Yes, both the old kernel that prior to Al's patches and the recent kernel
> reverting the Al's patches didn't reproduce the problem.
>

I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab 
caches ?

TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not 
the struct socket_wq
object that was allocated in sock_alloc_inode() before Al patches.

These objects should be visible in kmalloc-64 kmem cache.



Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

2020-05-05 Thread Eric Dumazet
On Tue, May 5, 2020 at 4:54 AM SeongJae Park  wrote:
>
> CC-ing sta...@vger.kernel.org and adding some more explanations.
>
> On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park  wrote:
>
> > From: SeongJae Park 
> >
> > The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the
> > deallocation of 'socket_alloc' to be done asynchronously using RCU, as
> > same to 'sock.wq'.  And the following commit 333f7909a857 ("coallocate
> > socket_sq with socket itself") made those to have same life cycle.
> >
> > The changes made the code much more simple, but also made 'socket_alloc'
> > live longer than before.  For the reason, user programs intensively
> > repeating allocations and deallocations of sockets could cause memory
> > pressure on recent kernels.
>
> I found this problem on a production virtual machine utilizing 4GB memory 
> while
> running lebench[1].  The 'poll big' test of lebench opens 1000 sockets, polls
> and closes those.  This test is repeated 10,000 times.  Therefore it should
> consume only 1000 'socket_alloc' objects at once.  As size of socket_alloc is
> about 800 Bytes, it's only 800 KiB.  However, on the recent kernels, it could
> consume up to 10,000,000 objects (about 8 GiB).  On the test machine, I
> confirmed it consuming about 4GB of the system memory and results in OOM.
>
> [1] https://github.com/LinuxPerfStudy/LEBench

To be fair, I have not backported Al patches to Google production
kernels, nor I have tried this benchmark.

Why do we have 10,000,000 objects around ? Could this be because of
some RCU problem ?

Once Al patches reverted, do you have 10,000,000 sock_alloc around ?

Thanks.

>
> >
> > To avoid the problem, this commit reverts the changes.
>
> I also tried to make fixup rather than reverts, but I couldn't easily find
> simple fixup.  As the commits 6d7855c54e1e and 333f7909a857 were for code
> refactoring rather than performance optimization, I thought introducing 
> complex
> fixup for this problem would make no sense.  Meanwhile, the memory pressure
> regression could affect real machines.  To this end, I decided to quickly
> revert the commits first and consider better refactoring later.
>
>
> Thanks,
> SeongJae Park
>
> >
> > SeongJae Park (2):
> >   Revert "coallocate socket_wq with socket itself"
> >   Revert "sockfs: switch to ->free_inode()"
> >
> >  drivers/net/tap.c  |  5 +++--
> >  drivers/net/tun.c  |  8 +---
> >  include/linux/if_tap.h |  1 +
> >  include/linux/net.h|  4 ++--
> >  include/net/sock.h |  4 ++--
> >  net/core/sock.c|  2 +-
> >  net/socket.c   | 23 ---
> >  7 files changed, 30 insertions(+), 17 deletions(-)
> >
> > --
> > 2.17.1


Re: [PATCH] net: fix memory leaks in flush_backlog() with RPS

2020-05-01 Thread Eric Dumazet



On 5/1/20 8:15 PM, Qian Cai wrote:
> netif_receive_skb_list_internal() could call enqueue_to_backlog() to put
> some skb to softnet_data.input_pkt_queue and then in
> ip_route_input_slow(), it allocates a dst_entry to be used in
> skb_dst_set(). Later,
> 
> cleanup_net
>   default_device_exit_batch
> unregister_netdevice_many
>   rollback_registered_many
> flush_all_backlogs
> 
> will call flush_backlog() for all CPUs which would call kfree_skb() for
> each skb on the input_pkt_queue without calling skb_dst_drop() first.
> 
> unreferenced object 0xffff97008e4c4040 (size 176):
>  comm "softirq", pid 0, jiffies 4295173845 (age 32012.550s)
>  hex dump (first 32 bytes):
>    00 d0 a5 74 04 97 ff ff 40 72 1a 96 ff ff ff ff  ...t....@r......
>    c1 a3 c5 95 ff ff ff ff 00 00 00 00 00 00 00 00  ................
>  backtrace:
>    [<0000000030483fae>] kmem_cache_alloc+0x184/0x430
>    [<000000007ae17545>] dst_alloc+0x8e/0x128
>    [<000000001efe9a1f>] rt_dst_alloc+0x6f/0x1e0
>    rt_dst_alloc at net/ipv4/route.c:1628
>    [] ip_route_input_rcu+0xdfe/0x1640
>    ip_route_input_slow at net/ipv4/route.c:2218
>    (inlined by) ip_route_input_rcu at net/ipv4/route.c:2348
>    [<000000009f30cbc0>] ip_route_input_noref+0xab/0x1a0
>    [<000000004f53bd04>] arp_process+0x83a/0xf50
>    arp_process at net/ipv4/arp.c:813 (discriminator 1)
>    [<0000000061fd547d>] arp_rcv+0x276/0x330
>    [<0000000007dbfa7a>] __netif_receive_skb_list_core+0x4d2/0x500
>    [<0000000062d5f6d2>] netif_receive_skb_list_internal+0x4cb/0x7d0
>    [<000000002baa2b74>] gro_normal_list+0x55/0xc0
>    [<0000000093d04885>] napi_complete_done+0xea/0x350
>    [<00000000467dd088>] tg3_poll_msix+0x174/0x310 [tg3]
>    [<00000000498af7d9>] net_rx_action+0x278/0x890
>    [<000000001e81d7e6>] __do_softirq+0xd9/0x589
>    [<00000000087ee354>] irq_exit+0xa2/0xc0
>    [<000000001c4db0cd>] do_IRQ+0x87/0x180
> 
> Signed-off-by: Qian Cai 
> ---
>  net/core/dev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 522288177bbd..b898cd3036da 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5496,6 +5496,7 @@ static void flush_backlog(struct work_struct *work)
>  	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
>  		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
>  			__skb_unlink(skb, &sd->input_pkt_queue);
> +			skb_dst_drop(skb);
>  			kfree_skb(skb);
>  			input_queue_head_incr(sd);
>  		}
> 


kfree_skb() is supposed to call skb_dst_drop() (look in 
skb_release_head_state())

If you think about it, we would have hundreds of similar bugs if this was not 
the case.
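
For reference, an abridged sketch of that helper (trimmed from
net/core/skbuff.c; the full version also tears down conntrack state):

static void skb_release_head_state(struct sk_buff *skb)
{
	skb_dst_drop(skb);		/* the dst reference is dropped here */
	if (skb->destructor)
		skb->destructor(skb);
}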


Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim

2019-10-19 Thread Eric Dumazet



On 10/19/19 2:18 PM, Tejun Heo wrote:

> Whatever works is fine by me.  gfpflags_allow_blocking() is clearer
> than testing __GFP_DIRECT_RECLAIM directly tho.  Maybe a better way is
> introducing a new gfpflags_ helper?

Sounds good to me !



Re: [PATCH] net: fix sk_page_frag() recursion from memory reclaim

2019-10-19 Thread Eric Dumazet



On 10/19/19 10:01 AM, Tejun Heo wrote:
> From f0335a5d14d3596d36e3ffddb2fd4fa0dc6ca9c2 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Sat, 19 Oct 2019 09:10:57 -0700
> 
> sk_page_frag() optimizes skb_frag allocations by using per-task
> skb_frag cache when it knows it's the only user.  The condition is
> determined by seeing whether the socket allocation mask allows
> blocking - if the allocation may block, it obviously owns the task's
> context and ergo exclusively owns current->task_frag.
> 
> Unfortunately, this misses recursion through memory reclaim path.
> Please take a look at the following backtrace.
> 
>  [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
>  ...
>  tcp_sendmsg+0x27/0x40
>  sock_sendmsg+0x30/0x40
>  sock_xmit.isra.24+0xa1/0x170 [nbd]
>  nbd_send_cmd+0x1d2/0x690 [nbd]
>  nbd_queue_rq+0x1b5/0x3b0 [nbd]
>  __blk_mq_try_issue_directly+0x108/0x1b0
>  blk_mq_request_issue_directly+0xbd/0xe0
>  blk_mq_try_issue_list_directly+0x41/0xb0
>  blk_mq_sched_insert_requests+0xa2/0xe0
>  blk_mq_flush_plug_list+0x205/0x2a0
>  blk_flush_plug_list+0xc3/0xf0
>  [1] blk_finish_plug+0x21/0x2e
>  _xfs_buf_ioapply+0x313/0x460
>  __xfs_buf_submit+0x67/0x220
>  xfs_buf_read_map+0x113/0x1a0
>  xfs_trans_read_buf_map+0xbf/0x330
>  xfs_btree_read_buf_block.constprop.42+0x95/0xd0
>  xfs_btree_lookup_get_block+0x95/0x170
>  xfs_btree_lookup+0xcc/0x470
>  xfs_bmap_del_extent_real+0x254/0x9a0
>  __xfs_bunmapi+0x45c/0xab0
>  xfs_bunmapi+0x15/0x30
>  xfs_itruncate_extents_flags+0xca/0x250
>  xfs_free_eofblocks+0x181/0x1e0
>  xfs_fs_destroy_inode+0xa8/0x1b0
>  destroy_inode+0x38/0x70
>  dispose_list+0x35/0x50
>  prune_icache_sb+0x52/0x70
>  super_cache_scan+0x120/0x1a0
>  do_shrink_slab+0x120/0x290
>  shrink_slab+0x216/0x2b0
>  shrink_node+0x1b6/0x4a0
>  do_try_to_free_pages+0xc6/0x370
>  try_to_free_mem_cgroup_pages+0xe3/0x1e0
>  try_charge+0x29e/0x790
>  mem_cgroup_charge_skmem+0x6a/0x100
>  __sk_mem_raise_allocated+0x18e/0x390
>  __sk_mem_schedule+0x2a/0x40
>  [0] tcp_sendmsg_locked+0x8eb/0xe10
>  tcp_sendmsg+0x27/0x40
>  sock_sendmsg+0x30/0x40
>  ___sys_sendmsg+0x26d/0x2b0
>  __sys_sendmsg+0x57/0xa0
>  do_syscall_64+0x42/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> In [0], tcp_send_msg_locked() was using current->page_frag when it
> called sk_wmem_schedule().  It already calculated how many bytes can
> be fit into current->page_frag.  Due to memory pressure,
> sk_wmem_schedule() called into memory reclaim path which called into
> xfs and then IO issue path.  Because the filesystem in question is
> backed by nbd, the control goes back into the tcp layer - back into
> tcp_sendmsg_locked().
> 
> nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
> sense - it's in the process of freeing memory and wants to be able to,
> e.g., drop clean pages to make forward progress.  However, this
> confused sk_page_frag() called from [2].  Because it only tests
> whether the allocation allows blocking which it does, it now thinks
> current->page_frag can be used again although it already was being
> used in [0].
> 
> After [2] used current->page_frag, the offset would be increased by
> the used amount.  When the control returns to [0],
> current->page_frag's offset is increased and the previously calculated
> number of bytes now may overrun the end of allocated memory leading to
> silent memory corruptions.
> 
> Fix it by updating sk_page_frag() to test __GFP_MEMALLOC and not use
> current->task_frag if set.
> 
> Signed-off-by: Tejun Heo 
> Cc: Josef Bacik 
> Cc: sta...@vger.kernel.org
> ---
>  include/net/sock.h | 15 ---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2c53f1a1d905..4e2ca38acc3c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2233,12 +2233,21 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, 
> int size, gfp_t gfp,
>   * sk_page_frag - return an appropriate page_frag
>   * @sk: socket
>   *
> - * If socket allocation mode allows current thread to sleep, it means its
> - * safe to use the per task page_frag instead of the per socket one.
> + * Use the per task page_frag instead of the per socket one for
> + * optimization when we know there can be no other users.
> + *
> + * 1. The socket allocation mode allows current thread to sleep.  This is
> + *the sleepable context which owns the task page_frag.
> + *
> + * 2. The socket allocation mode doesn't indicate that the socket is being
> + *used to reclaim memory.  Memory reclaim may nest inside other socket
> + *operations and end up recursing into sk_page_frag() while it's
> + *already in use.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> - if (gfpflags_allow_blocking(sk->sk_allocation))
> + if 
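
The archived diff is cut off above; based on the commit message, the new test
plausibly looks like the following sketch (the exact form in the posted patch
may differ, and the follow-up discussion above suggests wrapping this test in
a new gfpflags_ helper):

static inline struct page_frag *sk_page_frag(struct sock *sk)
{
	if (gfpflags_allow_blocking(sk->sk_allocation) &&
	    !(sk->sk_allocation & __GFP_MEMALLOC))
		return &current->task_frag;

	return &sk->sk_frag;
}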

Re: [PATCH] net: stmmac: Fix the problem of tso_xmit

2019-10-18 Thread Eric Dumazet



On 10/17/19 3:59 AM, Shaokun Zhang wrote:
> From: yuqi jin 
> 
> When the address width of DMA is greater than 32, the packet header occupies
> a BD descriptor. The starting address of the data should be added to the
> header length.
> 
> Cc: Giuseppe Cavallaro 
> Cc: Alexandre Torgue 
> Cc: Jose Abreu 
> Cc: "David S. Miller" 
> Cc: Maxime Coquelin 
> Signed-off-by: yuqi jin 
> Signed-off-by: Shaokun Zhang 
> ---

Please add a Fixes: tag,
thanks.
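
For reference, the requested tag names the commit that introduced the bug, in
the usual form (hash and subject below are placeholders):

Fixes: 123456789abc ("net: stmmac: subject of the offending commit")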



[tip: timers/urgent] hrtimer: Annotate lockless access to timer->base

2019-10-14 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: ff229eee3d897f52bd001c841f2d3cce8853ecdc
Gitweb:
https://git.kernel.org/tip/ff229eee3d897f52bd001c841f2d3cce8853ecdc
Author:Eric Dumazet 
AuthorDate:Tue, 08 Oct 2019 10:32:04 -07:00
Committer: Thomas Gleixner 
CommitterDate: Mon, 14 Oct 2019 15:51:49 +02:00

hrtimer: Annotate lockless access to timer->base

Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access
to timer->base")

lock_hrtimer_base() fetches timer->base without lock exclusion.

Compiler is allowed to read timer->base twice (even if considered dumb)
which could end up trying to lock migration_base and return
&migration_base.

  base = timer->base;
  if (likely(base != &migration_base)) {

   /* compiler reads timer->base again, and now (base == &migration_base)

   raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
   if (likely(base == timer->base))
	return base; /* == &migration_base ! */

Similarly the write sides must use WRITE_ONCE() to avoid store tearing.

Signed-off-by: Eric Dumazet 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20191008173204.180879-1-eduma...@google.com

---
 kernel/time/hrtimer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 0d4dc24..6560553 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -164,7 +164,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 	struct hrtimer_clock_base *base;
 
 	for (;;) {
-		base = timer->base;
+		base = READ_ONCE(timer->base);
 		if (likely(base != &migration_base)) {
 			raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
 			if (likely(base == timer->base))
@@ -244,7 +244,7 @@ again:
 		return base;
 
 	/* See the comment in lock_hrtimer_base() */
-	timer->base = &migration_base;
+	WRITE_ONCE(timer->base, &migration_base);
 	raw_spin_unlock(&base->cpu_base->lock);
 	raw_spin_lock(&new_base->cpu_base->lock);
 
@@ -253,10 +253,10 @@ again:
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
 			new_cpu_base = this_cpu_base;
-			timer->base = base;
+			WRITE_ONCE(timer->base, base);
 			goto again;
 		}
-		timer->base = new_base;
+		WRITE_ONCE(timer->base, new_base);
 	} else {
 		if (new_cpu_base != this_cpu_base &&
 		    hrtimer_check_target(timer, new_base)) {


Re: tcp: Checking a kmemdup() call in tcp_time_wait()

2019-10-14 Thread Eric Dumazet
On Mon, Oct 14, 2019 at 5:51 AM Markus Elfring  wrote:
>
>  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_minisocks.c?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n306
>  https://elixir.bootlin.com/linux/v5.4-rc2/source/net/ipv4/tcp_minisocks.c#L306
> …
>
> >> Can an other error reporting approach be nicer here?
> >
> > There is no error reported if kmemdup() has failed.
>
> How do data from the Linux allocation failure report fit to this information?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n878
>

This is coding style for newly submitted code.

We do not refactor code to the latest coding style, this would cost a lot.

Especially TCP stack that is quite often changed.

>
> > timewait is best effort.
>
> How do you think about to return an error code like “-ENOMEM” at this place?

tcp_time_wait() is void, the caller won't care. I told you time_wait
is best effort.

What is the problem you want to solve _exactly_ ?

Have you seen a real issue, or should you augment your static analyser
to not complain on :

ptr = kmemdup(...);
BUG_ON(<condition>);

(<condition> being different than (ptr == NULL))

I believe we have enough real bugs to fix.
I would prefer to not spend time arguing for every single BUG() or BUG_ON().

Thank you.


Re: [PATCH net] net: sched: act_mirred: drop skb's dst_entry in ingress redirection

2019-10-14 Thread Eric Dumazet



On 10/14/19 12:07 AM, Zhiyuan Hou wrote:
> 
> On 2019/10/12 6:59 下午, Eric Dumazet wrote:
>>
>> On 10/12/19 12:16 AM, Zhiyuan Hou wrote:
>>> In act_mirred's ingress redirection, if the skb's dst_entry is valid
>>> when netif_receive_skb is called, the following l3 stack processing
>>> (ip_rcv_finish_core) will check the dst_entry and skip the routing
>>> decision. Using the old dst_entry is unexpected and may discard the
>>> skb in some cases, for example when dst->dst_input points to dst_discard.
>>>
>>> This patch drops the skb's dst_entry before calling netif_receive_skb
>>> so that a routing decision can be made for the skb like for a normal
>>> ingress skb.
>>>
>>> Signed-off-by: Zhiyuan Hou 
>>> ---
>>>   net/sched/act_mirred.c | 5 -
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>>> index 9ce073a05414..6108a64c0cd5 100644
>>> --- a/net/sched/act_mirred.c
>>> +++ b/net/sched/act_mirred.c
>>> @@ -18,6 +18,7 @@
>>>  #include <linux/gfp.h>
>>>  #include <linux/if_arp.h>
>>>  #include <net/net_namespace.h>
>>> +#include <net/dst.h>
>>>  #include <net/netlink.h>
>>>  #include <net/pkt_sched.h>
>>>  #include <net/pkt_cls.h>
>>> @@ -298,8 +299,10 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
>>>  
>>>  	if (!want_ingress)
>>>  		err = dev_queue_xmit(skb2);
>>> -	else
>>> +	else {
>>> +		skb_dst_drop(skb2);
>>>  		err = netif_receive_skb(skb2);
>>> +	}
>>>  
>>>  	if (err) {
>>>  out:
>>>
>> Why is dst_discard used ?
> When a skb is sent from local to external, dst->dst_input will be
> assigned dst_discard after the routing decision. So if we redirect these
> skbs to the ingress stack, they will be dropped.
> 
> For ipvlan l2 mode or macvlan, clsact egress filters on the master device
> may also see these skbs even if they came from a slave device. Ingress
> redirection of these skbs may drop them in the l3 stack.

Can you please add a test, so that we can see what you are trying to do
exactly ?
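
For illustration, a minimal setup matching the described scenario would
redirect locally generated egress traffic back to ingress (device name
assumed):

  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 egress protocol all matchall \
      action mirred ingress redirect dev eth0

Locally originated skbs still carry their output dst at this point; without
the skb_dst_drop(), ip_rcv_finish_core() reuses that dst, whose dst_input can
point to dst_discard.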




>> This could actually drop packets, for loopback.
>>
>> A Fixes: tag would tremendously help, I wonder if you are not working around
>> the other issue Wei was tracking yesterday ( 
>> https://www.spinics.net/lists/netdev/msg604397.html )
> No, this is a different issue ^_^.

Please add a Fixes: tag then.

Thanks.


Re: tcp: Checking a kmemdup() call in tcp_time_wait()

2019-10-14 Thread Eric Dumazet



On 10/13/19 11:51 PM, Markus Elfring wrote:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_minisocks.c?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n306
>>> https://elixir.bootlin.com/linux/v5.4-rc2/source/net/ipv4/tcp_minisocks.c#L306
> …
>> Presumably the BUG would trigger if a really disturbing bug happened.
> 
> How “buggy” is this place if the function call “kmemdup” failed?

It is not buggy. The BUG will not trigger.

BUG_ON(tcptw->tw_md5_key && !tcp_alloc_md5sig_pool());

This would be different if we had instead :

BUG_ON(!tcptw->tw_md5_key && !tcp_alloc_md5sig_pool());

> 
> Can an other error reporting approach be nicer here?

There is no error reported if kmemdup() has failed.

timewait is best effort.


Re: INFO: task hung in addrconf_verify_work (2)

2019-10-13 Thread Eric Dumazet



On 10/13/19 9:42 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    c208bdb9 tcp: improve recv_skip_hint for tcp_zerocopy_rece..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15b6133b60
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d9be300620399522
> dashboard link: https://syzkaller.appspot.com/bug?extid=cf0adbb9c28c8866c788
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1548666f60
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11957d3b60
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+cf0adbb9c28c8866c...@syzkaller.appspotmail.com
> 
> INFO: task kworker/0:2:2913 blocked for more than 143 seconds.
>   Not tainted 5.4.0-rc1+ #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/0:2 D27000  2913  2 0x80004000
> Workqueue: ipv6_addrconf addrconf_verify_work
> Call Trace:
>  context_switch kernel/sched/core.c:3384 [inline]
>  __schedule+0x94f/0x1e70 kernel/sched/core.c:4069
>  schedule+0xd9/0x260 kernel/sched/core.c:4136
>  schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:4195
>  __mutex_lock_common kernel/locking/mutex.c:1033 [inline]
>  __mutex_lock+0x7b0/0x13c0 kernel/locking/mutex.c:1103
>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1118
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:72
>  addrconf_verify_work+0xe/0x20 net/ipv6/addrconf.c:4520
>  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
>  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
>  kthread+0x361/0x430 kernel/kthread.c:255
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
> 
> Showing all locks held in the system:
> 1 lock held by khungtaskd/1054:
>  #0: 88faae40 (rcu_read_lock){}, at: 
> debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:5337
> 3 locks held by kworker/0:2/2913:
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: 
> __write_once_size include/linux/compiler.h:226 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: 
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: atomic64_set 
> include/asm-generic/atomic-instrumented.h:855 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: 
> atomic_long_set include/asm-generic/atomic-long.h:40 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: set_work_data 
> kernel/workqueue.c:620 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: 
> set_work_pool_and_clear_pending kernel/workqueue.c:647 [inline]
>  #0: 888216019428 ((wq_completion)ipv6_addrconf){+.+.}, at: 
> process_one_work+0x88b/0x1740 kernel/workqueue.c:2240
>  #1: 8880a05b7dc0 ((addr_chk_work).work){+.+.}, at: 
> process_one_work+0x8c1/0x1740 kernel/workqueue.c:2244
>  #2: 89993b20 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20 
> net/core/rtnetlink.c:72
> 1 lock held by rsyslogd/8744:
>  #0: 8880899fa120 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0xee/0x110 
> fs/file.c:801
> 2 locks held by getty/8833:
>  #0: 888090baedd0 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f292e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8834:
>  #0: 88808d0f6dd0 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f392e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8835:
>  #0: 888090148e10 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f252e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8836:
>  #0: 8880a7ab3750 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f412e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8837:
>  #0: 8880a7accf10 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f3d2e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8838:
>  #0: 88808d0f7650 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f352e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8839:
>  #0: 88808d162bd0 (&tty->ldisc_sem){}, at: ldsem_down_read+0x33/0x40 
> drivers/tty/tty_ldsem.c:340
>  #1: c90005f112e0 (&ldata->atomic_read_lock){+.+.}, at: 
> n_tty_read+0x232/0x1c10 drivers/tty/n_tty.c:2156
> 1 lock held by syz-executor910/8859:
> 
> 

Re: [PATCH] hrtimer: annotate lockless access to timer->base

2019-10-13 Thread Eric Dumazet
On Tue, Oct 8, 2019 at 10:32 AM Eric Dumazet  wrote:
>
> Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access
> to timer->base")
>
> lock_hrtimer_base() fetches timer->base without lock exclusion.
>
> Compiler is allowed to read timer->base twice (even if considered dumb)
> and we could end up trying to lock migration_base and
> return &migration_base.
>
>   base = timer->base;
>   if (likely(base != &migration_base)) {
>
>/* compiler reads timer->base again, and now (base == &migration_base)
>
>raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
>if (likely(base == timer->base))
> return base; /* == &migration_base ! */
>
> Similarly the write sides should use WRITE_ONCE() to avoid
> store tearing.
>
> Signed-off-by: Eric Dumazet 
> Cc: Julien Grall 
> Cc: Thomas Gleixner 
> ---
>  kernel/time/hrtimer.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 
> 0d4dc241c0fb498036c91a571e65cb00f5d19ba6..65605530ee349c9682690c4fccb43aa9284d4287
>  100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -164,7 +164,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct 
> hrtimer *timer,
> struct hrtimer_clock_base *base;
>
> for (;;) {
> -   base = timer->base;
> +   base = READ_ONCE(timer->base);
> if (likely(base != &migration_base)) {
> raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
> if (likely(base == timer->base))
> @@ -244,7 +244,7 @@ switch_hrtimer_base(struct hrtimer *timer, struct 
> hrtimer_clock_base *base,
> return base;
>
> /* See the comment in lock_hrtimer_base() */
> -   timer->base = &migration_base;
> +   WRITE_ONCE(timer->base, &migration_base);
> raw_spin_unlock(&base->cpu_base->lock);
> raw_spin_lock(&new_base->cpu_base->lock);
>
> @@ -253,10 +253,10 @@ switch_hrtimer_base(struct hrtimer *timer, struct 
> hrtimer_clock_base *base,
> raw_spin_unlock(&new_base->cpu_base->lock);
> raw_spin_lock(&base->cpu_base->lock);
> new_cpu_base = this_cpu_base;
> -   timer->base = base;
> +   WRITE_ONCE(timer->base, base);
> goto again;
> }
> -   timer->base = new_base;
> +   WRITE_ONCE(timer->base, new_base);
> } else {
> if (new_cpu_base != this_cpu_base &&
> hrtimer_check_target(timer, new_base)) {
> --
> 2.23.0.581.g78d2f28ef7-goog
>

Any news on this patch ?

If more information is needed, let me know.

Maybe I need to point to :

commit b831275a3553c32091222ac619cfddd73a5553fb timers: Plug locking
race vs. timer migration

Thanks


Re: [PATCH] net: core: datagram: tidy up copy functions a bit

2019-10-13 Thread Eric Dumazet



On 10/13/19 3:41 PM, Vito Caputo wrote:

> I read it, thank you for your responses.
> 
> Do you have any guidance to offer someone wanting to contribute with 1-2
> hours available per day?  I don't want to cause a nuisance, but would
> like to help where I can.  My flawed assumption was that small, isolated
> hygienic contributions without functionally changing anything would be
> appropriate.

I suggest you look at the syzkaller huge queue of bugs.

You will learn more interesting stuff, and you will help the community
in a more quantifiable way.

https://syzkaller.appspot.com/upstream




Re: [PATCH] net: core: datagram: tidy up copy functions a bit

2019-10-13 Thread Eric Dumazet



On 10/13/19 1:01 PM, Vito Caputo wrote:
> On Sun, Oct 13, 2019 at 12:30:41PM -0700, Eric Dumazet wrote:
>>
>>
>> On 10/12/19 4:55 AM, Vito Caputo wrote:
>>> Eliminate some verbosity by using min() macro and consolidating some
>>> things, also fix inconsistent zero tests (! vs. == 0).
>>>
>>> Signed-off-by: Vito Caputo 
>>> ---
>>>  net/core/datagram.c | 44 ++++++++++++++------------------------------
>>>  1 file changed, 14 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>>> index 4cc8dc5db2b7..08d403f93952 100644
>>> --- a/net/core/datagram.c
>>> +++ b/net/core/datagram.c
>>> @@ -413,13 +413,11 @@ static int __skb_datagram_iter(const struct sk_buff 
>>> *skb, int offset,
>>> struct iov_iter *), void *data)
>>>  {
>>> int start = skb_headlen(skb);
>>> -   int i, copy = start - offset, start_off = offset, n;
>>> +   int i, copy, start_off = offset, n;
>>> struct sk_buff *frag_iter;
>>>  
>>> /* Copy header. */
>>> -   if (copy > 0) {
>>> -   if (copy > len)
>>> -   copy = len;
>>> +   if ((copy = min(start - offset, len)) > 0) {
>>
>> No, we prefer not having this kind of construct anymore.
>>
>> This refactoring looks unnecessary code churn, making our future backports 
>> not
>> clean cherry-picks.
>>
>> Simply making sure this patch does not bring a regression is very time 
>> consuming.
> 
> Should I not bother submitting patches for such cleanups?
> 
> I submitted another, more trivial patch, is it also considered unnecessary 
> churn:
> 
> ---
> 
> Author: Vito Caputo 
> Date:   Sat Oct 12 17:10:41 2019 -0700
> 
> net: core: skbuff: skb_checksum_setup() drop err
> 
> Return directly from all switch cases, no point in storing in err.
> 
> Signed-off-by: Vito Caputo 
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f5f904f46893..c59b68a413b5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4888,23 +4888,14 @@ static int skb_checksum_setup_ipv6(struct sk_buff 
> *skb, bool recalculate)
>   */
>  int skb_checksum_setup(struct sk_buff *skb, bool recalculate)
>  {
> -   int err;
> -
> switch (skb->protocol) {
> case htons(ETH_P_IP):
> -   err = skb_checksum_setup_ipv4(skb, recalculate);
> -   break;
> -
> +   return skb_checksum_setup_ipv4(skb, recalculate);
> case htons(ETH_P_IPV6):
> -   err = skb_checksum_setup_ipv6(skb, recalculate);
> -   break;
> -
> +   return skb_checksum_setup_ipv6(skb, recalculate);
> default:
> -   err = -EPROTO;
> -   break;
> +   return -EPROTO;
> }
> -
> -   return err;
>  }
>  EXPORT_SYMBOL(skb_checksum_setup);
> 
> ---
> 
> Asking to calibrate my thresholds to yours, since I was planning to volunteer
> some time each evening to reading kernel code and submitting any obvious
> cleanups.
> 

This is not a cleanup.

You prefer seeing the code written the way you did, but that is really a matter 
of taste.

Think about backports of real bug fixes to stable kernels.

Having these re-writes of code makes things less easy for us really.
So in general we tend to leave the existing code style.

I already replied to the other patch submission, please read

https://marc.info/?l=linux-netdev=157099669227635=2




Re: tcp: Checking a kmemdup() call in tcp_time_wait()

2019-10-13 Thread Eric Dumazet



On 10/12/19 7:51 AM, Markus Elfring wrote:
> Hello,
> 
> I tried another script for the semantic patch language out.
> This source code analysis approach points out that the implementation
> of the function “tcp_time_wait” contains also a call of the function 
> “kmemdup”.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_minisocks.c?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n306
> https://elixir.bootlin.com/linux/v5.4-rc2/source/net/ipv4/tcp_minisocks.c#L306
> 
> * Do you find the usage of the macro call “BUG_ON” still appropriate at this 
> place?
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/checkpatch.pl?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n4080
> 
> * Is there a need to adjust the error handling here?

Presumably the BUG would trigger if a really disturbing bug happened.

There is no chance a timewait socket could be created with a MD5 key, 
if the established socket that is the 'parent' of the timewait
has not a MD5 context itself.

The parent socket only could have MD5 context if tcp_md5sig_pool_populated
could have been set to true.

Once tcp_md5sig_pool_populated is true it can never go back to false.

So the bug here would be that a socket  had a successful MD5 context,
and following tcp_alloc_md5sig_pool() would return false.

We can discuss of all BUG() in general, some people simply disable
all of them (cf CONFIG_BUG), but this particular one does not seem
specially bad to me, compared to others.



Re: [PATCH] net: core: datagram: tidy up copy functions a bit

2019-10-13 Thread Eric Dumazet



On 10/12/19 4:55 AM, Vito Caputo wrote:
> Eliminate some verbosity by using min() macro and consolidating some
> things, also fix inconsistent zero tests (! vs. == 0).
> 
> Signed-off-by: Vito Caputo 
> ---
>  net/core/datagram.c | 44 ++++++++++++++------------------------------
>  1 file changed, 14 insertions(+), 30 deletions(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 4cc8dc5db2b7..08d403f93952 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -413,13 +413,11 @@ static int __skb_datagram_iter(const struct sk_buff 
> *skb, int offset,
>   struct iov_iter *), void *data)
>  {
>   int start = skb_headlen(skb);
> - int i, copy = start - offset, start_off = offset, n;
> + int i, copy, start_off = offset, n;
>   struct sk_buff *frag_iter;
>  
>   /* Copy header. */
> - if (copy > 0) {
> - if (copy > len)
> - copy = len;
> + if ((copy = min(start - offset, len)) > 0) {

No, we prefer not having this kind of construct anymore.

This refactoring looks unnecessary code churn, making our future backports not
clean cherry-picks.

Simply making sure this patch does not bring a regression is very time 
consuming.


Re: [PATCH net-next 2/2] net: core: increase the default size of GRO_NORMAL skb lists to flush

2019-10-12 Thread Eric Dumazet
On Sat, Oct 12, 2019 at 2:22 AM Alexander Lobakin  wrote:

>
> I've generated another solution. Considering that gro_normal_batch
> is very individual for every single case, maybe it would be better to
> make it per-NAPI (or per-netdevice) variable rather than a global
> across the kernel?
> I think most network-capable configurations and systems have more
> than one network device nowadays, and they might need different values
> for achieving their bests.
>
> One possible variant is:
>
> #define THIS_DRIVER_GRO_NORMAL_BATCH16
>
> /* ... */
>
> netif_napi_add(dev, napi, this_driver_rx_poll, NAPI_POLL_WEIGHT); /*
> napi->gro_normal_batch will be set to the systcl value during NAPI
> context initialization */
> napi_set_gro_normal_batch(napi, THIS_DRIVER_GRO_NORMAL_BATCH); /* new
> static inline helper, napi->gro_normal_batch will be set to the
> driver-speficic value of 16 */
>
> The second possible variant is to make gro_normal_batch sysctl
> per-netdevice to tune it from userspace.
> Or we can combine them into one to make it available for tweaking from
> both driver and userspace, just like it's now with XPS CPUs setting.
>
> If you'll find any of this reasonable and worth implementing, I'll come
> with it in v2 after a proper testing.

Most likely the optimal tuning is also a function of the host cpu caches.

Building a too big list can also lead to premature cache evictions.

Tuning the value on your test machines does not mean the value will be good
for other systems.

Adding yet another per device value should only be done if you demonstrate
a significant performance increase compared to the conservative value
Edward chose.

Also the behavior can be quite different depending on the protocols,
make sure you test handling of TCP pure ACK packets.

Accumulating 64 (in case the device uses standard NAPI_POLL_WEIGHT)
of them before entering upper stacks seems not a good choice, since 64 skbs
will need to be kept in the GRO system, compared to only 8 with Edward value.


Re: [PATCH net] rxrpc: Fix possible NULL pointer access in ICMP handling

2019-10-12 Thread Eric Dumazet



On 10/12/19 3:49 AM, Eric Dumazet wrote:

> 
> Okay, but we also need this.
> 
> diff --git a/net/rxrpc/peer_event.c b/net/rxrpc/peer_event.c
> index 
> c97ebdc043e44525eaecdd54bc447c1895bdca74..38db10e61f7a5cb50f9ee036b5e16ec284e723ac
>  100644
> --- a/net/rxrpc/peer_event.c
> +++ b/net/rxrpc/peer_event.c
> @@ -145,9 +145,9 @@ static void rxrpc_adjust_mtu(struct rxrpc_peer *peer, 
> struct sock_exterr_skb *se
>   */
>  void rxrpc_error_report(struct sock *sk)
>  {
> +   struct rxrpc_local *local = rcu_dereference_sk_user_data(sk);
> struct sock_exterr_skb *serr;
> struct sockaddr_rxrpc srx;
> -   struct rxrpc_local *local = sk->sk_user_data;
> struct rxrpc_peer *peer;
> struct sk_buff *skb;
> 

I will submit the patch later once David pushes his net tree, since I do not 
know yet
the sha1 of your patch (to provide a proper Fixes: tag)


Re: [PATCH net] net: sched: act_mirred: drop skb's dst_entry in ingress redirection

2019-10-12 Thread Eric Dumazet



On 10/12/19 12:16 AM, Zhiyuan Hou wrote:
> In act_mirred's ingress redirection, if the skb's dst_entry is valid
> when calling netif_receive_skb(), the following l3 stack processing
> (ip_rcv_finish_core) will check the dst_entry and skip the routing
> decision. Using the old dst_entry is unexpected and may discard the
> skb in some cases. For example dst->dst_input points to dst_discard.
> 
> This patch drops the skb's dst_entry before calling netif_receive_skb
> so that the skb can be made routing decision like a normal ingress
> skb.
> 
> Signed-off-by: Zhiyuan Hou 
> ---
>  net/sched/act_mirred.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 9ce073a05414..6108a64c0cd5 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -298,8 +299,10 @@ static int tcf_mirred_act(struct sk_buff *skb, const 
> struct tc_action *a,
>  
>   if (!want_ingress)
>   err = dev_queue_xmit(skb2);
> - else
> + else {
> + skb_dst_drop(skb2);
>   err = netif_receive_skb(skb2);
> + }
>  
>   if (err) {
>  out:
> 

Why is dst_discard used ?

This could actually drop packets, for loopback.

A Fixes: tag would tremendously help, I wonder if you are not working around
the other issue Wei was tracking yesterday ( 
https://www.spinics.net/lists/netdev/msg604397.html )



Re: [PATCH net] rxrpc: Fix possible NULL pointer access in ICMP handling

2019-10-12 Thread Eric Dumazet



On 10/10/19 7:52 AM, David Howells wrote:
> If an ICMP packet comes in on the UDP socket backing an AF_RXRPC socket as
> the UDP socket is being shut down, rxrpc_error_report() may get called to
> deal with it after sk_user_data on the UDP socket has been cleared, leading
> to a NULL pointer access when this local endpoint record gets accessed.
> 
> Fix this by just returning immediately if sk_user_data was NULL.
> 
> The oops looks like the following:
> 
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> ...
> RIP: 0010:rxrpc_error_report+0x1bd/0x6a9
> ...
> Call Trace:
>  ? sock_queue_err_skb+0xbd/0xde
>  ? __udp4_lib_err+0x313/0x34d
>  __udp4_lib_err+0x313/0x34d
>  icmp_unreach+0x1ee/0x207
>  icmp_rcv+0x25b/0x28f
>  ip_protocol_deliver_rcu+0x95/0x10e
>  ip_local_deliver+0xe9/0x148
>  __netif_receive_skb_one_core+0x52/0x6e
>  process_backlog+0xdc/0x177
>  net_rx_action+0xf9/0x270
>  __do_softirq+0x1b6/0x39a
>  ? smpboot_register_percpu_thread+0xce/0xce
>  run_ksoftirqd+0x1d/0x42
>  smpboot_thread_fn+0x19e/0x1b3
>  kthread+0xf1/0xf6
>  ? kthread_delayed_work_timer_fn+0x83/0x83
>  ret_from_fork+0x24/0x30
> 
> Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by 
> userspace and kernel both")
> Reported-by: syzbot+611164843bd48cc21...@syzkaller.appspotmail.com
> Signed-off-by: David Howells 
> ---
> 
>  net/rxrpc/peer_event.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/rxrpc/peer_event.c b/net/rxrpc/peer_event.c
> index c97ebdc043e4..61451281d74a 100644
> --- a/net/rxrpc/peer_event.c
> +++ b/net/rxrpc/peer_event.c
> @@ -151,6 +151,9 @@ void rxrpc_error_report(struct sock *sk)
>   struct rxrpc_peer *peer;
>   struct sk_buff *skb;
>  
> + if (unlikely(!local))
> + return;
> +
>   _enter("%p{%d}", sk, local->debug_id);
>  
>   /* Clear the outstanding error value on the socket so that it doesn't
> 

Okay, but we also need this.

diff --git a/net/rxrpc/peer_event.c b/net/rxrpc/peer_event.c
index 
c97ebdc043e44525eaecdd54bc447c1895bdca74..38db10e61f7a5cb50f9ee036b5e16ec284e723ac
 100644
--- a/net/rxrpc/peer_event.c
+++ b/net/rxrpc/peer_event.c
@@ -145,9 +145,9 @@ static void rxrpc_adjust_mtu(struct rxrpc_peer *peer, 
struct sock_exterr_skb *se
  */
 void rxrpc_error_report(struct sock *sk)
 {
+   struct rxrpc_local *local = rcu_dereference_sk_user_data(sk);
struct sock_exterr_skb *serr;
struct sockaddr_rxrpc srx;
-   struct rxrpc_local *local = sk->sk_user_data;
struct rxrpc_peer *peer;
struct sk_buff *skb;



Re: Potential NULL pointer deference in spi

2019-10-11 Thread Eric Dumazet



On 10/10/19 10:31 PM, Yizhuo Zhai wrote:
> Hi Eric:
> 
> My apologies for bothering, we got those report via static analysis
> and haven't got a good method to verify the path to trigger them.
> Therefore I sent those email to you maintainers first since you
> know much better about the details. Sorry again for your time and
> I take your suggestions.

My suggestion is that you need to make deep investigations on your own,
before sending mails to lkml@, reaching thousands of people on the planet.

Static analysis tools having too many false positives are not worth
the time spent by humans.

I knew nothing about drivers/spi/spi.c, but after few minutes reading the code,
it was clear your report was wrong.

Do not ask us to do what you should do yourself.

Thanks.

> 
> On Wed, Oct 9, 2019 at 10:48 PM Eric Dumazet  wrote:
>>
>>
>>
>> On 10/9/19 10:37 PM, Yizhuo Zhai wrote:
>>> Hi All:
>>>
>>> drivers/spi/spi.c:
>>>
>>> The function to_spi_device() could return NULL, but some callers
>>> in this file does not check the return value while directly dereference
>>> it, which seems potentially unsafe.
>>>
>>> Such callers include spidev_release(),  spi_dev_check(),
>>> driver_override_store(), etc.
>>>
>>>
>>
>>
>> Many of your reports are completely bogus.
>>
>> I suggest you spend more time before sending such emails to a very large 
>> audience
>> and risk being ignored at some point.
>>
>> Thanks.
> 
> 
> 


Re: Potential NULL pointer deference in spi

2019-10-09 Thread Eric Dumazet



On 10/9/19 10:37 PM, Yizhuo Zhai wrote:
> Hi All:
> 
> drivers/spi/spi.c:
> 
> The function to_spi_device() could return NULL, but some callers
> in this file does not check the return value while directly dereference
> it, which seems potentially unsafe.
> 
> Such callers include spidev_release(),  spi_dev_check(),
> driver_override_store(), etc.
> 
> 


Many of your reports are completely bogus.

I suggest you spend more time before sending such emails to a very large audience
and risk being ignored at some point.

Thanks.


Re: [PATCH] rcu: avoid data-race in rcu_gp_fqs_check_wake()

2019-10-09 Thread Eric Dumazet
On Wed, Oct 9, 2019 at 8:18 PM Paul E. McKenney  wrote:

> Again, good catch, applied for review and testing, thank you!
>
> I added another READ_ONCE() to dump_blkd_tasks(), which is not exercised
> unless you get an RCU CPU stall warning or some such.  The updated patch
> is below, please let me know if I messed anything up.
>
> Thanx, Paul

This looks good to me, thanks Paul.


[PATCH] rcu: avoid data-race in rcu_gp_fqs_check_wake()

2019-10-09 Thread Eric Dumazet
rcu_gp_fqs_check_wake() uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might write over this field.

We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :

BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / 
rcu_preempt_deferred_qs_irqrestore

write to 0x85a7f190 of 8 bytes by task 7317 on cpu 0:
 rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
 rcu_read_unlock include/linux/rcupdate.h:645 [inline]
 __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
 ip_queue_xmit+0x45/0x60 include/net/ip.h:236
 __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
 __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
 tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
 tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
 tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1864 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414

read to 0x85a7f190 of 8 bytes by task 10 on cpu 1:
 rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
 rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
 rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
 rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011

Signed-off-by: Eric Dumazet 
Cc: "Paul E. McKenney" 
Reported-by: syzbot 
---
 kernel/rcu/tree_plugin.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 
2defc7fe74c3991b03d1968a18c4d208bccf25f9..5500a8ba92da19f49fecf8db583ac7b5a21f5e7b
 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -220,7 +220,7 @@ static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, 
struct rcu_data *rdp)
 * blocked tasks.
 */
if (!rnp->gp_tasks && (blkd_state & RCU_GP_BLKD)) {
-   rnp->gp_tasks = &t->rcu_node_entry;
+   WRITE_ONCE(rnp->gp_tasks, &t->rcu_node_entry);
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq);
}
if (!rnp->exp_tasks && (blkd_state & RCU_EXP_BLKD))
@@ -340,7 +340,7 @@ EXPORT_SYMBOL_GPL(rcu_note_context_switch);
  */
 static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 {
-   return rnp->gp_tasks != NULL;
+   return READ_ONCE(rnp->gp_tasks) != NULL;
 }
 
 /* Bias and limit values for ->rcu_read_lock_nesting. */
@@ -493,7 +493,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, 
unsigned long flags)
trace_rcu_unlock_preempted_task(TPS("rcu_preempt"),
rnp->gp_seq, t->pid);
if (&t->rcu_node_entry == rnp->gp_tasks)
-   rnp->gp_tasks = np;
+   WRITE_ONCE(rnp->gp_tasks, np);
if (&t->rcu_node_entry == rnp->exp_tasks)
rnp->exp_tasks = np;
if (IS_ENABLED(CONFIG_RCU_BOOST)) {
@@ -663,7 +663,7 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node 
*rnp)
dump_blkd_tasks(rnp, 10);
if (rcu_preempt_has_tasks(rnp) &&
(rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
-   rnp->gp_tasks = rnp->blkd_tasks.next;
+   WRITE_ONCE(rnp->gp_tasks, rnp->blkd_tasks.next);
t = container_of(rnp->gp_tasks, struct task_struct,
 rcu_node_entry);
trace_rcu_unlock_preempted_task(TPS("rcu_preempt-GPS"),
-- 
2.23.0.581.g78d2f28ef7-goog



Re: Kernel Concurrency Sanitizer (KCSAN)

2019-10-09 Thread Eric Dumazet



On 10/9/19 12:45 AM, Dmitry Vyukov wrote:
> On Sat, Oct 5, 2019 at 6:16 AM Dmitry Vyukov  wrote:
>>
>> On Sat, Oct 5, 2019 at 2:58 AM Eric Dumazet  wrote:
>>>> This one is tricky. What I think we need to avoid is an onslaught of
>>>> patches adding READ_ONCE/WRITE_ONCE without a concrete analysis of the
>>>> code being modified. My worry is that Joe Developer is eager to get their
>>>> first patch into the kernel, so runs this tool and starts spamming
>>>> maintainers with these things to the point that they start ignoring KCSAN
>>>> reports altogether because of the time they take up.
>>>>
>>>> I suppose one thing we could do is to require each new READ_ONCE/WRITE_ONCE
>>>> to have a comment describing the racy access, a bit like we do for memory
>>>> barriers. Another possibility would be to use atomic_t more widely if
>>>> there is genuine concurrency involved.
>>>>
>>>
>>> About READ_ONCE() and WRITE_ONCE(), we will probably need
>>>
>>> ADD_ONCE(var, value)  for arches that can implement the RMW in a single 
>>> instruction.
>>>
>>> WRITE_ONCE(var, var + value) does not look pretty, and increases register 
>>> pressure.
>>
>> FWIW modern compilers can handle this if we tell them what we are trying to 
>> do:
>>
>> void foo(int *p, int x)
>> {
>> x += __atomic_load_n(p, __ATOMIC_RELAXED);
>> __atomic_store_n(p, x, __ATOMIC_RELAXED);
>> }
>>
>> $ clang test.c -c -O2 && objdump -d test.o
>>
>>  :
>>0: 01 37add%esi,(%rdi)
>>2: c3retq
>>
>> We can have syntactic sugar on top of this of course.
> 
> An interesting precedent come up in another KCSAN bug report. Namely,
> it may be reasonable for a compiler to use different optimization
> heuristics for concurrent and non-concurrent code. Consider there are
> some legal code transformations, but it's unclear if they are
> profitable or not. It may be the case that for non-concurrent code the
> expectation is that it's a profitable transformation, but for
> concurrent code it is not. So that may be another reason to
> communicate to compiler what we want to do, rather than trying to
> trick and play against each other. I've added the concrete example
> here:
> https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance
> 

Note that for bit fields, READ_ONCE() won't work.

Concrete example in net/xfrm/xfrm_algo.c:xfrm_probe_algs(void)
...
if (aalg_list[i].available != status)
aalg_list[i].available = status;
...
if (ealg_list[i].available != status)
ealg_list[i].available = status;
...
if (calg_list[i].available != status)
calg_list[i].available = status;
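
To illustrate why (a minimal sketch with a hypothetical struct, not taken
from the kernel): READ_ONCE() needs the address of its operand, and C
forbids taking the address of a bit field, so such an access cannot even
be expressed:

struct alg_entry {
	unsigned int available:1;	/* bit field: has no address */
};

static void update_status(struct alg_entry *e, unsigned int status)
{
	/* READ_ONCE(e->available) would fail to build: READ_ONCE()
	 * internally takes &(e->available), which is invalid C for a
	 * bit field.  The whole containing word has to be read or
	 * written instead.
	 */
	if (e->available != status)
		e->available = status;
}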



[PATCH] hrtimer: annotate lockless access to timer->base

2019-10-08 Thread Eric Dumazet
Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access
to timer->base")

lock_hrtimer_base() fetches timer->base without lock exclusion.

Compiler is allowed to read timer->base twice (even if considered dumb)
and we could end up trying to lock migration_base and
return &migration_base.

  base = timer->base;
  if (likely(base != &migration_base)) {

   /* compiler reads timer->base again, and now (base == &migration_base)

   raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
   if (likely(base == timer->base))
return base; /* == &migration_base ! */

Similarly the write sides should use WRITE_ONCE() to avoid
store tearing.

Signed-off-by: Eric Dumazet 
Cc: Julien Grall 
Cc: Thomas Gleixner 
---
 kernel/time/hrtimer.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 
0d4dc241c0fb498036c91a571e65cb00f5d19ba6..65605530ee349c9682690c4fccb43aa9284d4287
 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -164,7 +164,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct 
hrtimer *timer,
struct hrtimer_clock_base *base;
 
for (;;) {
-   base = timer->base;
+   base = READ_ONCE(timer->base);
if (likely(base != &migration_base)) {
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
if (likely(base == timer->base))
@@ -244,7 +244,7 @@ switch_hrtimer_base(struct hrtimer *timer, struct 
hrtimer_clock_base *base,
return base;
 
/* See the comment in lock_hrtimer_base() */
-   timer->base = &migration_base;
+   WRITE_ONCE(timer->base, &migration_base);
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);
 
@@ -253,10 +253,10 @@ switch_hrtimer_base(struct hrtimer *timer, struct 
hrtimer_clock_base *base,
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
-   timer->base = base;
+   WRITE_ONCE(timer->base, base);
goto again;
}
-   timer->base = new_base;
+   WRITE_ONCE(timer->base, new_base);
} else {
if (new_cpu_base != this_cpu_base &&
hrtimer_check_target(timer, new_base)) {
-- 
2.23.0.581.g78d2f28ef7-goog



Re: Potential NULL pointer deference in net: sched

2019-10-07 Thread Eric Dumazet



On 10/7/19 2:08 PM, Yizhuo Zhai wrote:
> Hi All:
> 
> net/sched/sch_mq.c:
> Inside function mq_dump_class(), mq_queue_get() could return NULL,
> however, the return value of dev_queue is not checked  and get used.
> This could potentially be unsafe.
> 
> 

Not really.

mq_dump_class() is called by a layer that made sure the @cl argument
was not complete garbage.
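
For context, the helper in question looks roughly like this (paraphrased
from net/sched/sch_mq.c; details may vary by kernel version):

static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long ncl)
{
	struct net_device *dev = qdisc_dev(sch);
	unsigned long ntx = ncl - 1;	/* class ids are 1-based */

	if (ntx >= dev->num_tx_queues)
		return NULL;
	return netdev_get_tx_queue(dev, ntx);
}

The qdisc class walker only hands mq_dump_class() values in
[1, num_tx_queues], so the NULL branch is not reachable from that path.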



Re: Kernel Concurrency Sanitizer (KCSAN)

2019-10-04 Thread Eric Dumazet



On 9/20/19 8:54 AM, Will Deacon wrote:

> 
> This one is tricky. What I think we need to avoid is an onslaught of
> patches adding READ_ONCE/WRITE_ONCE without a concrete analysis of the
> code being modified. My worry is that Joe Developer is eager to get their
> first patch into the kernel, so runs this tool and starts spamming
> maintainers with these things to the point that they start ignoring KCSAN
> reports altogether because of the time they take up.
> 
> I suppose one thing we could do is to require each new READ_ONCE/WRITE_ONCE
> to have a comment describing the racy access, a bit like we do for memory
> barriers. Another possibility would be to use atomic_t more widely if
> there is genuine concurrency involved.
> 

About READ_ONCE() and WRITE_ONCE(), we will probably need

ADD_ONCE(var, value)  for arches that can implement the RMW in a single 
instruction.

WRITE_ONCE(var, var + value) does not look pretty, and increases register 
pressure.
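
A minimal sketch of what such a helper could look like (illustrative only,
not an existing kernel macro), built on the relaxed __atomic builtins so
the compiler is allowed to fold the RMW into a single add-to-memory
instruction on x86:

#define ADD_ONCE(var, val)						\
	__atomic_store_n(&(var),					\
			 __atomic_load_n(&(var), __ATOMIC_RELAXED) + (val), \
			 __ATOMIC_RELAXED)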

I had a look at the first KCSAN reports, and I can tell that tcp_poll() being 
lockless
means we need to add hundreds of READ_ONCE(), WRITE_ONCE() and ADD_ONCE() all 
over the places.

-> Absolute nightmare for future backports to stable branches.


Re: [PATCH] vsock/virtio: add support for MSG_PEEK

2019-09-27 Thread Eric Dumazet



On 9/27/19 1:55 AM, Stefano Garzarella wrote:

> Good catch!
> 
> Maybe we can solve in this way:
> 
>   list_for_each_entry(pkt, &vvs->rx_queue, list) {
>   size_t off = pkt->off;
> 
>   if (total == len)
>   break;
> 
>   while (total < len && off < pkt->len) {
>   /* using 'off' instead of 'pkt->off' */
>   ...
> 
>   total += bytes;
>   off += bytes;
>   }
>   }
> 
> What do you think?
>

Maybe, but I need to see a complete patch, evil is in the details :)
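
For the record, a complete version following that idea might look like the
sketch below (untested, for discussion only): keep a local 'off' per packet
so re-reading pkt->off after rx_lock was dropped no longer matters, and
never advance the stored state since this is a peek.

static ssize_t
virtio_transport_stream_do_peek(struct vsock_sock *vsk,
				struct msghdr *msg,
				size_t len)
{
	struct virtio_vsock_sock *vvs = vsk->trans;
	struct virtio_vsock_pkt *pkt;
	size_t bytes, total = 0, off;
	int err = -EFAULT;

	spin_lock_bh(&vvs->rx_lock);

	list_for_each_entry(pkt, &vvs->rx_queue, list) {
		off = pkt->off;

		if (total == len)
			break;

		while (total < len && off < pkt->len) {
			bytes = len - total;
			if (bytes > pkt->len - off)
				bytes = pkt->len - off;

			/* sk_lock is held by the caller so no one else
			 * can dequeue.  Unlock rx_lock since
			 * memcpy_to_msg() may sleep.
			 */
			spin_unlock_bh(&vvs->rx_lock);

			err = memcpy_to_msg(msg, pkt->buf + off, bytes);
			if (err)
				goto out;

			spin_lock_bh(&vvs->rx_lock);

			total += bytes;
			off += bytes;
		}
	}

	spin_unlock_bh(&vvs->rx_lock);

	return total;

out:
	if (total)
		err = total;
	return err;
}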



Re: [RT PATCH 1/3] hrtimer: Use READ_ONCE to access timer->base in hrimer_grab_expiry_lock()

2019-09-26 Thread Eric Dumazet



On 8/21/19 6:50 AM, Thomas Gleixner wrote:
> On Wed, 21 Aug 2019, Sebastian Andrzej Siewior wrote:
> 
>> On 2019-08-21 10:24:07 [+0100], Julien Grall wrote:
>>> The update to timer->base is protected by the base->cpu_base->lock().
>>> However, hrtimer_grab_expiry_lock() does not access it with the lock.
>>>
>>> So it would theorically be possible to have timer->base changed under
>>> our feet. We need to prevent the compiler to refetch timer->base so the
>>> check and the access is performed on the same base.
>>
>> It is not a problem if the timer's bases changes. We get here because we
>> want to help the timer to complete its callback.
>> The base can only change if the timer gets re-armed on another CPU which
>> means is completed callback. In every case we can cancel the timer on
>> the next iteration.
> 
> It _IS_ a problem when the base changes and the compiler reloads
> 
>CPU0   CPU1
>base = timer->base;
> 
>lock(base->);
>   switch base
> 
>reload
>   base = timer->base;
> 
>unlock(base->);
> 

It seems we could hit a similar problem in lock_hrtimer_base()

 base = timer->base;

 if (likely(base != &migration_base)) {

 

 raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);

Probably not a big deal, since migration_base.cpu_base->lock can be locked just 
fine,
(without lockdep complaining that the lock has not been initialized since we 
use raw_ variant),
but this could cause unnecessary false sharing.


diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 
0d4dc241c0fb498036c91a571e65cb00f5d19ba6..fa881c03e0a1a351186a8d8f798dd7471067a951
 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -164,7 +164,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct 
hrtimer *timer,
struct hrtimer_clock_base *base;
 
for (;;) {
-   base = timer->base;
+   base = READ_ONCE(timer->base);
if (likely(base != &migration_base)) {
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
if (likely(base == timer->base))




Re: [PATCH] vsock/virtio: add support for MSG_PEEK

2019-09-26 Thread Eric Dumazet



On 9/26/19 11:23 AM, Matias Ezequiel Vara Larsen wrote:
> This patch adds support for MSG_PEEK. In such a case, packets are not
> removed from the rx_queue and credit updates are not sent.
> 
> Signed-off-by: Matias Ezequiel Vara Larsen 
> ---
>  net/vmw_vsock/virtio_transport_common.c | 50 
> +++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index 94cc0fa..938f2ed 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -264,6 +264,50 @@ static int virtio_transport_send_credit_update(struct 
> vsock_sock *vsk,
>  }
>  
>  static ssize_t
> +virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> + struct msghdr *msg,
> + size_t len)
> +{
> + struct virtio_vsock_sock *vvs = vsk->trans;
> + struct virtio_vsock_pkt *pkt;
> + size_t bytes, total = 0;
> + int err = -EFAULT;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + list_for_each_entry(pkt, &vvs->rx_queue, list) {
> + if (total == len)
> + break;
> +
> + bytes = len - total;
> + if (bytes > pkt->len - pkt->off)
> + bytes = pkt->len - pkt->off;
> +
> + /* sk_lock is held by caller so no one else can dequeue.
> +  * Unlock rx_lock since memcpy_to_msg() may sleep.
> +  */
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> + if (err)
> + goto out;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + total += bytes;
> + }
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + return total;
> +
> +out:
> + if (total)
> + err = total;
> + return err;
> +}
>

This seems buggy to me.

virtio_transport_recv_enqueue() seems to be able to add payload to the last 
packet in the queue.

The loop you wrote here would miss newly added chunks while the vvs->rx_lock 
spinlock has been released.

virtio_transport_stream_do_dequeue() is ok, because it makes sure pkt->off == 
pkt->len
before cleaning the packet from the queue.





Re: [PATCH] ipv6: Properly check reference count flag before taking reference

2019-09-25 Thread Eric Dumazet



On 9/23/19 8:06 AM, Petr Vorel wrote:
> Hi,
> 
>> People are reporting that WireGuard experiences erratic crashes on 5.3,
>> and bisected it down to 7d30a7f6424e. Casually flipping through that
>> commit I noticed that a flag is checked using `|` instead of `&`, which in
>> this current case, means that a reference is never incremented, which
>> would result in the use-after-free users are seeing. This commit changes
>> the `|` to the proper `&` test.
> 
>> Cc: sta...@vger.kernel.org
>> Fixes: 7d30a7f6424e ("Merge branch 
>> 'ipv6-avoid-taking-refcnt-on-dst-during-route-lookup'")
>> Signed-off-by: Jason A. Donenfeld 
> 
> Reviewed-by: Petr Vorel 
> 
> NOTE: this change was added in d64a1f574a29 ("ipv6: honor 
> RT6_LOOKUP_F_DST_NOREF in rule lookup logic")
> 
> Kind regards,
> Petr
> 

This was fixed earlier I think

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=7b09c2d052db4b4ad0b27b97918b46a7746966fa

>> ---
>>  net/ipv6/ip6_fib.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index 87f47bc55c5e..6e2af411cd9c 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -318,7 +318,7 @@ struct dst_entry *fib6_rule_lookup(struct net *net, 
>> struct flowi6 *fl6,
>>  if (rt->dst.error == -EAGAIN) {
>>  ip6_rt_put_flags(rt, flags);
>>  rt = net->ipv6.ip6_null_entry;
>> -if (!(flags | RT6_LOOKUP_F_DST_NOREF))
>> +if (!(flags & RT6_LOOKUP_F_DST_NOREF))
>>  dst_hold(&rt->dst);
>>  }


Re: [PATCH] kcm: use BPF_PROG_RUN

2019-09-24 Thread Eric Dumazet



On 9/24/19 11:59 AM, Daniel Borkmann wrote:
> On Mon, Sep 23, 2019 at 02:31:04PM -0700, Eric Dumazet wrote:
>> On 9/6/19 10:06 AM, Alexei Starovoitov wrote:
>>> On Fri, Sep 6, 2019 at 3:03 AM Yonghong Song  wrote:
>>>> On 9/5/19 2:15 PM, Sami Tolvanen wrote:
>>>>> Instead of invoking struct bpf_prog::bpf_func directly, use the
>>>>> BPF_PROG_RUN macro.
>>>>>
>>>>> Signed-off-by: Sami Tolvanen 
>>>>
>>>> Acked-by: Yonghong Song 
>>>
>>> Applied. Thanks
>>
>> Then we probably need this as well, what do you think ?
> 
> Yep, it's broken. 6cab5e90ab2b ("bpf: run bpf programs with preemption
> disabled") probably forgot about it since it wasn't using BPF_PROG_RUN()
> in the first place. If you get a chance, please send a proper patch,
> thanks!

Sure, I will send this today.



Re: [PATCH] kcm: use BPF_PROG_RUN

2019-09-23 Thread Eric Dumazet



On 9/6/19 10:06 AM, Alexei Starovoitov wrote:
> On Fri, Sep 6, 2019 at 3:03 AM Yonghong Song  wrote:
>>
>>
>>
>> On 9/5/19 2:15 PM, Sami Tolvanen wrote:
>>> Instead of invoking struct bpf_prog::bpf_func directly, use the
>>> BPF_PROG_RUN macro.
>>>
>>> Signed-off-by: Sami Tolvanen 
>>
>> Acked-by: Yonghong Song 
> 
> Applied. Thanks
> 

Then we probably need this as well, what do you think ?

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 
8f12f5c6ab875ebaa6c59c6268c337919fb43bb9..6508e88efdaf57f206b84307f5ad5915a2ed21f7
 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -378,8 +378,13 @@ static int kcm_parse_func_strparser(struct strparser 
*strp, struct sk_buff *skb)
 {
struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp);
struct bpf_prog *prog = psock->bpf_prog;
+   int res;
 
-   return BPF_PROG_RUN(prog, skb);
+   preempt_disable();
+   res = BPF_PROG_RUN(prog, skb);
+   preempt_enable();
+
+   return res;
 }
 
 static int kcm_read_sock_done(struct strparser *strp, int err)


Re: [PATCH] tcp: fix tcp_disconnect() not clear tp->fastopen_rsk sometimes

2019-09-06 Thread Eric Dumazet
On Fri, Sep 6, 2019 at 11:34 AM chunguo feng  wrote:
>
> From: fengchunguo 
>
> This patch avoids fastopen_rsk not be cleared every times, then occur
> the below BUG_ON:
> tcp_v4_destroy_sock
> ->BUG_ON(tp->fastopen_rsk);
>
> When playing back some videos from the network, tcp_disconnect() is used continually.

Wow, tcp_disconnect() being used by something else than syzkaller !

> kthread+0x114/0x140
> ret_from_fork+0x10/0x18
>
> Signed-off-by: fengchunguo 
> ---
>  net/ipv4/tcp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 61082065b26a..f5c354c0b24c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2655,6 +2655,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> /* Clean up fastopen related fields */
> tcp_free_fastopen_req(tp);
> inet->defer_connect = 0;
> +   tp->fastopen_rsk = 0;
>
> WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
>

This seems suspicious to me.

Are we leaking a block of memory after your patch ?

If the block of memory has been freed, maybe clear the pointer at the
place the free happened ?

I am traveling to Lisbon for LPC, maybe Yuchung or Neal can take a look.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/5/19 4:09 PM, Qian Cai wrote:

> Instead of repeatedly making generalized statements, could you enlighten me with
> some concrete examples that have the similar properties which would trigger a
> livelock,
> 
> - guaranteed GFP_ATOMIC allocations when processing softirq batches.
> - the allocation has a fallback mechanism that is unnecessary to warn a 
> failure.
> 
> I thought "skb" is a special-case here as every packet sent or received is
> handled using this data structure.
>

Just  'git grep GFP_ATOMIC -- net' and carefully study all the places.

You will discover many allocations done for incoming packets.

All of them can fail and trigger a trace.

Please fix the problem for good, do not pretend addressing the skb allocations
will solve it.

The skb allocation can succeed, then the following allocation might fail.

skb are one of the many objects that networking need to allocate dynamically.



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/5/19 4:09 PM, Qian Cai wrote:
> 
> I feel like you may not follow the thread closely. There are more details
> uncovered in the last few days and narrowed down to the culprits.
> 

I have followed the thread closely, thank you very much.

I am happy that the problem is addressed as I suggested.
Ie not individual patches adding selected __GFP_NOWARN.



Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-09-05 Thread Eric Dumazet



On 9/4/19 10:42 PM, Qian Cai wrote:

> To summarize, those look to me like all good long-term improvements that would
> reduce the likelihood of this kind of livelock in general especially for other
> unknown allocations that happen while processing softirqs, but it is still up 
> to
> the air if it fixes it 100% in all situations as printk() is going to take 
> more
> time and could deal with console hardware that involve irq_exit() anyway.
> 
> On the other hand, adding __GFP_NOWARN in the build_skb() allocation will fix
> this known NET_TX_SOFTIRQ case, which is common when softirqd is involved, at least
> in the short term. It even has the benefit of reducing the overall warn_alloc() noise
> out there.
> 
> I can resubmit with an update changelog. Does it make any sense?

It does not make sense.

We have thousands other GFP_ATOMIC allocations in the networking stacks.

Soon you will have to send more and more patches adding __GFP_NOWARN once
your workloads/tests can hit all these various points.

It is really time to fix this problem generically, instead of having
to review hundreds of patches.

This was my initial feedback really, nothing really has changed since.

The ability to send a warning with a stack trace, holding the cpu
for many milliseconds should not be decided case by case, otherwise
every call point will decide to opt out from the harmful warnings.


Re: [PATCH net] net: sonic: remove dev_kfree_skb before return NETDEV_TX_BUSY

2019-09-04 Thread Eric Dumazet



On 9/4/19 11:42 AM, Mao Wenan wrote:
> When dma_map_single() fails to map a buffer, the skb can't be freed
> before the sonic driver returns to the stack with NETDEV_TX_BUSY, because
> this skb may be requeued to the qdisc, which might trigger a use-after-free.
> 
> Fixes: d9fb9f384292 ("*sonic/natsemi/ns83829: Move the National 
> Semi-conductor drivers")
> Signed-off-by: Mao Wenan 
> ---
>  drivers/net/ethernet/natsemi/sonic.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/natsemi/sonic.c 
> b/drivers/net/ethernet/natsemi/sonic.c
> index d0a01e8f000a..248a8f22a33b 100644
> --- a/drivers/net/ethernet/natsemi/sonic.c
> +++ b/drivers/net/ethernet/natsemi/sonic.c
> @@ -233,7 +233,6 @@ static int sonic_send_packet(struct sk_buff *skb, struct 
> net_device *dev)
>   laddr = dma_map_single(lp->device, skb->data, length, DMA_TO_DEVICE);
>   if (!laddr) {
>   printk(KERN_ERR "%s: failed to map tx DMA buffer.\n", 
> dev->name);
> - dev_kfree_skb(skb);
>   return NETDEV_TX_BUSY;
>   }
>  
> 

That is the wrong way to fix this bug.

What guarantee do we have that the mapping operation will succeed next time we 
attempt
the transmit (and the dma_map_single() operation) ?

NETDEV_TX_BUSY is very dangerous, this might trigger an infinite loop.

I would rather leave the dev_kfree_skb(skb), and return NETDEV_TX_OK

Also the printk(KERN_ERR ...) should be replaced by pr_err_ratelimited(...)

NETDEV_TX_BUSY really should only be used by drivers that call 
netif_tx_stop_queue()
at the wrong moment.
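
That is, something along these lines (untested sketch of the suggestions
above):

	laddr = dma_map_single(lp->device, skb->data, length, DMA_TO_DEVICE);
	if (!laddr) {
		pr_err_ratelimited("%s: failed to map tx DMA buffer.\n",
				   dev->name);
		dev_kfree_skb_any(skb);	/* drop the packet */
		return NETDEV_TX_OK;	/* do not let the qdisc requeue it */
	}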


Re: [PATCH] net: sched: taprio: Fix potential integer overflow in taprio_set_picos_per_byte

2019-09-03 Thread Eric Dumazet



On 9/3/19 3:08 AM, Gustavo A. R. Silva wrote:
> Add suffix LL to constant 1000 in order to avoid a potential integer
> overflow and give the compiler complete information about the proper
> arithmetic to use. Notice that this constant is being used in a context
> that expects an expression of type s64, but it's currently evaluated
> using 32-bit arithmetic.
> 
> Addresses-Coverity-ID: 1453459 ("Unintentional integer overflow")
> Fixes: f04b514c0ce2 ("taprio: Set default link speed to 10 Mbps in 
> taprio_set_picos_per_byte")
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  net/sched/sch_taprio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 8d8bc2ec5cd6..956f837436ea 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -966,7 +966,7 @@ static void taprio_set_picos_per_byte(struct net_device 
> *dev,
>  
>  skip:
>   picos_per_byte = div64_s64(NSEC_PER_SEC * 1000LL * 8,
> -speed * 1000 * 1000);
> +speed * 1000LL * 1000);
>  
>   atomic64_set(&q->picos_per_byte, picos_per_byte);
>   netdev_dbg(dev, "taprio: set %s's picos_per_byte to: %lld, linkspeed: 
> %d\n",
> 

But why even multiply by 1,000,000 in the first place? This seems silly;
a standard 32 bit divide could be used instead.

->

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 
8d8bc2ec5cd6281d811fd5d8a5c5211ebb0edd73..944b1af3215668e927d486b6c6c65c4599fb9539
 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -965,8 +965,7 @@ static void taprio_set_picos_per_byte(struct net_device 
*dev,
speed = ecmd.base.speed;
 
 skip:
-   picos_per_byte = div64_s64(NSEC_PER_SEC * 1000LL * 8,
-  speed * 1000 * 1000);
+   picos_per_byte = (USEC_PER_SEC * 8) / speed;
 
atomic64_set(&q->picos_per_byte, picos_per_byte);
netdev_dbg(dev, "taprio: set %s's picos_per_byte to: %lld, linkspeed: 
%d\n",
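
To spell out the arithmetic behind the suggested simplification (not in the
original mail):

	picos/byte = (8 bits * 10^12 ps/s) / (speed * 10^6 bit/s)
		   = 8 * 10^6 / speed
		   = (USEC_PER_SEC * 8) / speed

so the NSEC_PER_SEC * 1000 in the numerator cancels against the
1000 * 1000 in the denominator, and every intermediate value fits easily
in 32 bits.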





Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Eric Dumazet



On 8/30/19 5:25 PM, Qian Cai wrote:
> On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
>>
>> On 8/30/19 4:57 PM, Qian Cai wrote:
>>> When running heavy memory pressure workloads, the system is throwing
>>> endless warnings below due to the allocation could fail from
>>> __build_skb(), and the volume of this call could be huge which may
>>> generate a lot of serial console output and consumes all CPUs as
>>> warn_alloc() could be expensive by calling dump_stack() and then
>>> show_mem().
>>>
>>> Fix it by silencing the warning in this call site. Also, it seems
>>> unnecessary to even print a warning at all if the allocation failed in
>>> __build_skb(), as it may just retransmit the packet and retry.
>>>
>>
>> Same patches are showing up there and there from time to time.
>>
>> Why is this particular spot interesting, against all others not adding
>> __GFP_NOWARN ?
>>
>> Are we going to have hundred of patches adding __GFP_NOWARN at various 
>> points,
>> or should we get something generic to not flood the syslog in case of memory
>> pressure ?
>>
> 
> From my testing, which uses LTP oom* tests, there are only 3 places that need to be
> patched. The other two are in IOMMU code for both Intel and AMD. The place is
> particularly interesting because it could leave the system flooding serial
> console output for days without making progress in OOM. I suppose it ends up 
> in
> a looping condition that warn_alloc() would end up generating more calls into
> __build_skb() via ksoftirqd.
> 

Yes, but what about other tests done by other people ?

You do not really answer my last question, which was really the point I tried
to make.

If there is a risk of flooding the syslog, we should fix this generically
in mm layer, not adding hundred of __GFP_NOWARN all over the places.

Maybe just make __GFP_NOWARN the default, I dunno.


Re: [PATCH] net/skbuff: silence warnings under memory pressure

2019-08-30 Thread Eric Dumazet



On 8/30/19 4:57 PM, Qian Cai wrote:
> When running heavy memory pressure workloads, the system is throwing
> endless warnings below due to the allocation could fail from
> __build_skb(), and the volume of this call could be huge which may
> generate a lot of serial console output and consumes all CPUs as
> warn_alloc() could be expensive by calling dump_stack() and then
> show_mem().
> 
> Fix it by silencing the warning in this call site. Also, it seems
> unnecessary to even print a warning at all if the allocation failed in
> __build_skb(), as it may just retransmit the packet and retry.
> 

Same patches are showing up there and there from time to time.

Why is this particular spot interesting, against all others not adding 
__GFP_NOWARN ?

Are we going to have hundred of patches adding __GFP_NOWARN at various points,
or should we get something generic to not flood the syslog in case of memory 
pressure ?



Re: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive

2019-08-29 Thread Eric Dumazet



On 8/19/19 5:15 AM, Hayes Wang wrote:
> Fix accessing skb after napi_gro_receive which is caused by
> commit 47922fcde536 ("r8152: support skb_add_rx_frag").
> 
> Fixes: 47922fcde536 ("r8152: support skb_add_rx_frag")
> Signed-off-by: Hayes Wang 
> ---

It is customary to add a tag to credit the reporter...

Something like :

Reported-by: 

Thanks.


Re: [v1] net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate

2019-08-29 Thread Eric Dumazet



On 8/29/19 12:51 AM, David Dai wrote:
> For high speed adapters like the Mellanox CX-5 card, it can reach up to
> 100 Gbits per second bandwidth. Currently htb already supports 64bit rate
> in tc utility. However police action rate and peakrate are still limited
> to 32bit values (up to 32 Gbits per second). Add 2 new attributes
> TCA_POLICE_RATE64 and TCA_POLICE_PEAKRATE64 in kernel for 64bit support
> so that tc utility can use them for 64bit rate and peakrate value to
> break the 32bit limit, and still keep the backward binary compatibility.
> 
> Tested-by: David Dai 
> Signed-off-by: David Dai 
> ---
>  include/uapi/linux/pkt_cls.h |2 ++
>  net/sched/act_police.c   |   27 +++++++++++++++++++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
> index b057aee..eb4ea4d 100644
> --- a/include/uapi/linux/pkt_cls.h
> +++ b/include/uapi/linux/pkt_cls.h
> @@ -159,6 +159,8 @@ enum {
>   TCA_POLICE_AVRATE,
>   TCA_POLICE_RESULT,
>   TCA_POLICE_TM,
> + TCA_POLICE_RATE64,
> + TCA_POLICE_PEAKRATE64,
>   TCA_POLICE_PAD,
>   __TCA_POLICE_MAX
>  #define TCA_POLICE_RESULT TCA_POLICE_RESULT

Never insert new attributes, as this breaks compatibility with old binaries 
(including
old kernels)

Keep TCA_POLICE_PAD value the same, thanks.
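
In other words, new attributes must be appended after the existing ones so
that every old value, including TCA_POLICE_PAD, keeps its number. A sketch
of the expected layout:

enum {
	TCA_POLICE_UNSPEC,
	TCA_POLICE_TBF,
	TCA_POLICE_RATE,
	TCA_POLICE_PEAKRATE,
	TCA_POLICE_AVRATE,
	TCA_POLICE_RESULT,
	TCA_POLICE_TM,
	TCA_POLICE_PAD,		/* keeps its existing value */
	TCA_POLICE_RATE64,	/* new attributes go at the end */
	TCA_POLICE_PEAKRATE64,
	__TCA_POLICE_MAX
#define TCA_POLICE_RESULT TCA_POLICE_RESULT
};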


Re: [PATCH] ipv6: Not to probe neighbourless routes

2019-08-27 Thread Eric Dumazet



On 8/27/19 11:08 AM, Yi Wang wrote:
> From: Cheng Lin 
> 
> Originally, Router Reachability Probing require a neighbour entry
> existed. Commit 2152caea7196 ("ipv6: Do not depend on rt->n in
> rt6_probe().") removed the requirement for a neighbour entry. And
> commit f547fac624be ("ipv6: rate-limit probes for neighbourless
> routes") adds rate-limiting for neighbourless routes.
> 
> And, the Neighbor Discovery for IP version 6 (IPv6)(rfc4861) says,
> "
> 7.2.5.  Receipt of Neighbor Advertisements
> 
> When a valid Neighbor Advertisement is received (either solicited or
> unsolicited), the Neighbor Cache is searched for the target's entry.
> If no entry exists, the advertisement SHOULD be silently discarded.
> There is no need to create an entry if none exists, since the
> recipient has apparently not initiated any communication with the
> target.
> ".
> 
> In rt6_probe(), just a Neighbor Solicitation message is transmitted.
> When receiving a Neighbor Advertisement, the node does nothing in a
> Neighborless condition.
> 
> Not sure it's needed to create a neighbor entry in Router
> Reachability Probing. And the original way may be the right way.
> 
> This patch restores the requirement for a neighbour entry.
> 
> Signed-off-by: Cheng Lin 
> ---
>  include/net/ip6_fib.h | 5 -----
>  net/ipv6/route.c  | 5 +----
>  2 files changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index 4b5656c..8c2e022 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -124,11 +124,6 @@ struct rt6_exception {
>  
>  struct fib6_nh {
>   struct fib_nh_commonnh_common;
> -
> -#ifdef CONFIG_IPV6_ROUTER_PREF
> - unsigned long   last_probe;
> -#endif
> -
>   struct rt6_info * __percpu *rt6i_pcpu;
>   struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
>  };
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index fd059e0..c4bcffc 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -639,12 +639,12 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
>   nh_gw = &fib6_nh->fib_nh_gw6;
>   dev = fib6_nh->fib_nh_dev;
>   rcu_read_lock_bh();
> - idev = __in6_dev_get(dev);
>   neigh = __ipv6_neigh_lookup_noref(dev, nh_gw);
>   if (neigh) {
>   if (neigh->nud_state & NUD_VALID)
>   goto out;
>  
> + idev = __in6_dev_get(dev);
>   write_lock(&neigh->lock);
>   if (!(neigh->nud_state & NUD_VALID) &&
>   time_after(jiffies,
> @@ -654,9 +654,6 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
>   __neigh_set_probe_once(neigh);
>   }
>   write_unlock(&neigh->lock);
> - } else if (time_after(jiffies, fib6_nh->last_probe +
> -idev->cnf.rtr_probe_interval)) {
> - work = kmalloc(sizeof(*work), GFP_ATOMIC);
>   }
>  
>   if (work) {
> 

Have you really compiled this patch ?




Re: [PATCH] net: Adding parameter detection in __ethtool_get_link_ksettings.

2019-08-26 Thread Eric Dumazet



On 8/26/19 9:23 AM, Dongxu Liu wrote:
> The __ethtool_get_link_ksettings symbol will be exported,
> and external users may use an illegal address.
> We should check the parameters before using them,
> otherwise the system will crash.
> 
> [ 8980.991134] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [ 8980.993049] IP: [] 
> __ethtool_get_link_ksettings+0x27/0x140
> [ 8980.994285] PGD 0
> [ 8980.995013] Oops:  [#1] SMP
> [ 8980.995896] Modules linked in: sch_ingress ...
> [ 8981.013220] CPU: 3 PID: 25174 Comm: kworker/3:3 Tainted: G   O   
> V---   3.10.0-327.36.58.4.x86_64 #1
> [ 8981.017667] Workqueue: events linkwatch_event
> [ 8981.018652] task: 8800a8348000 ti: 8800b045c000 task.ti: 
> 8800b045c000
> [ 8981.020418] RIP: 0010:[]  [] 
> __ethtool_get_link_ksettings+0x27/0x140
> [ 8981.022383] RSP: 0018:8800b045fc88  EFLAGS: 00010202
> [ 8981.023453] RAX:  RBX: 8800b045fcac RCX: 
> 
> [ 8981.024726] RDX: 8800b658f600 RSI: 8800b045fcac RDI: 
> 8802296e
> [ 8981.026000] RBP: 8800b045fc98 R08:  R09: 
> 0001
> [ 8981.027273] R10: 73e0 R11: 082b0cc8adea R12: 
> 8802296e
> [ 8981.028561] R13: 8800b566e8c0 R14: 8800b658f600 R15: 
> 8800b566e000
> [ 8981.029841] FS:  () GS:88023ed8() 
> knlGS:
> [ 8981.031715] CS:  0010 DS:  ES:  CR0: 80050033
> [ 8981.032845] CR2:  CR3: b39a9000 CR4: 
> 003407e0
> [ 8981.034137] DR0:  DR1:  DR2: 
> 
> [ 8981.035427] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 8981.036702] Stack:
> [ 8981.037406]  8800b658f600 9c40 8800b045fce8 
> a047a71d
> [ 8981.039238]  004d 8800b045fcc8 8800b045fd28 
> 815cb198
> [ 8981.041070]  8800b045fcd8 810807e6 e8212951 
> 0001
> [ 8981.042910] Call Trace:
> [ 8981.043660]  [] bond_update_speed_duplex+0x3d/0x90 
> [bonding]
> [ 8981.045424]  [] ? inetdev_event+0x38/0x530
> [ 8981.046554]  [] ? put_online_cpus+0x56/0x80
> [ 8981.047688]  [] bond_netdev_event+0x137/0x360 [bonding]
> ...
> 
> Signed-off-by: Dongxu Liu 
> ---
>  net/core/ethtool.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 6288e69..9a50b64 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -545,6 +545,8 @@ int __ethtool_get_link_ksettings(struct net_device *dev,
>  {
>   ASSERT_RTNL();
>  
> + if (!dev || !dev->ethtool_ops)
> + return -EOPNOTSUPP;

I do not believe dev can possibly be NULL at this point.

>   if (!dev->ethtool_ops->get_link_ksettings)
>   return -EOPNOTSUPP;
>  
> 

I tried to find an appropriate Fixes: tag.

It seems this particular bug was added either by

Fixes: 9856909c2abb ("net: bonding: use __ethtool_get_ksettings")

or generically in :

Fixes: 3f1ac7a700d0 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
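
For illustration, the robust caller-side pattern looks roughly like the
sketch below (a hypothetical helper, not the actual bonding fix): hold
RTNL, call the helper, and rely on its error return instead of poking at
dev->ethtool_ops directly.

/* Hedged sketch: a caller that tolerates drivers without
 * get_link_ksettings by checking the error return.
 */
static int sketch_get_speed(struct net_device *dev, u32 *speed)
{
        struct ethtool_link_ksettings ecmd;
        int err;

        ASSERT_RTNL();

        err = __ethtool_get_link_ksettings(dev, &ecmd);
        if (err < 0)
                return err;

        *speed = ecmd.base.speed;
        return 0;
}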



Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-17 Thread Eric Dumazet



On 8/17/19 10:24 AM, Sander Eikelenboom wrote:
> On 12/08/2019 19:56, Eric Dumazet wrote:
>>
>>
>> On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
>>> L.S.,
>>>
>>> While testing a somewhere-after-5.3-rc3 kernel (which included the latest 
>>> net merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
>>> one of my Xen VMs (which gets quite some network load) crashed.
>>> See below for the stacktrace.
>>>
>>> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to 
>>> be an option at the moment. 
>>> I haven't encountered this on 5.2, so it seems to be a regression against 
>>> 5.2.
>>>
>>> Any ideas ?
>>>
>>> --
>>> Sander
>>>
>>>
>>> [16930.653595] general protection fault:  [#1] SMP NOPTI
>>> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
>>> 5.3.0-rc3-20190809-doflr+ #1
>>> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
>>> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 
>>> <8b> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
>>> [16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
>>> [16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
>>> 801b
>>
>> crash in " mov0x20(%rax),%eax"   and RAX=fffe888005bf62c0 (not a valid 
>> kernel address)
>>
>> Looks like one-bit corruption maybe.
>>
>> Nothing comes to mind really between 5.2 and 5.3 that could explain this.
>>
>>> [16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 
>>> 888016b00880
>>> [16930.653819] RBP: 888016b00880 R08: 0001 R09: 
>>> 
>>> [16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 
>>> 05a0
>>> [16930.653875] R13: 0001 R14: bfe62d46 R15: 
>>> 0004
>>> [16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
>>> knlGS:
>>> [16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
>>> [16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
>>> 06f0
>>> [16930.653993] Call Trace:
>>> [16930.654005]  
>>> [16930.654018]  tcp_ack+0xbb0/0x1230
>>> [16930.654033]  tcp_rcv_established+0x2e8/0x630
>>> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
>>> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
>>> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
>>> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
>>> [16930.654128]  ip_local_deliver+0x4d/0xe0
>>> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
>>> [16930.654163]  ip_rcv+0x4c/0xd0
>>> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
>>> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
>>> [16930.654219]  napi_gro_receive+0xe7/0x140
>>> [16930.654237]  xennet_poll+0x9be/0xae0
>>> [16930.654254]  net_rx_action+0x136/0x340
>>> [16930.654271]  __do_softirq+0xdd/0x2cf
>>> [16930.654287]  irq_exit+0x7a/0xa0
>>> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
>>> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
>>> [16930.654339]  
>>> [16930.654349] RIP: 0033:0x55de0d87db99
>>> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 
>>> f4 eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a 
>>> <75> 25 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
>>> [16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
>>> ff0c
>>> [16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 
>>> 007f
>>> [16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 
>>> 0002
>>> [16930.655062] RBP: 7fff R08: 80ea R09: 
>>> 01f0
>>> [16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 
>>> 55de0f3e0f2a
>>> [16930.655116] R13: 0010 R14: 7f16 R15: 
>>> 0080
>>> [16930.655144] Modules linked in:
>>> [16930.655200] ---[ end trace 533367c95501b645 ]---
>>> [16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
>>> [16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 
>>> fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03

Re: [PATCH net-next] r8152: divide the tx and rx bottom functions

2019-08-16 Thread Eric Dumazet



On 8/16/19 11:08 AM, Hayes Wang wrote:
> Eric Dumazet [mailto:eric.duma...@gmail.com]
>> Sent: Friday, August 16, 2019 4:20 PM
> [...]
>> Which callback ?
> 
> The USB device has two endpoints for Tx and Rx.
> If I submit tx or rx URB to the USB host controller,
> the relative callback functions would be called, when
> they are finished. For rx, it is read_bulk_callback.
> For tx, it is write_bulk_callback.
> 
>> After an idle period (no activity, no prior packets being tx-completed ...),
>> a packet is sent by the upper stack, enters the ndo_start_xmit() of a network
>> driver.
>>
>> This driver ndo_start_xmit() simply adds an skb to a local list, and returns.
> 
> Base on the current method (without tasklet), when
> ndo_start_xmit() is called, napi_schedule is called only
> if there is at least one free buffer (!list_empty(>tx_free))
> to transmit the packet. Then, the flow would be as following.

Very uncommon naming conventions really :/


Maybe you would avoid messing with a tasklet (we really try to get rid
of tasklets in general) by using two NAPI, one for TX, one for RX.

Some drivers already use two NAPI, it is fine.

This might avoid the ugly dance in r8152_poll(),
calling napi_schedule(napi) after napi_complete_done() !
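
For illustration, a minimal sketch of such a two-NAPI layout; the struct
and the rx/tx helpers below are hypothetical placeholders, not the actual
r8152 code.

/* Hedged sketch of splitting RX and TX into two NAPI contexts, so
 * that each can be scheduled and polled independently.
 */
struct r8152_sketch {
        struct net_device *netdev;
        struct napi_struct rx_napi;
        struct napi_struct tx_napi;
};

static int sketch_rx_poll(struct napi_struct *napi, int budget)
{
        struct r8152_sketch *tp = container_of(napi, struct r8152_sketch,
                                               rx_napi);
        int work_done = sketch_rx_bottom(tp, budget);   /* hypothetical */

        if (work_done < budget)
                napi_complete_done(napi, work_done);
        return work_done;
}

static int sketch_tx_poll(struct napi_struct *napi, int budget)
{
        struct r8152_sketch *tp = container_of(napi, struct r8152_sketch,
                                               tx_napi);

        sketch_tx_bottom(tp);                           /* hypothetical */
        napi_complete(napi);
        return 0;
}

static void sketch_napi_init(struct r8152_sketch *tp)
{
        netif_napi_add(tp->netdev, &tp->rx_napi, sketch_rx_poll,
                       NAPI_POLL_WEIGHT);
        netif_napi_add(tp->netdev, &tp->tx_napi, sketch_tx_poll,
                       NAPI_POLL_WEIGHT);
}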



Re: [PATCH net-next] r8152: divide the tx and rx bottom functions

2019-08-16 Thread Eric Dumazet



On 8/16/19 10:10 AM, Hayes Wang wrote:
> Eric Dumazet [mailto:eric.duma...@gmail.com]
>> Sent: Friday, August 16, 2019 2:40 PM
> [...]
>> tasklet and NAPI are scheduled on the same core (the current
>> cpu calling napi_schedule() or tasklet_schedule())
>>
>> I would rather not add this dubious tasklet, and instead try to understand
>> what is wrong in this driver ;)
>>
>> The various napi_schedule() calls are suspect IMO.
> 
> The original method as following.
> 
> static int r8152_poll(struct napi_struct *napi, int budget)
> {
>   struct r8152 *tp = container_of(napi, struct r8152, napi);
>   int work_done;
> 
>   work_done = rx_bottom(tp, budget); <-- RX
>   bottom_half(tp); <-- Tx (tx_bottom)
>   [...]
> 
> The rx_bottom and tx_bottom would only be called in r8152_poll.
> That is, tx_bottom wouldn't be run unless rx_bottom is finished.
> And, rx_bottom wouldn't be called while tx_bottom is running.
> 
> If the traffic is busy, rx_bottom or tx_bottom may take a lot
> of time to deal with the packets, and each one would increase
> the latency for the other.
> 
> Therefore, when I separate tx_bottom and rx_bottom into a
> different tasklet and napi, the callback functions of tx and
> rx may schedule the tasklet and napi on different cpus. Then
> rx_bottom and tx_bottom may run at the same time.


Your patch makes absolutely no guarantee of doing what you
want, I am afraid.

> 
> Take our arm platform for example. There are five cpus to
> handle the interrupt of USB host controller. When the rx is
> completed, cpu #1 may handle the interrupt and napi would
> be scheduled. When the tx is finished, cpu #2 may handle
> the interrupt and the tasklet is scheduled. Then, napi is
> run on cpu #1, and tasklet is run on cpu #2.
> 
>> Also rtl8152_start_xmit() uses skb_queue_tail(&tp->tx_queue, skb);
>>
>> But I see nothing really kicking the transmit if tx_free is empty ?
> 
> Tx callback function "write_bulk_callback" would deal with it.
> The callback function would check if there are packets waiting
> to be sent.

Which callback ?

After an idle period (no activity, no prior packets being tx-completed ...),
a packet is sent by the upper stack, enters the ndo_start_xmit() of a network 
driver.

This driver ndo_start_xmit() simply adds an skb to a local list, and returns.

Where/how is scheduled this callback ?

Some kind of timer ?
An (unrelated) incoming packet ?

> 
> 
> Best Regards,
> Hayes
> 
> 


Re: [PATCH net-next v2 4/5] r8152: support skb_add_rx_frag

2019-08-16 Thread Eric Dumazet



On 8/13/19 5:42 AM, Hayes Wang wrote:
> Use skb_add_rx_frag() to reduce the memory copy for rx data.
> 
> Use a new list of rx_used to store the rx buffer which couldn't be
> reused yet.
> 
> Besides, the total number of rx buffers may be increased or decreased
> dynamically. And it is limited by RTL8152_MAX_RX_AGG.
> 
> Signed-off-by: Hayes Wang 
>

...

>   skb->protocol = eth_type_trans(skb, netdev);
>   rtl_rx_vlan_tag(rx_desc, skb);
>   if (work_done < budget) {
>   napi_gro_receive(napi, skb);
>   work_done++;
>   stats->rx_packets++;
> - stats->rx_bytes += pkt_len;
> + stats->rx_bytes += skb->len;

use-after-free. skb is no longer in your hands after napi_gro_receive()
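
A minimal sketch of the corrected ordering, assuming the surrounding code
from the patch above: capture the length before the skb is handed to the
stack.

                /* read the length before napi_gro_receive(): after
                 * that call the skb belongs to the stack and must not
                 * be touched by the driver
                 */
                unsigned int len = skb->len;

                napi_gro_receive(napi, skb);
                work_done++;
                stats->rx_packets++;
                stats->rx_bytes += len;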



Re: [PATCH net-next] r8152: divide the tx and rx bottom functions

2019-08-16 Thread Eric Dumazet



On 8/14/19 10:30 AM, Hayes Wang wrote:
> Move the tx bottom function from NAPI to a new tasklet. Then, for
> multi-core systems, the bottom functions of tx and rx may run at the
> same time on different cores. This is used to improve performance.
> 
>  

tasklet and NAPI are scheduled on the same core (the current
cpu calling napi_schedule() or tasklet_schedule())

I would rather not add this dubious tasklet, and instead try to understand
what is wrong in this driver ;)

The various napi_schedule() calls are suspect IMO.

Also rtl8152_start_xmit() uses skb_queue_tail(&tp->tx_queue, skb);

But I see nothing really kicking the transmit if tx_free is empty ?

tx_bottom() is only called from bottom_half() (called from r8152_poll())
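
For reference, the usual way a driver guarantees that a queued skb is
eventually transmitted is to kick its poll context from ndo_start_xmit();
a hedged sketch, not the actual r8152 code:

static netdev_tx_t sketch_start_xmit(struct sk_buff *skb,
                                     struct net_device *netdev)
{
        struct r8152 *tp = netdev_priv(netdev);

        skb_queue_tail(&tp->tx_queue, skb);
        /* schedule the poll so tx_bottom() eventually runs even when
         * no tx completion is in flight to kick it
         */
        napi_schedule(&tp->napi);
        return NETDEV_TX_OK;
}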




Re: KMSAN: uninit-value in nh_valid_get_del_req

2019-08-13 Thread Eric Dumazet



On 8/13/19 3:48 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    61ccdad1 Revert "drm/bochs: Use shadow buffer for bochs fr..
> git tree:   https://github.com/google/kmsan.git master
> console output: https://syzkaller.appspot.com/x/log.txt?x=14c120e260
> kernel config:  https://syzkaller.appspot.com/x/.config?x=27abc558ecb16a3b
> dashboard link: https://syzkaller.appspot.com/bug?extid=86ec9d8c02c07571873c
> compiler:   clang version 9.0.0 (/home/glider/llvm/clang 
> 80fee25776c2fb61e74c1ecb1a523375c2500b69)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15ed6c4a60
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11de024a60
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+86ec9d8c02c075718...@syzkaller.appspotmail.com
> 
> ==
> BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 
> net/ipv4/nexthop.c:1510
> CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x191/0x1f0 lib/dump_stack.c:113
>  kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
>  __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
>  nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
>  rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543
>  rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223
>  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
>  rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241
>  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
>  netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328
>  netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917
>  sock_sendmsg_nosec net/socket.c:637 [inline]
>  sock_sendmsg net/socket.c:657 [inline]
>  ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
>  __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413
>  __do_sys_sendmmsg net/socket.c:2442 [inline]
>  __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439
>  __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439
>  do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297
>  entry_SYSCALL_64_after_hwframe+0x63/0xe7
> RIP: 0033:0x440259
> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7fff15f10d08 EFLAGS: 0246 ORIG_RAX: 0133
> RAX: ffda RBX: 004002c8 RCX: 00440259
> RDX: 0492492492492805 RSI: 2140 RDI: 0003
> RBP: 006ca018 R08:  R09: 004002c8
> R10:  R11: 0246 R12: 00401ae0
> R13: 00401b70 R14:  R15: 
> 
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:187 [inline]
>  kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:146
>  kmsan_slab_alloc+0xaa/0x120 mm/kmsan/kmsan_hooks.c:175
>  slab_alloc_node mm/slub.c:2790 [inline]
>  __kmalloc_node_track_caller+0xb55/0x1320 mm/slub.c:4388
>  __kmalloc_reserve net/core/skbuff.c:141 [inline]
>  __alloc_skb+0x306/0xa10 net/core/skbuff.c:209
>  alloc_skb include/linux/skbuff.h:1056 [inline]
>  netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
>  netlink_sendmsg+0x783/0x1330 net/netlink/af_netlink.c:1892
>  sock_sendmsg_nosec net/socket.c:637 [inline]
>  sock_sendmsg net/socket.c:657 [inline]
>  ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
>  __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413
>  __do_sys_sendmmsg net/socket.c:2442 [inline]
>  __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439
>  __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439
>  do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297
>  entry_SYSCALL_64_after_hwframe+0x63/0xe7
> ==
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

Fix is under review

https://patchwork.ozlabs.org/patch/1145879/



Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0

2019-08-12 Thread Eric Dumazet



On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
> L.S.,
> 
> While testing a somewhere-after-5.3-rc3 kernel (which included the latest net 
> merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
> one of my Xen VMs (which gets quite some network load) crashed.
> See below for the stacktrace.
> 
> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to be 
> an option at the moment. 
> I haven't encountered this on 5.2, so it seems to be a regression against 
> 5.2.
> 
> Any ideas ?
> 
> --
> Sander
> 
> 
> [16930.653595] general protection fault:  [#1] SMP NOPTI
> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 
> 5.3.0-rc3-20190809-doflr+ #1
> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 
> 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 
> 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
> [16930.653741] RSP: :c9003ad8 EFLAGS: 00010286
> [16930.653762] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
> 801b

crash in " mov0x20(%rax),%eax"   and RAX=fffe888005bf62c0 (not a valid 
kernel address)

Looks like one-bit corruption maybe.

Nothing comes to mind really between 5.2 and 5.3 that could explain this.

> [16930.653791] RDX: 05a0 RSI: 8880115fb800 RDI: 
> 888016b00880
> [16930.653819] RBP: 888016b00880 R08: 0001 R09: 
> 
> [16930.653848] R10: 88800ae00800 R11: bfe632e6 R12: 
> 05a0
> [16930.653875] R13: 0001 R14: bfe62d46 R15: 
> 0004
> [16930.653913] FS:  7fe71fe2cb80() GS:88801f20() 
> knlGS:
> [16930.653943] CS:  0010 DS:  ES:  CR0: 80050033
> [16930.653965] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
> 06f0
> [16930.653993] Call Trace:
> [16930.654005]  
> [16930.654018]  tcp_ack+0xbb0/0x1230
> [16930.654033]  tcp_rcv_established+0x2e8/0x630
> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
> [16930.654128]  ip_local_deliver+0x4d/0xe0
> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
> [16930.654163]  ip_rcv+0x4c/0xd0
> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
> [16930.654219]  napi_gro_receive+0xe7/0x140
> [16930.654237]  xennet_poll+0x9be/0xae0
> [16930.654254]  net_rx_action+0x136/0x340
> [16930.654271]  __do_softirq+0xdd/0x2cf
> [16930.654287]  irq_exit+0x7a/0xa0
> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
> [16930.654339]  
> [16930.654349] RIP: 0033:0x55de0d87db99
> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 f4 
> eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a <75> 25 
> 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
> [16930.654432] RSP: 002b:7ffd5531eec8 EFLAGS: 0a87 ORIG_RAX: 
> ff0c
> [16930.655004] RAX: 0002 RBX: 55de0f3e8e50 RCX: 
> 007f
> [16930.655034] RDX: 55de0f3dc2d2 RSI: 3492 RDI: 
> 0002
> [16930.655062] RBP: 7fff R08: 80ea R09: 
> 01f0
> [16930.655089] R10: 55de0f3d8e40 R11: 0094 R12: 
> 55de0f3e0f2a
> [16930.655116] R13: 0010 R14: 7f16 R15: 
> 0080
> [16930.655144] Modules linked in:
> [16930.655200] ---[ end trace 533367c95501b645 ]---
> [16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
> [16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 
> 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 
> 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
> [16930.655312] RSP: :c9003ad8 EFLAGS: 00010286
> [16930.655331] RAX: fffe888005bf62c0 RBX: 8880115fb800 RCX: 
> 801b
> [16930.655360] RDX: 05a0 RSI: 8880115fb800 RDI: 
> 888016b00880
> [16930.655387] RBP: 888016b00880 R08: 0001 R09: 
> 
> [16930.655414] R10: 88800ae00800 R11: bfe632e6 R12: 
> 05a0
> [16930.655441] R13: 0001 R14: bfe62d46 R15: 
> 0004
> [16930.655475] FS:  7fe71fe2cb80() GS:88801f20() 
> knlGS:
> [16930.655502] CS:  0010 DS:  ES:  CR0: 80050033
> [16930.655525] CR2: 55de0f3e7000 CR3: 11f32000 CR4: 
> 06f0
> [16930.63] Kernel panic - not syncing: Fatal exception in interrupt
> [16930.655789] Kernel Offset: disabled
> 


Re: [PATCH v3] net: sched: Fix a possible null-pointer dereference in dequeue_func()

2019-08-05 Thread Eric Dumazet



On 7/29/19 7:23 PM, Cong Wang wrote:
> On Mon, Jul 29, 2019 at 1:24 AM Jia-Ju Bai  wrote:
>>
>> In dequeue_func(), there is an if statement on line 74 to check whether
>> skb is NULL:
>> if (skb)
>>
>> When skb is NULL, it is used on line 77:
>> prefetch(&skb->end);
>>
>> Thus, a possible null-pointer dereference may occur.
>>
>> To fix this bug, skb->end is used when skb is not NULL.
>>
>> This bug is found by a static analysis tool STCheck written by us.
> 
> It doesn't dereference the pointer, it simply calculates the address
> and passes it to gcc builtin prefetch. Both are fine when it is NULL,
> as prefetching a NULL address should be okay for kernel.
> 
> So your changelog is misleading and it doesn't fix any bug,
> although it does very slightly make the code better.
> 

Original code was intentional.

A prefetch() needs to be done as early as possible.

At the time of commit 76e3cc126bb223013a6b9a0e2a51238d1ef2e409
this was pretty clear :

+static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
+{
+   struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+   prefetch(&skb->end); /* we'll need skb_shinfo() */
+   return skb;
+}


Really, static analysis should learn that prefetch(X) is legal for _any_
value of X.
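
Spelled out as a sketch: &skb->end is pure address arithmetic, and
prefetch() emits a non-faulting hint instruction, so any skb value,
including NULL, is harmless.

static struct sk_buff *dequeue_sketch(struct Qdisc *sch)
{
        struct sk_buff *skb = __skb_dequeue(&sch->q);

        /* no dereference happens here: &skb->end merely adds
         * offsetof(struct sk_buff, end) to the pointer, and
         * prefetch() compiles to a hint (e.g. prefetcht0 on x86)
         * that does not fault on a bad address
         */
        prefetch(&skb->end);
        return skb;
}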


Re: [PATCH] Revert "net: get rid of an signed integer overflow in ip_idents_reserve()"

2019-07-26 Thread Eric Dumazet



On 7/26/19 11:17 AM, Shaokun Zhang wrote:
> From: Yang Guo 
> 
> There is a significant performance regression with the following
> commit-id 
> ("net: get rid of an signed integer overflow in ip_idents_reserve()").
> 
>

So, you jump around and took ownership of this issue, while some of us
are already working on it ?

Have you first checked that current UBSAN versions will not complain anymore ?

A revert adding back the original issue would be silly, performance of
benchmarks is nice but secondary.



Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-25 Thread Eric Dumazet



On 7/25/19 6:29 AM, maowenan wrote:
> 

> Syzkaller reproducer():
> r0 = socket$packet(0x11, 0x3, 0x300)
> r1 = socket$inet_tcp(0x2, 0x1, 0x0)
> bind$inet(r1, &(0x7f000300)={0x2, 0x4e21, @multicast1}, 0x10)
> connect$inet(r1, &(0x7f000140)={0x2, 0x104e21, @loopback}, 0x10)
> recvmmsg(r1, &(0x7f001e40)=[{{0x0, 0x0, 
> &(0x7f000100)=[{&(0x7f0005c0)=""/88, 0x58}], 0x1}}], 0x1, 
> 0x4000, 0x0)
> sendto$inet(r1, &(0x7f00)="e2f7ad5b661c761edf", 0x9, 0x8080, 0x0, 
> 0x0)
> r2 = fcntl$dupfd(r1, 0x0, r0)
> connect$unix(r2, &(0x7f0001c0)=@file={0x0, './file0\x00'}, 0x6e)
>
>>
>> It does call tcp_disconnect(), via one of the connect() calls.
> 
> yes, __inet_stream_connect will call tcp_disconnect when sa_family == 
> AF_UNSPEC. In the C repro, if it
> passed sa_family as AF_INET it wouldn't call disconnect, and then sk_send_head 
> wouldn't be NULL when tcp_connect runs.
> 


Look again at the Syzkaller reproducer()

It definitely uses tcp_disconnect()

Do not be fooled by connect$unix(), this is a connect() call really, with 
AF_UNSPEC
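
From user space, such a disconnect is simply this (a minimal sketch):

#include <string.h>
#include <sys/socket.h>

/* A connect() whose sa_family is AF_UNSPEC reaches tcp_disconnect()
 * in the kernel, whatever sockaddr type the caller happened to build
 * around it.
 */
static int tcp_disconnect_fd(int fd)
{
        struct sockaddr sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_family = AF_UNSPEC;
        return connect(fd, &sa, sizeof(sa));
}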


Re: Reminder: 99 open syzbot bugs in net subsystem

2019-07-24 Thread Eric Dumazet



On 7/24/19 11:09 PM, Eric Biggers wrote:
> On Wed, Jul 24, 2019 at 01:09:28PM -0700, David Miller wrote:
>> From: Eric Biggers 
>> Date: Wed, 24 Jul 2019 11:37:12 -0700
>>
>>> We can argue about what words to use to describe this situation, but
>>> it doesn't change the situation itself.
>>
>> And we should argue about those words because it matters to humans and
>> affects how they feel, and humans ultimately fix these bugs.
>>
>> So please stop with the hyperbole.
>>
>> Thank you.
> 
> Okay, there are 151 bugs that syzbot saw on the mainline Linux kernel in the
> last 7 days (90.1% with reproducers).  Of those, 59 were reported over 3 
> months
> ago (89.8% with reproducers).  Of those, 12 were reported over a year ago 
> (83.3%
> with reproducers).
> 
> No opinion on whether those are small/medium/large numbers, in case it would
> hurt someone's feelings.
> 
> These numbers do *not* include bugs that are still valid but weren't seen on
> mainline in last 7 days, e.g.:
> 
> - Bugs that are seen only rarely, so by chance weren't seen in last 7 days.
> - Bugs only in linux-next and/or subsystem branches.
> - Bugs that were seen in mainline more than 7 days ago, and then only on
>   linux-next or subsystem branch in last 7 days.
> - Bugs that stopped being seen due to a change in syzkaller.
> - Bugs that stopped being seen due to a change in kernel config.
> - Bugs that stopped being seen due to other environment changes such as kernel
>   command line parameters.
> - Bugs that stopped being seen due to a kernel change that hid the bug but
>   didn't actually fix it, i.e. still reachable in other ways.
> 

We do not doubt syzkaller is an incredible tool.

But netdev@ and lkml@ are mailing lists for humans to interact,
exchange ideas, send patches and review them.

To me, an issue that was reported to netdev by a real user is _way_ more 
important
than potential issues that a bot might have found doing crazy things.

We need to keep optimal S/N on mailing lists, so any bots trying to interact
with these lists must be very cautious and damn smart.

When I have time to spare and can work on syzbot reports, I am going to a web
page where I can see them and select the ones it makes sense to fix.
I hate having to set up email filters.



Re: Reminder: 99 open syzbot bugs in net subsystem

2019-07-24 Thread Eric Dumazet
On Wed, Jul 24, 2019 at 8:37 PM Eric Biggers  wrote:

> A huge number of valid open bugs are not being fixed, which is a fact.  We can
> argue about what words to use to describe this situation, but it doesn't 
> change
> the situation itself.
>
> What is your proposed solution?

syzbot sends emails, plenty of them, with many wrong bisection
results, increasing the noise.

If nobody is interested, I am not sure sending copies of them
repeatedly will be of any help.

Maybe a simple monthly reminder with one URL to go to the list of bugs
would be less intrusive.


Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-24 Thread Eric Dumazet



On 7/24/19 12:46 PM, maowenan wrote:
> 
> 
> On 2019/7/24 17:45, Eric Dumazet wrote:
>>
>>
>> On 7/24/19 11:17 AM, Mao Wenan wrote:
>>> There is one report about tcp_write_xmit use-after-free with version 
>>> 4.4.136:
>>>
>>> BUG: KASAN: use-after-free in tcp_skb_pcount include/net/tcp.h:796 [inline]
>>> BUG: KASAN: use-after-free in tcp_init_tso_segs net/ipv4/tcp_output.c:1619 
>>> [inline]
>>> BUG: KASAN: use-after-free in tcp_write_xmit+0x3fc2/0x4cb0 
>>> net/ipv4/tcp_output.c:2056
>>> Read of size 2 at addr 8801d6fc87b0 by task syz-executor408/4195
>>>
>>> CPU: 0 PID: 4195 Comm: syz-executor408 Not tainted 4.4.136-gfb7e319 #59
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>>> Google 01/01/2011
>>>   7d8f38ecc03be946 8801d73b7710 81e0edad
>>>  ea00075bf200 8801d6fc87b0  8801d6fc87b0
>>>  dc00 8801d73b7748 815159b6 8801d6fc87b0
>>> Call Trace:
>>>  [] __dump_stack lib/dump_stack.c:15 [inline]
>>>  [] dump_stack+0xc1/0x124 lib/dump_stack.c:51
>>>  [] print_address_description+0x6c/0x216 
>>> mm/kasan/report.c:252
>>>  [] kasan_report_error mm/kasan/report.c:351 [inline]
>>>  [] kasan_report.cold.7+0x175/0x2f7 mm/kasan/report.c:408
>>>  [] __asan_report_load2_noabort+0x14/0x20 
>>> mm/kasan/report.c:427
>>>  [] tcp_skb_pcount include/net/tcp.h:796 [inline]
>>>  [] tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>>>  [] tcp_write_xmit+0x3fc2/0x4cb0 
>>> net/ipv4/tcp_output.c:2056
>>>  [] __tcp_push_pending_frames+0xa0/0x290 
>>> net/ipv4/tcp_output.c:2307
>>>  [] tcp_send_fin+0x176/0xab0 net/ipv4/tcp_output.c:2883
>>>  [] tcp_close+0xca0/0xf70 net/ipv4/tcp.c:2112
>>>  [] inet_release+0xff/0x1d0 net/ipv4/af_inet.c:435
>>>  [] sock_release+0x96/0x1c0 net/socket.c:586
>>>  [] sock_close+0x16/0x20 net/socket.c:1037
>>>  [] __fput+0x235/0x6f0 fs/file_table.c:208
>>>  [] fput+0x15/0x20 fs/file_table.c:244
>>>  [] task_work_run+0x10f/0x190 kernel/task_work.c:115
>>>  [] exit_task_work include/linux/task_work.h:21 [inline]
>>>  [] do_exit+0x9e5/0x26b0 kernel/exit.c:759
>>>  [] do_group_exit+0x111/0x330 kernel/exit.c:889
>>>  [] get_signal+0x4ec/0x14b0 kernel/signal.c:2321
>>>  [] do_signal+0x8b/0x1d30 arch/x86/kernel/signal.c:712
>>>  [] exit_to_usermode_loop+0x11a/0x160 
>>> arch/x86/entry/common.c:248
>>>  [] prepare_exit_to_usermode arch/x86/entry/common.c:283 
>>> [inline]
>>>  [] syscall_return_slowpath+0x1b5/0x1f0 
>>> arch/x86/entry/common.c:348
>>>  [] int_ret_from_sys_call+0x25/0xa3
>>>
>>> Allocated by task 4194:
>>>  [] save_stack_trace+0x26/0x50 
>>> arch/x86/kernel/stacktrace.c:63
>>>  [] save_stack+0x43/0xd0 mm/kasan/kasan.c:512
>>>  [] set_track mm/kasan/kasan.c:524 [inline]
>>>  [] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:616
>>>  [] kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:554
>>>  [] slab_post_alloc_hook mm/slub.c:1349 [inline]
>>>  [] slab_alloc_node mm/slub.c:2615 [inline]
>>>  [] slab_alloc mm/slub.c:2623 [inline]
>>>  [] kmem_cache_alloc+0xbe/0x2a0 mm/slub.c:2628
>>>  [] kmem_cache_alloc_node include/linux/slab.h:350 
>>> [inline]
>>>  [] __alloc_skb+0xe6/0x600 net/core/skbuff.c:218
>>>  [] alloc_skb_fclone include/linux/skbuff.h:856 [inline]
>>>  [] sk_stream_alloc_skb+0xa3/0x5d0 net/ipv4/tcp.c:833
>>>  [] tcp_sendmsg+0xd34/0x2b00 net/ipv4/tcp.c:1178
>>>  [] inet_sendmsg+0x203/0x4d0 net/ipv4/af_inet.c:755
>>>  [] sock_sendmsg_nosec net/socket.c:625 [inline]
>>>  [] sock_sendmsg+0xcc/0x110 net/socket.c:635
>>>  [] SYSC_sendto+0x21c/0x370 net/socket.c:1665
>>>  [] SyS_sendto+0x40/0x50 net/socket.c:1633
>>>  [] entry_SYSCALL_64_fastpath+0x22/0x9e
>>>
>>> Freed by task 4194:
>>>  [] save_stack_trace+0x26/0x50 
>>> arch/x86/kernel/stacktrace.c:63
>>>  [] save_stack+0x43/0xd0 mm/kasan/kasan.c:512
>>>  [] set_track mm/kasan/kasan.c:524 [inline]
>>>  [] kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:589
>>>  [] slab_free_hook mm/slub.c:1383 [inline]
>>>  [] slab_free_freelist_hook mm/slub.c:1405 [inline]
>>>  [] slab_free mm/slub.c:2859 [inline]
>>>  [] kmem_cache_free+0xbe/0x340 mm/slub.c:2881
>>>  [] kfree_skbmem+0xcf/0x100 net/core/skbuff.c:635
>>>  [] __

Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-24 Thread Eric Dumazet



On 7/24/19 12:36 PM, maowenan wrote:
> Actually, I have tested 4.4.184, and the UAF still happens.
> 
>

Thanks for testing.

Acked-by: Eric Dumazet 




Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-24 Thread Eric Dumazet



On 7/24/19 12:01 PM, Eric Dumazet wrote:
> 
> 
> On 7/24/19 11:17 AM, Mao Wenan wrote:
>> There is one report about tcp_write_xmit use-after-free with version 4.4.136:
> 
> Current stable 4.4 is 4.4.186
> 
> Can you check the bug is still there ?
> 

BTW, I tried the C repro and another bug showed up.

It looks like 4.4.186 misses other fixes :/

[  180.811610] skbuff: skb_under_panic: text:825ec6ea len:156 put:84 
head:8837dd1f0990 data:8837dd1f098c tail:0x98 end:0xc0 dev:ip6gre0
[  180.825037] [ cut here ]
[  180.829688] kernel BUG at net/core/skbuff.c:104!
[  180.834316] invalid opcode:  [#1] SMP KASAN
[  180.839305] gsmi: Log Shutdown Reason 0x03
[  180.843426] Modules linked in: ipip bonding bridge stp llc tun veth w1_therm 
wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd ip_gre mlx4_en ib_uverbs 
mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_core
[  180.862052] CPU: 22 PID: 1619 Comm: kworker/22:1 Not tainted 4.4.186-smp-DEV 
#41
[  180.869475] Hardware name: Intel BIOS 2.56.0 10/19/2018
[  180.876463] Workqueue: ipv6_addrconf addrconf_dad_work
[  180.881658] task: 8837f1f59d80 ti: 8837eeeb8000 task.ti: 
8837eeeb8000
[  180.889171] RIP: 0010:[]  [] 
skb_panic+0x14f/0x210
[  180.897162] RSP: 0018:8837eeebf4b8  EFLAGS: 00010282
[  180.902504] RAX: 0088 RBX: 8837b600 RCX: 
[  180.909645] RDX:  RSI: 0246 RDI: 83508c00
[  180.916854] RBP: 8837eeebf520 R08: 0016 R09: 
[  180.924029] R10: 881fc8abf038 R11: 0007 R12: 881fc8abe720
[  180.931213] R13: 82aa9e80 R14: 00c0 R15: 0098
[  180.938390] FS:  () GS:8837ff28() 
knlGS:
[  180.946519] CS:  0010 DS:  ES:  CR0: 80050033
[  180.952290] CR2: 7f519426f530 CR3: 0037d37f2000 CR4: 00160670
[  180.959447] Stack:
[  180.961458]  8837dd1f098c 0098 00c0 
881fc8abe720
[  180.968909]  ea00df747c00 881fff404b40 8837ff2a1a20 
8837eeebf5b8
[  180.976371]  8837b600 825ec6ea 1106fddd7eb6 
8837b600
[  180.983848] Call Trace:
[  180.986297]  [] ? ip6gre_header+0xba/0xd50
[  180.991962]  [] skb_push+0xc1/0x100
[  180.997023]  [] ip6gre_header+0xba/0xd50
[  181.002519]  [] ? memcpy+0x36/0x40
[  181.007509]  [] ? ip6gre_changelink+0x6d0/0x6d0
[  181.013629]  [] ? ndisc_constructor+0x5b1/0x770
[  181.019728]  [] ? _raw_write_unlock_bh+0x41/0x50
[  181.025924]  [] ? __neigh_create+0xe6b/0x1670
[  181.031851]  [] neigh_connected_output+0x23f/0x480
[  181.038219]  [] ip6_finish_output2+0x74c/0x1a90
[  181.044324]  [] ? print_context_stack+0x73/0xf0
[  181.050429]  [] ? ip6_xmit+0x1700/0x1700
[  181.055933]  [] ? nf_hook_slow+0x118/0x1b0
[  181.061617]  [] ip6_finish_output+0x2ba/0x580
[  181.067546]  [] ip6_output+0x139/0x380
[  181.072884]  [] ? ip6_finish_output+0x580/0x580
[  181.079004]  [] ? ip6_fragment+0x31b0/0x31b0
[  181.084852]  [] ? dst_init+0x4b1/0x820
[  181.090172]  [] ? kasan_unpoison_shadow+0x35/0x50
[  181.096437]  [] ? kasan_unpoison_shadow+0x35/0x50
[  181.102712]  [] NF_HOOK_THRESH.constprop.22+0xca/0x180
[  181.109421]  [] ? ndisc_alloc_skb+0x340/0x340
[  181.115338]  [] ? compat_ipv6_setsockopt+0x180/0x180
[  181.121874]  [] ndisc_send_skb+0x742/0xd10
[  181.127550]  [] ? NF_HOOK_THRESH.constprop.22+0x180/0x180
[  181.134516]  [] ? skb_complete_tx_timestamp+0x280/0x280
[  181.141311]  [] ? ndisc_fill_addr_option+0x193/0x260
[  181.147844]  [] ndisc_send_rs+0x179/0x2d0
[  181.153426]  [] addrconf_dad_completed+0x41f/0x7c0
[  181.159795]  [] ? pick_next_entity+0x198/0x470
[  181.165807]  [] ? addrconf_rs_timer+0x4a0/0x4a0
[  181.171918]  [] ? find_next_bit+0x18/0x20
[  181.177504]  [] ? prandom_seed+0xd9/0x160
[  181.183095]  [] addrconf_dad_work+0x375/0x9e0
[  181.189024]  [] ? addrconf_dad_completed+0x7c0/0x7c0
[  181.195576]  [] process_one_work+0x52f/0xf60
[  181.201468]  [] worker_thread+0xdd/0xe80
[  181.206977]  [] ? __schedule+0x73a/0x16d0
[  181.212550]  [] ? process_one_work+0xf60/0xf60
[  181.218572]  [] kthread+0x205/0x2b0
[  181.223633]  [] ? kthread_worker_fn+0x4e0/0x4e0
[  181.229743]  [] ? kthread_worker_fn+0x4e0/0x4e0
[  181.235834]  [] ret_from_fork+0x3f/0x70
[  181.241232]  [] ? kthread_worker_fn+0x4e0/0x4e0



Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-24 Thread Eric Dumazet



On 7/24/19 11:17 AM, Mao Wenan wrote:
> There is one report about tcp_write_xmit use-after-free with version 4.4.136:

Current stable 4.4 is 4.4.186

Can you check the bug is still there ?

List of patches between 4.4.136 and 4.4.186 (this list is not exhaustive)

46c7b5d6f2a51c355b29118814fbfbdb79c35656 tcp: refine memory limit test in 
tcp_fragment()
f938ae0ce5ef7b693125b918509b941281afc957 tcp: enforce tcp_min_snd_mss in 
tcp_mtu_probing()
e757d052f3b8ce739d068a1e890643376c16b7a9 tcp: add tcp_min_snd_mss sysctl
ad472d3a9483abc155e1644ad740cd8c039b5170 tcp: tcp_fragment() should apply sane 
memory limits
4657ee0fe05e15ab572b157f13a82e080d4b7d73 tcp: limit payload size of sacked skbs
b6d37bba0f7a7492427d7518d4be485dcdd9d5d1 tcp: tcp_grow_window() needs to 
respect tcp_space()
68337354043a3d5207cd4f055e5a8342ec4eec0f tcp: Ensure DCTCP reacts to losses
7ed7c0386ef2a5cbe58e15af5014c9302d3593eb tcp/dccp: drop SYN packets if accept 
queue is full
191aa19ab8c1459c11a5c541801f17e01dda17de tcp: handle inet_csk_reqsk_queue_add() 
failures
a466589807a13da2b7fbb2776e01634b38a4233b tcp: clear icsk_backoff in 
tcp_write_queue_purge()
122e4a30779336643614fe0f81e1f3fcbd0a371c tcp: tcp_v4_err() should be more 
careful
ed7748bcf290ad8f80020217d832f458ac9bae7f tcp: fix NULL ref in tail loss probe
eee1af4e268e10fecb76bce42a8d7343aeb2a5e6 tcp: add tcp_ooo_try_coalesce() helper
be288481479ca8c1beba02a7e779ffeaa11f1597 tcp: call tcp_drop() from 
tcp_data_queue_ofo()
352b66932a23fb11f0a7c316361220648bca3c79 tcp: free batches of packets in 
tcp_prune_ofo_queue()
e747775172a2d4dc4dae794f248f9687ba793f54 tcp: fix a stale ooo_last_skb after a 
replace
4666b6e2b27d91e05a5b8459e40e4a05dbc1c7b0 tcp: use an RB tree for ooo receive 
queue
ec7055c62714326c56dabcf7757069ac7f276bda tcp: increment sk_drops for dropped rx 
packets
86a0a00794c21b35c72d767a98fb917b5b76b513 tcp: do not restart timewait timer on 
rst reception
81970da69122fe4bf2af5bb1bb4c7f62d4744e79 tcp: identify cryptic messages as TCP 
seq # bugs
43707aa8c55fb165a1a56f590e0defb198ebdde9 tcp: remove DELAYED ACK events in DCTCP
42962538cd9fe281a6e8602f22c7b1e218ed812a tcp: Fix missing range_truesize 
enlargement in the backport
27a0762cb570834dc44155363c118cabdd024c3c tcp: add one more quick ack after 
after ECN events
cd760ab9f4e13aedccc80f19a0b7863d5c0b3c8c tcp: refactor tcp_ecn_check_ce to 
remove sk type cast
96b792d199d17545d6a53faf44b9c91d038f1ab3 tcp: do not aggressively quick ack 
after ECN events
2b30c04bc6f9e7be2d9a5e1b504faa904154c7da tcp: add max_quickacks param to 
tcp_incr_quickack and tcp_enter_quickack_mode
e2f337e2bd4efe32051a496a7fcdd94ea67c0cfa tcp: do not force quickack when 
receiving out-of-order packets
dc6ae4dffd656811dee7151b19545e4cd839d378 tcp: detect malicious patterns in 
tcp_collapse_ofo_queue()
5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5 tcp: avoid collapses in 
tcp_prune_queue() if possible
255924ea891f647451af3acbc40a3730dcb3255e tcp: do not delay ACK in DCTCP upon CE 
status change
0b1d40e9e7738e3396ce414b1c62b911c285dfa3 tcp: do not cancel delay-AcK on DCTCP 
special ACK
17fea38e74ab24afb06970bbd9dc52db11a8034b tcp: helpers to send special DCTCP ack
500e03f463835e74c75890d56d9a7ab63755aa2d tcp: fix dctcp delayed ACK schedule
61c66cc52d42f78bbdd8f2e40b7c0bb9b936a12d tcp: prevent bogus FRTO undos with 
non-SACK flows
48ffbdea28808354b89447fac2d8524c29ce7ab4 tcp: verify the checksum of the first 
data segment in a new connection
4dff97920e13af3e92180eefa6b7712d4eac5e58 tcp: do not overshoot window_clamp in 
tcp_rcv_space_adjust()



Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit

2019-07-24 Thread Eric Dumazet



On 7/24/19 11:17 AM, Mao Wenan wrote:
> There is one report about tcp_write_xmit use-after-free with version 4.4.136:
> 
> BUG: KASAN: use-after-free in tcp_skb_pcount include/net/tcp.h:796 [inline]
> BUG: KASAN: use-after-free in tcp_init_tso_segs net/ipv4/tcp_output.c:1619 
> [inline]
> BUG: KASAN: use-after-free in tcp_write_xmit+0x3fc2/0x4cb0 
> net/ipv4/tcp_output.c:2056
> Read of size 2 at addr 8801d6fc87b0 by task syz-executor408/4195
> 
> CPU: 0 PID: 4195 Comm: syz-executor408 Not tainted 4.4.136-gfb7e319 #59
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
>   7d8f38ecc03be946 8801d73b7710 81e0edad
>  ea00075bf200 8801d6fc87b0  8801d6fc87b0
>  dc00 8801d73b7748 815159b6 8801d6fc87b0
> Call Trace:
>  [] __dump_stack lib/dump_stack.c:15 [inline]
>  [] dump_stack+0xc1/0x124 lib/dump_stack.c:51
>  [] print_address_description+0x6c/0x216 
> mm/kasan/report.c:252
>  [] kasan_report_error mm/kasan/report.c:351 [inline]
>  [] kasan_report.cold.7+0x175/0x2f7 mm/kasan/report.c:408
>  [] __asan_report_load2_noabort+0x14/0x20 
> mm/kasan/report.c:427
>  [] tcp_skb_pcount include/net/tcp.h:796 [inline]
>  [] tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>  [] tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
>  [] __tcp_push_pending_frames+0xa0/0x290 
> net/ipv4/tcp_output.c:2307
>  [] tcp_send_fin+0x176/0xab0 net/ipv4/tcp_output.c:2883
>  [] tcp_close+0xca0/0xf70 net/ipv4/tcp.c:2112
>  [] inet_release+0xff/0x1d0 net/ipv4/af_inet.c:435
>  [] sock_release+0x96/0x1c0 net/socket.c:586
>  [] sock_close+0x16/0x20 net/socket.c:1037
>  [] __fput+0x235/0x6f0 fs/file_table.c:208
>  [] fput+0x15/0x20 fs/file_table.c:244
>  [] task_work_run+0x10f/0x190 kernel/task_work.c:115
>  [] exit_task_work include/linux/task_work.h:21 [inline]
>  [] do_exit+0x9e5/0x26b0 kernel/exit.c:759
>  [] do_group_exit+0x111/0x330 kernel/exit.c:889
>  [] get_signal+0x4ec/0x14b0 kernel/signal.c:2321
>  [] do_signal+0x8b/0x1d30 arch/x86/kernel/signal.c:712
>  [] exit_to_usermode_loop+0x11a/0x160 
> arch/x86/entry/common.c:248
>  [] prepare_exit_to_usermode arch/x86/entry/common.c:283 
> [inline]
>  [] syscall_return_slowpath+0x1b5/0x1f0 
> arch/x86/entry/common.c:348
>  [] int_ret_from_sys_call+0x25/0xa3
> 
> Allocated by task 4194:
>  [] save_stack_trace+0x26/0x50 
> arch/x86/kernel/stacktrace.c:63
>  [] save_stack+0x43/0xd0 mm/kasan/kasan.c:512
>  [] set_track mm/kasan/kasan.c:524 [inline]
>  [] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:616
>  [] kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:554
>  [] slab_post_alloc_hook mm/slub.c:1349 [inline]
>  [] slab_alloc_node mm/slub.c:2615 [inline]
>  [] slab_alloc mm/slub.c:2623 [inline]
>  [] kmem_cache_alloc+0xbe/0x2a0 mm/slub.c:2628
>  [] kmem_cache_alloc_node include/linux/slab.h:350 [inline]
>  [] __alloc_skb+0xe6/0x600 net/core/skbuff.c:218
>  [] alloc_skb_fclone include/linux/skbuff.h:856 [inline]
>  [] sk_stream_alloc_skb+0xa3/0x5d0 net/ipv4/tcp.c:833
>  [] tcp_sendmsg+0xd34/0x2b00 net/ipv4/tcp.c:1178
>  [] inet_sendmsg+0x203/0x4d0 net/ipv4/af_inet.c:755
>  [] sock_sendmsg_nosec net/socket.c:625 [inline]
>  [] sock_sendmsg+0xcc/0x110 net/socket.c:635
>  [] SYSC_sendto+0x21c/0x370 net/socket.c:1665
>  [] SyS_sendto+0x40/0x50 net/socket.c:1633
>  [] entry_SYSCALL_64_fastpath+0x22/0x9e
> 
> Freed by task 4194:
>  [] save_stack_trace+0x26/0x50 
> arch/x86/kernel/stacktrace.c:63
>  [] save_stack+0x43/0xd0 mm/kasan/kasan.c:512
>  [] set_track mm/kasan/kasan.c:524 [inline]
>  [] kasan_slab_free+0x72/0xc0 mm/kasan/kasan.c:589
>  [] slab_free_hook mm/slub.c:1383 [inline]
>  [] slab_free_freelist_hook mm/slub.c:1405 [inline]
>  [] slab_free mm/slub.c:2859 [inline]
>  [] kmem_cache_free+0xbe/0x340 mm/slub.c:2881
>  [] kfree_skbmem+0xcf/0x100 net/core/skbuff.c:635
>  [] __kfree_skb+0x1d/0x20 net/core/skbuff.c:676
>  [] sk_wmem_free_skb include/net/sock.h:1447 [inline]
>  [] tcp_write_queue_purge include/net/tcp.h:1460 [inline]
>  [] tcp_connect_init net/ipv4/tcp_output.c:3122 [inline]
>  [] tcp_connect+0xb24/0x30c0 net/ipv4/tcp_output.c:3261
>  [] tcp_v4_connect+0xf31/0x1890 net/ipv4/tcp_ipv4.c:246
>  [] __inet_stream_connect+0x2a9/0xc30 net/ipv4/af_inet.c:615
>  [] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:676
>  [] SYSC_connect+0x1b8/0x300 net/socket.c:1557
>  [] SyS_connect+0x24/0x30 net/socket.c:1538
>  [] entry_SYSCALL_64_fastpath+0x22/0x9e
> 
> Syzkaller reproducer():
> r0 = socket$packet(0x11, 0x3, 0x300)
> r1 = socket$inet_tcp(0x2, 0x1, 0x0)
> bind$inet(r1, &(0x7f000300)={0x2, 0x4e21, @multicast1}, 0x10)
> connect$inet(r1, &(0x7f000140)={0x2, 0x104e21, @loopback}, 0x10)
> recvmmsg(r1, &(0x7f001e40)=[{{0x0, 0x0, 
> &(0x7f000100)=[{&(0x7f0005c0)=""/88, 0x58}], 0x1}}], 0x1, 0x4000, 
> 0x0)
> sendto$inet(r1, &(0x7f00)="e2f7ad5b661c761edf", 0x9, 0x8080, 0x0, 0x0)
> r2 = fcntl$dupfd(r1, 0x0, r0)
> 

Re: [RFC] performance regression with commit-id ("net: get rid of an signed integer overflow in ip_idents_reserve()")

2019-07-24 Thread Eric Dumazet



On 7/24/19 10:38 AM, Zhangshaokun wrote:
> Hi,
> 
> I've observed a significant performance regression with the following 
> commit-id 
> ("net: get rid of an signed integer overflow in ip_idents_reserve()").

Yes, this UBSAN false positive has been painful.



> 
> Here are my test scenarios:
> Server
> Cmd: iperf3 -s xxx.xxx..xxx -p 1 -i 0 -A 0
> Kenel: 4.19.34
> Server number: 32
> Port: 1 – 10032
> CPU affinity: 0 – 32
> CPU architecture: aarch64
> NUMA node0 CPU(s): 0-23
> NUMA node1 CPU(s): 24-47
> 
> Client
> Cmd: iperf3 -u -c xxx.xxx..xxx -p 1 -l 16 -b 0 -t 0 -i 0 -A 8
> Kenel: 4.19.34
> Client number: 32
> Port: 1 – 10032
> CPU affinity: 0 – 32
> CPU architecture: aarch64
> NUMA node0 CPU(s): 0-23
> NUMA node1 CPU(s): 24-47
> 
> Firstly, with the patch ("net: get rid of an signed integer
> overflow in ip_idents_reserve()") applied,
> the client's cpu is 100%, and the ip_idents_reserve() function's cpu usage is
> very high, but the result is not good.
> 03:08:32 AM IFACE   rxpck/s   txpck/srxkB/stxkB/s   rxcmp/s   
> txcmp/s  rxmcst/s   %ifutil
> 03:08:33 AM  eth0  0.00  0.00  0.00  0.00  0.00  
> 0.00  0.00  0.00
> 03:08:33 AM  eth1  0.00 3461296.00  0.00 196049.97  0.00  
> 0.00  0.00  0.00
> 03:08:33 AMlo  0.00  0.00  0.00  0.00  0.00  
> 0.00  0.00  0.00
> 
> Secondly, reverting that patch and using atomic_add_return() instead, the result is 
> better, as below:
> 03:23:24 AM IFACE   rxpck/s   txpck/srxkB/stxkB/s   rxcmp/s   
> txcmp/s  rxmcst/s   %ifutil
> 03:23:25 AMlo  0.00  0.00  0.00  0.00  0.00  
> 0.00  0.00  0.00
> 03:23:25 AM  eth1  0.00 12834590.00  0.00 726959.20  0.00 
>  0.00  0.00  0.00
> 03:23:25 AM  eth0  7.00 11.00  0.40  2.95  0.00  
> 0.00  0.00  0.00
> 
> Thirdly, with atomics not used in ip_idents_reserve() at all, each cpu
> core allocates its own ID segment
> (such as: cpu core0 allocates IDs 0 – 1023, cpu core1 allocates 1024 – 2047, etc.);
> the result is the best:

Not sure what you mean.

Less entropy in IPv4 ID is not going to help when fragments _are_ needed.

Send 40,000 datagrams of 2000 bytes each, add delays, reorders, and boom, most 
of the packets will be lost.

That your use case does not need proper IP IDs does not mean that we can mess
with them.

If you need to send packets very fast,  maybe use AF_PACKET ?

> 03:27:06 AM IFACE   rxpck/s   txpck/srxkB/stxkB/s   rxcmp/s   
> txcmp/s  rxmcst/s   %ifutil
> 03:27:07 AMlo  0.00  0.00  0.00  0.00  0.00  
> 0.00  0.00  0.00
> 03:27:07 AM  eth1  0.00 14275505.00  0.00 808573.53  0.00 
>  0.00  0.00  0.00
> 03:27:07 AM  eth0  0.00  2.00  0.00  0.18  0.00  
> 0.00  0.00  0.00
> 
> Because atomic operation performance is a bottleneck as the cpu core count
> increases, can we revert the patch or
> use an ID segment for each cpu core instead?


This has been discussed in the past.

https://lore.kernel.org/lkml/b0160f4b-b996-b0ee-405a-3d5f18662...@gmail.com/

We can revert now UBSAN has been fixed.

Or even use Peter patch : 
https://lore.kernel.org/lkml/20181101172739.ga3...@hirez.programming.kicks-ass.net/

However, you will still hit a shared cache line badly, no matter what.

Some arches are known to have terrible LL/SC implementations :/
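
For context, the atomic_add_return() variant under discussion boils down
to something like the sketch below (simplified; the real
ip_idents_reserve() also mixes in a time-based delta). The atomic RMW on
a shared ip_idents[] slot is the cache line every core is fighting over.

/* Simplified sketch, not a verbatim copy of net/ipv4/route.c */
#define IP_IDENTS_SZ_SKETCH 2048
static atomic_t ip_idents_sketch[IP_IDENTS_SZ_SKETCH];

static u32 ip_idents_reserve_sketch(u32 hash, int segs)
{
        atomic_t *p_id = ip_idents_sketch + hash % IP_IDENTS_SZ_SKETCH;

        /* one atomic add per packet train; all cores hashing to the
         * same slot bounce this cache line
         */
        return atomic_add_return(segs, p_id) - segs;
}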



Re: Reminder: 99 open syzbot bugs in net subsystem

2019-07-24 Thread Eric Dumazet



On 7/24/19 3:38 AM, Eric Biggers wrote:
> [This email was generated by a script.  Let me know if you have any 
> suggestions
> to make it better, or if you want it re-generated with the latest status.]
> 
> Of the currently open syzbot reports against the upstream kernel, I've 
> manually
> marked 99 of them as possibly being bugs in the net subsystem.  This category
> only includes the networking bugs that I couldn't assign to a more specific
> component (bpf, xfrm, bluetooth, tls, tipc, sctp, wireless, etc.).  I've 
> listed
> these reports below, sorted by an algorithm that tries to list first the 
> reports
> most likely to be still valid, important, and actionable.
> 
> Of these 99 bugs, 17 were seen in mainline in the last week.
> 
> Of these 99 bugs, 4 were bisected to commits from the following people:
> 
>       Florian Westphal 
>   Ilya Maximets 
>   Eric Dumazet 
>   David Ahern 
> 
> If you believe a bug is no longer valid, please close the syzbot report by
> sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the
> original thread, as explained at https://goo.gl/tpsmEJ#status
> 
> If you believe I misattributed a bug to the net subsystem, please let me know,
> and if possible forward the report to the correct people or mailing list.
>

Some of the bugs have been fixed already, before syzbot found them.

Why force humans to be gentle to bots and actually reply to them ?

I usually simply wait for syzbot to find that the bug does not repro anymore,
but now if you send these emails, we will have even more pressure on us.



