Re: [PATCH v4 2/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Nikolay Borisov


On 15.03.2018 21:47, Kees Cook wrote:
> As part of removing VLAs from the kernel[1], we want to build with -Wvla,
> but it is overly pessimistic and only accepts constant expressions for
> stack array sizes, instead of also constant values. The max() macro
> triggers the warning, so this refactors these uses of max() to use the
> new const_max() instead.
> 
> [1] https://lkml.org/lkml/2018/3/7/621

For the btrfs portion:

Reviewed-by: Nikolay Borisov <nbori...@suse.com>

> 
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
>  drivers/input/touchscreen/cyttsp4_core.c |  2 +-
>  fs/btrfs/tree-checker.c  |  3 ++-
>  lib/vsprintf.c   |  4 ++--
>  net/ipv4/proc.c  |  8 
>  net/ipv6/proc.c  | 10 --
>  5 files changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/input/touchscreen/cyttsp4_core.c 
> b/drivers/input/touchscreen/cyttsp4_core.c
> index 727c3232517c..f89497940051 100644
> --- a/drivers/input/touchscreen/cyttsp4_core.c
> +++ b/drivers/input/touchscreen/cyttsp4_core.c
> @@ -868,7 +868,7 @@ static void cyttsp4_get_mt_touches(struct cyttsp4_mt_data 
> *md, int num_cur_tch)
>   struct cyttsp4_touch tch;
>   int sig;
>   int i, j, t = 0;
> - int ids[max(CY_TMA1036_MAX_TCH, CY_TMA4XX_MAX_TCH)];
> + int ids[const_max(CY_TMA1036_MAX_TCH, CY_TMA4XX_MAX_TCH)];
>  
>   memset(ids, 0, si->si_ofs.tch_abs[CY_TCH_T].max * sizeof(int));
>   for (i = 0; i < num_cur_tch; i++) {
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index c3c8d48f6618..1ddd6cc3c4fc 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -341,7 +341,8 @@ static int check_dir_item(struct btrfs_root *root,
>*/
>   if (key->type == BTRFS_DIR_ITEM_KEY ||
>   key->type == BTRFS_XATTR_ITEM_KEY) {
> - char namebuf[max(BTRFS_NAME_LEN, XATTR_NAME_MAX)];
> + char namebuf[const_max(BTRFS_NAME_LEN,
> +XATTR_NAME_MAX)];
>  
>   read_extent_buffer(leaf, namebuf,
>   (unsigned long)(di + 1), name_len);
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index d7a708f82559..9d5610b643ce 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -744,8 +744,8 @@ char *resource_string(char *buf, char *end, struct 
> resource *res,
>  #define FLAG_BUF_SIZE(2 * sizeof(res->flags))
>  #define DECODED_BUF_SIZE sizeof("[mem - 64bit pref window disabled]")
>  #define RAW_BUF_SIZE sizeof("[mem - flags 0x]")
> - char sym[max(2*RSRC_BUF_SIZE + DECODED_BUF_SIZE,
> -  2*RSRC_BUF_SIZE + FLAG_BUF_SIZE + RAW_BUF_SIZE)];
> + char sym[const_max(2*RSRC_BUF_SIZE + DECODED_BUF_SIZE,
> +2*RSRC_BUF_SIZE + FLAG_BUF_SIZE + RAW_BUF_SIZE)];
>  
>   char *p = sym, *pend = sym + sizeof(sym);
>   int decode = (fmt[0] == 'R') ? 1 : 0;
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index dc5edc8f7564..fad6f989004e 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -46,7 +46,7 @@
>  #include 
>  #include 
>  
> -#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
> +#define TCPUDP_MIB_MAX const_max(UDP_MIB_MAX, TCP_MIB_MAX)
>  
>  /*
>   *   Report socket allocation statistics [m...@utu.fi]
> @@ -404,7 +404,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
> void *v)
>   struct net *net = seq->private;
>   int i;
>  
> - memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
> + memset(buff, 0, sizeof(buff));
>  
>   seq_puts(seq, "\nTcp:");
>   for (i = 0; snmp4_tcp_list[i].name; i++)
> @@ -421,7 +421,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
> void *v)
>   seq_printf(seq, " %lu", buff[i]);
>   }
>  
> - memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
> + memset(buff, 0, sizeof(buff));
>  
>   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
>net->mib.udp_statistics);
> @@ -432,7 +432,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, 
> void *v)
>   for (i = 0; snmp4_udp_list[i].name; i++)
>   seq_printf(seq, " %lu", buff[i]);
>  
> - memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
> + memset(buff, 0, sizeof(buff));
>  
>   /* the UDP and UDP-Lite MIBs are the same */
>   seq_puts(seq, "\nUdpLite:");
> diff --git a/net/ipv6/proc.c 
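
The quoted description turns on the difference between a constant value and a constant expression. A minimal sketch of that point follows. It is not the const_max() from this series (its definition lives in patch 1/2 and is not shown here); SKETCH_CONST_MAX and the two length constants are hypothetical stand-ins. A plain ternary over integer constants is itself an integer constant expression, so the array below is an ordinary fixed-size array and -Wvla has nothing to warn about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in values, not the kernel's definitions. */
#define SKETCH_NAME_LEN  255
#define SKETCH_XATTR_MAX 128

/*
 * Illustrative macro: a bare ternary stays an integer constant
 * expression when both arguments are integer constants. It has none
 * of the type checking that the kernel's max() performs.
 */
#define SKETCH_CONST_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Compile-time check: this only builds if the macro is a constant expression. */
_Static_assert(SKETCH_CONST_MAX(SKETCH_NAME_LEN, SKETCH_XATTR_MAX) == 255,
               "SKETCH_CONST_MAX must be usable at compile time");

size_t sketch_namebuf_size(void)
{
    /* Fixed-size stack array, not a VLA, because the size is constant. */
    char namebuf[SKETCH_CONST_MAX(SKETCH_NAME_LEN, SKETCH_XATTR_MAX)];

    namebuf[0] = '\0';
    assert(sizeof(namebuf) == SKETCH_NAME_LEN);
    return sizeof(namebuf);
}
```

The trade-off in this sketch is that the ternary form evaluates its arguments twice and does no type comparison, which is acceptable only because it is meant for compile-time constants.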

Re: [PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-12 Thread Nikolay Borisov


On  8.02.2018 13:43, Tetsuo Handa wrote:
>>From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   
>   sshd/24800 is trying to acquire lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>  CPU0
>  
> lock(fs_reclaim);
> lock(fs_reclaim);
> 
>*** DEADLOCK ***
> 
>May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>#0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] 
> tcp_sendmsg+0x19/0x40
>#1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>dump_stack+0xbc/0x13f
>__lock_acquire+0xa09/0x2040
>lock_acquire+0x12e/0x350
>fs_reclaim_acquire.part.102+0x29/0x30
>kmem_cache_alloc+0x3d/0x2c0
>alloc_extent_state+0xa7/0x410
>__clear_extent_bit+0x3ea/0x570
>try_release_extent_mapping+0x21a/0x260
>__btrfs_releasepage+0xb0/0x1c0
>btrfs_releasepage+0x161/0x170
>try_to_release_page+0x162/0x1c0
>shrink_page_list+0x1d5a/0x2fb0
>shrink_inactive_list+0x451/0x940
>shrink_node_memcg.constprop.88+0x4c9/0x5e0
>shrink_node+0x12d/0x260
>try_to_free_pages+0x418/0xaf0
>__alloc_pages_slowpath+0x976/0x1790
>__alloc_pages_nodemask+0x52c/0x5c0
>new_slab+0x374/0x3f0
>___slab_alloc.constprop.81+0x47e/0x5a0
>__slab_alloc.constprop.80+0x32/0x60
>__kmalloc_track_caller+0x267/0x310
>__kmalloc_reserve.isra.40+0x29/0x80
>__alloc_skb+0xee/0x390
>sk_stream_alloc_skb+0xb8/0x340
>tcp_sendmsg_locked+0x8e6/0x1d30
>tcp_sendmsg+0x27/0x40
>inet_sendmsg+0xd0/0x310
>sock_write_iter+0x17a/0x240
>__vfs_write+0x2ab/0x380
>vfs_write+0xfb/0x260
>SyS_write+0xb6/0x140
>do_syscall_64+0x1e5/0xc05
>entry_SYSCALL64_slow_path+0x25/0x25
> 

I think I've hit another incarnation of that one. The call stack is:
http://paste.opensuse.org/3f22d013

The call stack, cleaned of all the ? entries, looks like:

__lock_acquire+0x2d8a/0x4b70
lock_acquire+0x110/0x330
kmem_cache_alloc+0x29/0x2c0
__clear_extent_bit+0x488/0x800
try_release_extent_mapping+0x288/0x3c0
__btrfs_releasepage+0x6c/0x140
shrink_page_list+0x227e/0x3110
shrink_inactive_list+0x414/0xdb0
shrink_node_memcg+0x7c8/0x1250
shrink_node+0x2ae/0xb50
do_try_to_free_pages+0x2b1/0xe20
try_to_free_pages+0x205/0x570
 __alloc_pages_nodemask+0xb91/0x2160
new_slab+0x27a/0x4e0
___slab_alloc+0x355/0x610
 __slab_alloc+0x4c/0xa0
kmem_cache_alloc+0x22d/0x2c0
mempool_alloc+0xe1/0x280
bio_alloc_bioset+0x1d7/0x830
ext4_mpage_readpages+0x99f/0x1000 <-
__do_page_cache_readahead+0x4be/0x840
filemap_fault+0x8c8/0xfc0
ext4_filemap_fault+0x7d/0xb0
__do_fault+0x7a/0x150
__handle_mm_fault+0x1542/0x29d0
__do_page_fault+0x557/0xa30
async_page_fault+0x4c/0x60


There is no fs stacking going on here and that is 4.15-rc9.


> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>   return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>   goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> 
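
The effect of the quoted test can be modelled in user space. The sketch below is only an illustration of the predicate described above: the SK_ flag values are hypothetical, and the function names are invented for this example. Each function returns true where the kernel test would `return false`, i.e. where the fs_reclaim annotation is skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real flags are defined in kernel headers. */
#define SK_PF_MEMALLOC    0x1u
#define SK_GFP_NOMEMALLOC 0x2u

/*
 * Quoted (old) test: a PF_MEMALLOC task skips the annotation only
 * when __GFP_NOMEMALLOC is clear.
 */
bool sk_old_skips_annotation(unsigned int task_flags, unsigned int gfp_mask)
{
    return (task_flags & SK_PF_MEMALLOC) && !(gfp_mask & SK_GFP_NOMEMALLOC);
}

/*
 * Test as the commit message proposes it: the __GFP_NOMEMALLOC part is
 * dropped, so any PF_MEMALLOC task skips the annotation.
 */
bool sk_new_skips_annotation(unsigned int task_flags, unsigned int gfp_mask)
{
    (void)gfp_mask; /* no longer consulted */
    return (task_flags & SK_PF_MEMALLOC) != 0;
}

/* The reported case: already in reclaim, allocating with __GFP_NOMEMALLOC. */
bool sk_reported_case_is_fixed(void)
{
    unsigned int task = SK_PF_MEMALLOC;
    unsigned int gfp = SK_GFP_NOMEMALLOC;

    assert(!sk_old_skips_annotation(task, gfp)); /* old: takes the 'fake' lock again */
    return sk_new_skips_annotation(task, gfp);   /* new: annotation skipped */
}
```

In this model the two predicates differ only for a PF_MEMALLOC task that passes __GFP_NOMEMALLOC, which is the combination the __alloc_skb() path in the trace produces.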

[PATCHv2] igmp: Fix regression caused by igmp sysctl namespace code.

2017-08-09 Thread Nikolay Borisov
Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is only called as part of per-namespace initialization, only if
CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
compiled out, causing the igmp pernet ops to not be registered and those sysctls
being left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization into inet_init_net, that way they will always have
sane values.

Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz...@gmail.com>
Cc: <sta...@vger.kernel.org> # 4.6
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---

Changes since v1:
 * Moved the sysctl initialization to inet_init_net based on Eric Dumazet's 
 suggestion


 net/ipv4/af_inet.c | 7 +++
 net/ipv4/igmp.c| 6 --
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 76c2077c3f5b..2e548eca3489 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1731,6 +1731,13 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.sysctl_ip_prot_sock = PROT_SOCK;
 #endif
 
+   /* Some igmp sysctl, whose values are always used */
+   net->ipv4.sysctl_igmp_max_memberships = 20;
+   net->ipv4.sysctl_igmp_max_msf = 10;
+   /* IGMP reports for link-local multicast groups are enabled by default */
+   net->ipv4.sysctl_igmp_llm_reports = 1;
+   net->ipv4.sysctl_igmp_qrv = 2;
+
return 0;
 }
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 28f14afd0dd3..498706b072fb 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2974,12 +2974,6 @@ static int __net_init igmp_net_init(struct net *net)
goto out_sock;
}
 
-   /* Sysctl initialization */
-   net->ipv4.sysctl_igmp_max_memberships = 20;
-   net->ipv4.sysctl_igmp_max_msf = 10;
-   /* IGMP reports for link-local multicast groups are enabled by default */
-   net->ipv4.sysctl_igmp_llm_reports = 1;
-   net->ipv4.sysctl_igmp_qrv = 2;
return 0;
 
 out_sock:
-- 
2.7.4
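
Why a zero-initialized sysctl matters can be shown with a small user-space model. This is a sketch under stated assumptions, not the kernel code: struct sketch_netns and all sketch_ functions are hypothetical, and the limit check is assumed to have the form "current count >= limit", failing with -ENOBUFS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-namespace igmp sysctl. */
struct sketch_netns {
    int igmp_max_memberships;
};

/* Models a freshly allocated namespace whose igmp init never ran. */
void sketch_netns_zero(struct sketch_netns *ns)
{
    memset(ns, 0, sizeof(*ns));
}

/* Models the always-run initialization the patch adds (default of 20). */
void sketch_inet_init(struct sketch_netns *ns)
{
    ns->igmp_max_memberships = 20;
}

/* Assumed shape of the join-time limit check. */
int sketch_join_group(const struct sketch_netns *ns, int current_memberships)
{
    assert(ns != NULL);
    if (current_memberships >= ns->igmp_max_memberships)
        return -ENOBUFS; /* a limit of 0 rejects even the first join */
    return 0;
}
```

With the limit left at 0 the very first join is rejected; once the always-run initialization has set it to 20, the same call succeeds.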



[PATCH] igmp: Fix regression caused by igmp sysctl namespace code.

2017-08-08 Thread Nikolay Borisov
Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is only called as part of per-namespace initialization, only if
CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
compiled out, causing the igmp pernet ops to not be registered and those sysctls
being left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization back into tcp_sk_init, that way they will always have
sane values.

Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz...@gmail.com>
Tested-by: Gerardo Exequiel Pozzi <vmlinuz...@gmail.com>
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
Cc: <sta...@vger.kernel.org> # 4.6
---
 net/ipv4/igmp.c | 6 --
 net/ipv4/tcp_ipv4.c | 6 ++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 28f14afd0dd3..498706b072fb 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2974,12 +2974,6 @@ static int __net_init igmp_net_init(struct net *net)
goto out_sock;
}
 
-   /* Sysctl initialization */
-   net->ipv4.sysctl_igmp_max_memberships = 20;
-   net->ipv4.sysctl_igmp_max_msf = 10;
-   /* IGMP reports for link-local multicast groups are enabled by default */
-   net->ipv4.sysctl_igmp_llm_reports = 1;
-   net->ipv4.sysctl_igmp_qrv = 2;
return 0;
 
 out_sock:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a20e7f03d5f7..64ba2c93d396 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2528,6 +2528,12 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_window_scaling = 1;
net->ipv4.sysctl_tcp_timestamps = 1;
 
+   net->ipv4.sysctl_igmp_max_memberships = 20;
+   net->ipv4.sysctl_igmp_max_msf = 10;
+   /* IGMP reports for link-local multicast groups are enabled by default */
+   net->ipv4.sysctl_igmp_llm_reports = 1;
+   net->ipv4.sysctl_igmp_qrv = 2;
+
return 0;
 fail:
tcp_sk_exit(net);
-- 
2.7.4



Re: net: BUG in unix_notinflight

2017-03-07 Thread Nikolay Borisov

>>
>>
>> New report from linux-next/c0b7b2b33bd17f7155956d0338ce92615da686c9
>>
>> [ cut here ]
>> kernel BUG at net/unix/garbage.c:149!
>> invalid opcode:  [#1] SMP KASAN
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 0 PID: 1806 Comm: syz-executor7 Not tainted 4.10.0-next-20170303+ #6
>> Hardware name: Google Google Compute Engine/Google Compute Engine,
>> BIOS Google 01/01/2011
>> task: 880121c64740 task.stack: 88012c9e8000
>> RIP: 0010:unix_notinflight+0x417/0x5d0 net/unix/garbage.c:149
>> RSP: 0018:88012c9ef0f8 EFLAGS: 00010297
>> RAX: 880121c64740 RBX: 11002593de23 RCX: 8801c490c628
>> RDX:  RSI: 11002593de27 RDI: 8557e504
>> RBP: 88012c9ef220 R08: 0001 R09: 
>> R10: dc00 R11: ed002593de55 R12: 8801c490c0c0
>> R13: 88012c9ef1f8 R14: 85101620 R15: dc00
>> FS:  013d3940() GS:8801dbe0() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 01fd8cd8 CR3: 0001cce69000 CR4: 001426f0
>> Call Trace:
>>  unix_detach_fds.isra.23+0xfa/0x170 net/unix/af_unix.c:1490
>>  unix_destruct_scm+0xf4/0x200 net/unix/af_unix.c:1499
> 
> The problem here is there is no lock protecting concurrent unix_detach_fds()
> even though unix_notinflight() is already serialized, if we call
> unix_notinflight()
> twice on the same file pointer, we trigger this bug...
> 
> I don't know what is the right lock here to serialize it.
> 


I reported something similar a while ago
https://lists.gt.net/linux/kernel/2534612

And Miklos Szeredi then produced the following patch:

https://patchwork.kernel.org/patch/9305121/

However, this was never applied. I wonder if the patch makes sense?


Re: [PATCH] ipv4: Namespaceify tcp_tw_reuse knob

2016-12-24 Thread Nikolay Borisov


On 24.12.2016 14:43, Haishuang Yan wrote:
> Signed-off-by: Haishuang Yan <yanhaishu...@cmss.chinamobile.com>

Reviewed-by: Nikolay Borisov <n.borisov.l...@gmail.com>



Re: kernel BUG at net/unix/garbage.c:149!"

2016-09-27 Thread Nikolay Borisov
[Added Dave Miller to see what's the status of this patch]

On 08/30/2016 12:18 PM, Miklos Szeredi wrote:
> On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi  wrote:
>> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi  wrote:
> 
>> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
>> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
>> 130
>> crash> p unix_tot_inflight
>> unix_tot_inflight = $2 = 135
>>
>> We've lost track of a total of five inflight sockets, so it's not a
>> one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
>> the solution.
> 
> Okay, found one bug: gc assumes that in-flight sockets that don't have
> an external ref can't gain one while unix_gc_lock is held.  That is
> true because unix_notinflight() will be called before detaching fds,
> which takes unix_gc_lock.  Only MSG_PEEK was somehow overlooked.  That
> one also clones the fds, also keeping them in the skb.  But through
> MSG_PEEK an external reference can definitely be gained without ever
> touching unix_gc_lock.
> 
> Not sure whether the reported bug can be explained by this.  Can you
> confirm the MSG_PEEK was used in the setup?
> 
> Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK?
> 
> Anyway, attaching a fix that works by acquiring unix_gc_lock in case
> of MSG_PEEK also.  It is trivially correct, but I haven't tested it.
> 
> Thanks,
> Miklos
> 


Dave,

What's the status of https://patchwork.ozlabs.org/patch/664062/ , is
this going to be picked up ?

Regards,
Nikolay
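
For anyone picking up the quoted request for an SCM_RIGHTS + MSG_PEEK test, a possible building block is sketched below. It is a single iteration, not a stress test, and it does not try to race with the garbage collector. All sketch_ names are hypothetical; the socket calls and CMSG macros are the standard ones. It passes an AF_UNIX socket over an AF_UNIX socket, peeks the message (which hands out a duplicate of the fd), then receives it for real.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union sketch_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one data byte with fd attached as SCM_RIGHTS. Returns 0 on success. */
int sketch_send_fd(int sock, int fd)
{
    char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union sketch_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive (or peek, with MSG_PEEK in flags) one fd. Returns the fd or -1. */
int sketch_recv_fd(int sock, int flags)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union sketch_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, flags) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* One iteration: send a unix socket over itself, peek it, then receive it. */
int sketch_peek_then_recv(void)
{
    int sv[2], peeked, received, ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (sketch_send_fd(sv[0], sv[0]) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    peeked = sketch_recv_fd(sv[1], MSG_PEEK); /* fd gained via peek */
    received = sketch_recv_fd(sv[1], 0);      /* fd from the real receive */
    ok = peeked >= 0 && received >= 0 && peeked != received;

    if (peeked >= 0)
        close(peeked);
    if (received >= 0)
        close(received);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

An actual stress test would presumably run many such iterations from several threads while other sockets are being closed, so that garbage collection overlaps with the peeks; that part is left out here.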


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-30 Thread Nikolay Borisov


On 08/30/2016 12:18 PM, Miklos Szeredi wrote:
> On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi  wrote:
>> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi  wrote:
> 
>> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
>> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
>> 130
>> crash> p unix_tot_inflight
>> unix_tot_inflight = $2 = 135
>>
>> We've lost track of a total of five inflight sockets, so it's not a
>> one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
>> the solution.
> 
> Okay, found one bug: gc assumes that in-flight sockets that don't have
> an external ref can't gain one while unix_gc_lock is held.  That is
> true because unix_notinflight() will be called before detaching fds,
> which takes unix_gc_lock.  Only MSG_PEEK was somehow overlooked.  That
> one also clones the fds, also keeping them in the skb.  But through
> MSG_PEEK an external reference can definitely be gained without ever
> touching unix_gc_lock.
> 
> Not sure whether the reported bug can be explained by this.  Can you
> confirm the MSG_PEEK was used in the setup?
> 
> Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK?
> 
> Anyway, attaching a fix that works by acquiring unix_gc_lock in case
> of MSG_PEEK also.  It is trivially correct, but I haven't tested it.

I have no way of being 100% sure, but looking through nginx's source code
it seems they do use MSG_PEEK on several occasions. This issue has
apparently been very hard to reproduce: I have hundreds of servers
running a lot of nginx processes and it has been triggered only once.

On a different note: if I inspect a live node without this patch, should
I expect to see the discrepancy between gc_inflight_list and
unix_tot_inflight, and should it be gone with this patch applied?

> 
> Thanks,
> Miklos
> 


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-24 Thread Nikolay Borisov
On Thu, Aug 25, 2016 at 12:40 AM, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On 24.08.2016 16:24, Nikolay Borisov wrote:
[SNIP]
>
> One commit which could have to do with that is
>
> commit fc64869c48494a401b1fb627c9ecc4e6c1d74b0d
> Author: Andrey Ryabinin <aryabi...@virtuozzo.com>
> Date:   Wed May 18 19:19:27 2016 +0300
>
> net: sock: move ->sk_shutdown out of bitfields.
>
> but that is only a wild guess.
>
> Which unix_sock did you extract specifically in the url you provided? In
> unix_notinflight we are specifically checking an unix domain socket that
> is itself being transferred over another af_unix domain socket and not
> the unix domain socket being released at this point.

So this is the state of the socket that is being passed to
unix_notinflight. I have a complete crashdump so if you need more info
to diagnose it I'm happy to provide it. I'm not too familiar with the
code in question so I will need a bit of time to grasp what actually
is happening.

>
> Can you reproduce this and maybe also with a newer kernel?

Unfortunately I cannot reproduce this, since it happened on a
production server, and I cannot change the kernel there either. But
clearly there is something wrong, and given that this is a stable kernel
and no relevant changes have gone into the latest stable, I believe the
problem (albeit hardly reproducible) would still persist.

>
> Thanks for the report,
> Hannes
>


kernel BUG at net/unix/garbage.c:149!"

2016-08-24 Thread Nikolay Borisov
Hello, 

I hit the following BUG: 

[1851513.239831] [ cut here ]
[1851513.240079] kernel BUG at net/unix/garbage.c:149!
[1851513.240313] invalid opcode:  [#1] SMP 
[1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G   O
4.4.14-clouder3 #26
[1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[1851513.248966] task: 883b0f6f ti: 880189cf task.ti: 
880189cf
[1851513.249361] RIP: 0010:[]  [] 
unix_notinflight+0x8d/0x90
[1851513.249846] RSP: 0018:880189cf3cf8  EFLAGS: 00010246
[1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: 
8807f9967330
[1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: 
82073f10
[1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: 
00018021
[1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: 
883fbaf5
[1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: 
881a7c7792c0
[1851513.252083] FS:  7f4f19573720() GS:883fff3a() 
knlGS:
[1851513.252481] CS:  0010 DS:  ES:  CR0: 80050033
[1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: 
001406e0
[1851513.253116] Stack:
[1851513.253345]   880189cf3d40 880189cf3d28 
815f4383
[1851513.254022]  8839ee11a800 8839ee11a800 880189cf3d60 
815f53b8
[1851513.254685]   883406788de0  

[1851513.255360] Call Trace:
[1851513.255594]  [] unix_detach_fds.isra.19+0x43/0x50
[1851513.255851]  [] unix_destruct_scm+0x48/0x80
[1851513.256090]  [] skb_release_head_state+0x4f/0xb0
[1851513.256328]  [] skb_release_all+0x12/0x30
[1851513.256564]  [] kfree_skb+0x32/0xa0
[1851513.256810]  [] unix_release_sock+0x1e4/0x2c0
[1851513.257046]  [] unix_release+0x20/0x30
[1851513.257284]  [] sock_release+0x1f/0x80
[1851513.257521]  [] sock_close+0x12/0x20
[1851513.257769]  [] __fput+0xea/0x1f0
[1851513.258005]  [] fput+0xe/0x10
[1851513.258244]  [] task_work_run+0x7f/0xb0
[1851513.258488]  [] exit_to_usermode_loop+0xc0/0xd0
[1851513.258728]  [] syscall_return_slowpath+0x80/0xf0
[1851513.258983]  [] int_ret_from_sys_call+0x25/0x9f
[1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 02 00 
00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb b8 <0f> 0b 
90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54 
[1851513.268473] RIP  [] unix_notinflight+0x8d/0x90
[1851513.268793]  RSP 

That's essentially BUG_ON(list_empty(&u->link));

I see that all the code involving the ->link member hasn't really been 
touched since it was introduced in 2007. So this must be a latent bug. 
This is the first time I've observed it. The state 
of the struct unix_sock can be found here http://sprunge.us/WCMW . Evidently, 
there are no inflight sockets. 

Regards, 
Nikolay 
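
To make the failing assertion concrete, here is a simplified user-space model of the in-flight accounting. It is a sketch, not the code in net/unix/garbage.c: the sk_ names are hypothetical, there is no locking, and sk_notinflight() returns false at the point where the kernel would hit the BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's list_head. */
struct sk_list {
    struct sk_list *next, *prev;
};

static void sk_list_init(struct sk_list *l) { l->next = l->prev = l; }
static bool sk_list_empty(const struct sk_list *l) { return l->next == l; }

static void sk_list_add_tail(struct sk_list *entry, struct sk_list *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static void sk_list_del_init(struct sk_list *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    sk_list_init(entry);
}

/* Hypothetical stand-in for the relevant part of struct unix_sock. */
struct sk_sock {
    long inflight;       /* how many times this socket is in flight */
    struct sk_list link; /* membership in the global in-flight list */
};

static struct sk_list sk_gc_inflight_list = { &sk_gc_inflight_list, &sk_gc_inflight_list };
unsigned int sk_tot_inflight;

void sk_sock_init(struct sk_sock *u)
{
    u->inflight = 0;
    sk_list_init(&u->link);
}

/* The socket's fd was attached to a message: count it, list it on first use. */
void sk_inflight(struct sk_sock *u)
{
    assert(u != NULL);
    if (u->inflight++ == 0)
        sk_list_add_tail(&u->link, &sk_gc_inflight_list);
    sk_tot_inflight++;
}

/* The fd was detached. Returns false where the kernel would BUG. */
bool sk_notinflight(struct sk_sock *u)
{
    if (sk_list_empty(&u->link))
        return false; /* models BUG_ON(list_empty(&u->link)) */
    if (--u->inflight == 0)
        sk_list_del_init(&u->link);
    sk_tot_inflight--;
    return true;
}
```

In this model one extra sk_notinflight() call on a socket whose count has already dropped to zero is enough to reach the assertion, since the socket has been unlinked by then.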

 


Slow veth performance over ipoib interface on 4.7.0 (and earlier) (Was Re: [IPOIB] Excessive TX packet drops due to IPOIB_MAX_PATH_REC_QUEUE)

2016-08-04 Thread Nikolay Borisov


On 08/01/2016 11:56 AM, Erez Shitrit wrote:
> The GID (9000:0:2800:0:bc00:7500:6e:d8a4) is not regular, not from
> local subnet prefix.
> why is that?
>

So I managed to debug this and it turns out the problem lies in the
interaction between veth and ipoib:

I've discovered the following strange thing. If I have a veth pair whose
2 devices are in different net namespaces, as shown in the scripts
I have attached, then the performance of sending a file, originating from
the veth interface inside the non-init net namespace and going across the
ipoib interface, is very slow (100kb). For simple reproduction I'm attaching
2 scripts which have to be run on 2 machines, with the respective IP addresses
set in them. The sending node then initiates a simple file copy over NC.
I've observed this behavior on upstream 4.4, 4.5.4 and 4.7.0 kernels, both
with ipv4 and ipv6 addresses. Here is what the debug log of the ipoib
module shows:

ib%d: max_srq_sge=128
ib%d: max_cm_mtu = 0xfff0, num_frags=16
ib0: enabling connected mode will cause multicast packet drops
ib0: mtu > 4092 will cause multicast packet drops.
ib0: bringing up interface
ib0: starting multicast thread
ib0: joining MGID ff12:401b::::::
ib0: restarting multicast task
ib0: adding multicast entry for mgid ff12:601b::::::0001
ib0: restarting multicast task
ib0: adding multicast entry for mgid ff12:401b::::::0001
ib0: join completion for ff12:401b:::::: (status 0)
ib0: Created ah 88081063ea80
ib0: MGID ff12:401b:::::: AV 88081063ea80, LID 
0xc000, SL 0
ib0: joining MGID ff12:601b::::::0001
ib0: joining MGID ff12:401b::::::0001
ib0: successfully started all multicast joins
ib0: join completion for ff12:601b::::::0001 (status 0)
ib0: Created ah 880839084680
ib0: MGID ff12:601b::::::0001 AV 880839084680, LID 
0xc002, SL 0
ib0: join completion for ff12:401b::::::0001 (status 0)
ib0: Created ah 88081063e280
ib0: MGID ff12:401b::::::0001 AV 88081063e280, LID 
0xc004, SL 0

When the transfer is initiated I can see the following errors
on the sending node:

ib0: PathRec status -22 for GID 0401::1400::a0a8::1c01:4d36
ib0: neigh free for 03 0401::1400::a0a8::1c01:4d36
ib0: Start path record lookup for 0401::1400::a0a8::1c01:4d36
ib0: PathRec status -22 for GID 0401::1400::a0a8::1c01:4d36
ib0: neigh free for 03 0401::1400::a0a8::1c01:4d36
ib0: Start path record lookup for 0401::1400::a0a8::1c01:4d36
ib0: PathRec status -22 for GID 0401::1400::a0a8::1c01:4d36
ib0: neigh free for 03 0401::1400::a0a8::1c01:4d36
ib0: Start path record lookup for 0401::1400::a0a8::1c01:4d36
ib0: PathRec status -22 for GID 0401::1400::a0a8::1c01:4d36
ib0: Start path record lookup for 0401::1400::a0a8::1c01:4d36
ib0: PathRec status -22 for GID 0401::1400::a0a8::1c01:4d36
ib0: neigh free for 03 0401::1400::a0a8::1c01:4d36
ib0: neigh free for 03 0401::1400::a0a8::1c01:4d36

Here is the port guid of the sending node: 0x001175772664 and
on the receiving one: 0x001175774d36

Here is what the paths look like on the sending node; note the
paths being requested from the veth interface:

cat /sys/kernel/debug/ipoib/ib0_path
GID: 401:0:1400:0:a0a8::1c01:4d36
complete: no

GID: 401:0:1400:0:a410::1c01:4d36
complete: no

GID: fe80:0:0:0:11:7500:77:2a1a
complete: yes
DLID: 0x0004
SL: 0
rate: 40.0 Gb/sec

GID: fe80:0:0:0:11:7500:77:4d36
complete: yes
DLID: 0x000a
SL: 0
rate: 40.0 Gb/sec

If I test the same scenario but, instead of using veth devices, create
the device in the non-init net namespace via the following commands,
I can achieve sensible speeds:
ip link add link ib0 name ip1 type ipoib
ip link set dev ip1 netns test-netnamespace




 
[Snipped a lot of useless stuff]


receive-node.sh
Description: application/shellscript


sending-node.sh
Description: application/shellscript


ref count of ib_ipoib.ko not incremented when an ip address is set

2016-07-22 Thread Nikolay Borisov
Hello, 

I accidentally saw that even having an ip address on an 
ipoib interface doesn't increment the usage count of the 
ib_ipoib.ko module: 

ip a l dev ib0
14: ib0:  mtu 65520 qdisc pfifo_fast state UP 
qlen 256
link/infiniband 80:00:02:d4:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:00:f8:31 
brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 172.16.0.150/24 brd 172.16.0.255 scope global ib0
   valid_lft forever preferred_lft forever

lsmod | grep ib_ipoib
ib_ipoib   79026  0 
ib_cm  34144  3 ib_ipoib,ib_ucm,rdma_cm
ib_sa  26598  5 ib_ipoib,rdma_ucm,rdma_cm,ib_cm,mlx4_ib
ib_core97620  10 
ib_ipoib,ib_ucm,ib_uverbs,rdma_cm,iw_cm,ib_umad,ib_cm,mlx4_ib,ib_sa,ib_mad
ipv6  374806  259 ib_ipoib,rdma_cm,ib_addr,[permanent]


In this case I can rmmod ib_ipoib.ko, which would remove the interface,
but in practice I've observed that when an IP address is set
the underlying module providing the interface usually gets
its ref count incremented. Is this normal, or is a
refcount increment missing?

Regards, 
Nikolay 


Re: [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace

2016-06-06 Thread Nikolay Borisov


On 06/06/2016 11:05 AM, Cyrill Gorcunov wrote:
> On Wed, Jun 01, 2016 at 10:52:57AM +0300, Nikolay Borisov wrote:
>> This patch adds the necessary members to user_struct. The idea behind
>> the solution is really simple - user the userns pointers as keys into
>> a hash table which holds the inotify instances/watches counts. This
>> allows to account the limits per userns rather than per real user,
>> which makes certain scenarios such as a single mapped user in a
>> container deplete the inotify resources for all other users, which
>> map to the exact same real user.
>>
>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
> ...
>> +static inline unsigned long inotify_dec_return_dev(struct user_struct *user,
>> +   void *key)
>> +{
>> +struct inotify_state *state;
>> +unsigned long ret;
>> +
>> +spin_lock(&user->inotify_lock);
>> +state = __find_inotify_state(user, key);
>> +ret = --state->inotify_devs;
>> +spin_unlock(&user->inotify_lock);
>> +
>> +return ret;
>> +}
> 
> Hi Nikolay! Could you please explain why this new function is not used 
> anywhere
> in other patches or I miss something obvious?

Hi Cyrill,

It seems this is a left-over from an earlier, internal version of this
patchset. You can disregard it. Also, given the direction that the
discussion with Eric took I think I will be redesigning the solution
entirely. Thanks for taking the time to read the code!


Nikolay


Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

2016-06-06 Thread Nikolay Borisov


On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
> Nikolay Borisov <ker...@kyup.com> writes:
> 
>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>
>>> Nikolay please see my question for you at the end.
> [snip] 
>>> All of that said there is definitely a practical question that needs to
>>> be asked.  Nikolay how did you get into this situation?  A typical user
>>> namespace configuration will set up uid and gid maps with the help of a
>>> privileged program and not map the uid of the user who created the user
>>> namespace.  Thus avoiding exhausting the limits of the user who created
>>> the container.
>>
>> Right but imagine having multiple containers with identical uid/gid maps
>> for LXC-based setups imagine this:
>>
>> lxc.id_map = u 0 1337 65536
> 
> So I am only moderately concerned when the containers have overlapping
> ids.  Because at some level overlapping ids means they are the same
> user.  This is certainly true for file permissions and for other
> permissions.  To isolate one container from another it fundamentally
> needs to have separate uids and gids on the host system.
> 
>> Now all processes which are running with the same user on different
>> containers will actually share the underlying user_struct thus the
>> inotify limits. In such cases even running multiple instances of 'tail'
>> in one container will eventually use all allowed inotify/mark instances.
>> For this to happen you needn't also have complete overlap of the uid
>> map, it's enough to have at least one UID between 2 containers overlap.
>>
>>
>> So the risk of exhaustion doesn't apply to the privileged user that
>> created the container and the uid mapping, but rather the users under
>> which the various processes in the container are running. Does that make
>> it clear?
> 
> Yes.  That is clear.
> 
>>> Which makes me personally more worried about escaping the existing
>>> limits than exhausting the limits of a particular user.
>>
>> So I thought bit about it and I guess a solution can be concocted which
>> utilize the hierarchical nature of page counter, and the inotify limits
>> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
>> admin can set one fairly large on the init_user_ns and then in every
>> namespace created one can set smaller limits. That way for a branch in
>> the tree (in the nomenclature you used in your previous reply to me) you
>> will really be upper-bound to the limit set in the namespace which have
>> ->level = 1. For the width of the tree, you will be bound by the
>> "global" init_user_ns limits. How does that sound?
> 
> As an addendum to that design.  I think there should be an additional
> sysctl or two that specifies how much the limit decreases when creating
> a new user namespace and when creating a new user in that user
> namespace.  That way with a good selection of limits and a limit
> decrease people can use the kernel defaults without needing to change
> them.

I agree that a sysctl which controls how the limits are set for new
namespaces is a good idea. I think it's best if this is a percentage
rather than some absolute value. I'm not sure about the sysctl for when
a user is added in a namespace, though, since just adding a new user
should fall under the limits of the current userns.
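
To illustrate the percentage idea, here is a minimal userspace sketch. It
is only an assumption of how such a knob could be applied; child_limit()
and its pct parameter are hypothetical names, not part of this series.

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the limit of a newly created user namespace
 * from its parent's limit and a percentage knob (0..100).
 */
static unsigned long child_limit(unsigned long parent_limit, unsigned int pct)
{
    assert(pct <= 100);

    /*
     * Split the multiplication so that parent_limit * pct cannot overflow;
     * the result is still exactly floor(parent_limit * pct / 100).
     */
    return parent_limit / 100 * pct + parent_limit % 100 * pct / 100;
}
```

With a knob of 50, each nesting level would get half of its parent's
limit, so deeper namespaces get progressively smaller budgets.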

Also, should those sysctls be global or per-namespace? At this point I'm
more inclined to have a global sysctl and maybe refine it in the future
if the need arises.


> 
> Having default settings that are good enough 99% of the time and that
> people don't need to tune, would be my biggest requirement (aside from
> being light-weight) for merging something like this.
> 
> If things are set and forget and even the container case does not need to
> be aware then I think we have a design sufficiently robust and different
> from what cgroups is doing to make it worth while to have a userns based
> solution.

Provided that we agree on the overall design (so far it seems we just
need to iron out the details of the sysctl), I'll be happy to implement
this.
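
For reference, the hierarchical accounting discussed above can be sketched
in userspace as follows. This is a simplified, single-threaded model
loosely inspired by the try-charge pattern of a hierarchical counter; it
is not the mm/page_counter.c code, and struct ns_counter and both helpers
are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One counter per user namespace; parent is NULL for the root namespace. */
struct ns_counter {
    unsigned long count;
    unsigned long limit;
    struct ns_counter *parent;
};

/*
 * Charge one object to @c and to every ancestor. If any level is already
 * at its limit, roll back the levels charged so far and fail.
 */
static bool ns_counter_try_charge(struct ns_counter *c)
{
    struct ns_counter *it, *undo;

    for (it = c; it; it = it->parent) {
        if (it->count >= it->limit)
            goto fail;
        it->count++;
    }
    return true;

fail:
    for (undo = c; undo != it; undo = undo->parent)
        undo->count--;
    return false;
}

/* Release one previously charged object from @c and all its ancestors. */
static void ns_counter_uncharge(struct ns_counter *c)
{
    for (; c; c = c->parent) {
        assert(c->count > 0);
        c->count--;
    }
}
```

In this model a branch of the tree is bounded by the smallest limit on the
path to the root, while the width of the tree is bounded by the root's
limit, which is the property described in the quoted proposal.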


> 
> I can see a lot of different limits implemented this way.
> 
> Eric


Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

2016-06-03 Thread Nikolay Borisov


On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
> 
> Nikolay please see my question for you at the end.
> 
> Jan Kara <j...@suse.cz> writes:
> 
>> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>>> Cc'd the containers list.
>>>
>>> Nikolay Borisov <ker...@kyup.com> writes:
>>>
>>>> Currently the inotify instances/watches are being accounted in the 
>>>> user_struct structure. This means that in setups where multiple 
>>>> users in unprivileged containers map to the same underlying 
>>>> real user (e.g. user_struct) the inotify limits are going to be 
>>>> shared as well which can lead to unpleasantries. This is a problem 
>>>> since any user inside any of the containers can potentially exhaust 
>>>> the instance/watches limit which in turn might prevent certain 
>>>> services from other containers from starting.
>>>
>>> On a high level this is a bit problematic as it appears to escape the
>>> current limits and allows anyone creating a user namespace to have their
>>> own fresh set of limits.  Given that anyone should be able to create a
>>> user namespace whenever they feel like escaping limits is a problem.
>>> That however is solvable.
>>>
>>> A practical question.  What kind of limits are we looking at here?
>>>
>>> Are these loose limits for detecting buggy programs that have gone
>>> off their rails?
>>>
>>> Are these tight limits to ensure multitasking is possible?
>>
>> The original motivation for these limits is to limit resource usage.  There
>> is in-kernel data structure that is associated with each notification mark
>> you create and we don't want users to be able to DoS the system by creating
>> too many of them. Thus we limit number of notification marks for each user.
>> There is also a limit on the number of notification instances - those are
>> naturally limited by the number of open file descriptors but admin may want
>> to limit them more...
>>
>> So cgroups would be probably the best fit for this but I'm not sure whether
>> it is not an overkill...
> 
> There is some level of kernel memory accounting in the memory cgroup.
> 
> That said my experience with cgroups is that while they are good for
> some things the semantics that derive from the userspace API are
> problematic.
> 
> In the cgroup model objects in the kernel don't belong to a cgroup they
> belong to a task/process.  Those processes belong to a cgroup.
> Processes under control of a sufficiently privileged parent are allowed
> to switch cgroups.  This causes implementation challenges and semantic
> mismatch in a world where things are typically considered to have an
> owner.
> 
> Right now fs_notify groups (upon which all of the rest of the inotify
> accounting is built upon) belong to a user.  So there is a semantic
> mismatch with cgroups right out of the gate.
> 
> Given that cgroups have not chosen to account for individual kernel
> objects or give that level of control, I think it reasonable to look to
> other possible solutions.  Assuming the overhead can be kept under
> control.
> 
> The implementation of a hierarchical counter in mm/page_counter.c
> strongly suggests to me that the overhead can be kept under control.
> 
> And yes.  I am thinking of the problem space where you have a limit
> based on the problem domain where if an application consumes more than
> the limit, the application is likely bonkers.  Which does prevent a DOS
> situation in kernel memory.  But is different from the problem I have
> seen cgroups solve.
> 
> The problem I have seen cgroups solve looks like.  Hmm.  I have 8GB of
> ram.  I have 3 containers.  Container A can have 4GB, Container B can
> have 1GB and container C can have 3GB.  Then I know one container won't
> push the other containers into swap.
> 
> Perhaps that would tend to be a top down/vs a bottom up approach to
> coming up with limits.  As DOS preventions limits like the inotify ones
> are generally written from the perspective of if you have more than X
> you are crazy.  While cgroup limits tend to be thought about top down
> from a total system management point of view.
> 
> So I think there is definitely something to look at.
> 
> 
> All of that said there is definitely a practical question that needs to
> be asked.  Nikolay how did you get into this situation?  A typical user
> namespace configuration will set up uid and gid maps with the help of a
> privileged program and not map the uid of the user who created the user
> namespace.  Thus avoiding exhausting the limits

Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

2016-06-02 Thread Nikolay Borisov


On 06/01/2016 07:00 PM, Eric W. Biederman wrote:
> Cc'd the containers list.
> 
> 
> Nikolay Borisov <ker...@kyup.com> writes:
> 
>> Currently the inotify instances/watches are being accounted in the 
>> user_struct structure. This means that in setups where multiple 
>> users in unprivileged containers map to the same underlying 
>> real user (e.g. user_struct) the inotify limits are going to be 
>> shared as well which can lead to unpleasantries. This is a problem 
>> since any user inside any of the containers can potentially exhaust 
>> the instance/watches limit which in turn might prevent certain 
>> services from other containers from starting.
> 
> On a high level this is a bit problematic as it appears to escape the
> current limits and allows anyone creating a user namespace to have their
> own fresh set of limits.  Given that anyone should be able to create a
> user namespace whenever they feel like escaping limits is a problem.
> That however is solvable.

This is indeed a problem and the presented solution is rather dumb in
that regard. I'm happy to work with you on suggestions so that I arrive
at a solution that is upstreamable.

> 
> A practical question.  What kind of limits are we looking at here?
> 
> Are these loose limits for detecting buggy programs that have gone
> off their rails?

Loose limits.

> 
> Are these tight limits to ensure multitasking is possible?
> 
> 
> 
> For tight limits where something is actively controlling the limits you
> probably want a cgroup base solution.
> 
> For loose limits that are the kind where you set a good default and
> forget about I think a user namespace based solution is reasonable.

That's exactly the use case I had in mind.

> 
>> The solution I propose is rather simple, instead of accounting the 
>> watches/instances per user_struct, start accounting them in a hashtable, 
>> where the index used is the hashed pointer of the userns. This way
>> the administrator needn't set the inotify limits very high and also 
>> the risk of one container breaching the limits and affecting every 
>> other container is alleviated.
> 
> I don't think this is the right data structure for a user namespace
> based solution, at least in part because it does not account for users
> escaping.

Admittedly this is a naive solution. What are your ideas on something
which achieves my initial aim of having limits per user, yet does not
allow users to just create another namespace and escape them? The
current namespace code has a hard-coded limit of 32 for nesting user
namespaces, so currently in the worst case one can escape the limits up
to 32 * current_limits.
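
As a quick illustration of that worst case, assuming a flat per-namespace
limit and a chain of nested namespaces (the helper name and the example
values are purely illustrative):

```c
#include <assert.h>

#define MAX_USERNS_NESTING 32UL  /* nesting limit mentioned above */

/*
 * With flat per-namespace accounting, a user who creates @depth nested
 * namespaces can consume the per-namespace limit once per namespace.
 */
static unsigned long worst_case_usage(unsigned long per_ns_limit,
                                      unsigned long depth)
{
    assert(depth <= MAX_USERNS_NESTING);
    return per_ns_limit * depth;
}
```

For example, with a per-namespace watch limit of 8192, a full chain of 32
namespaces could hold 262144 watches in total.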



[PATCH 2/4] inotify: Convert inotify limits to be accounted per-realuser/per-namespace

2016-06-01 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/notify/inotify/inotify_fsnotify.c | 14 +-
 fs/notify/inotify/inotify_user.c | 23 +++
 include/linux/sched.h|  2 --
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/notify/inotify/inotify_fsnotify.c 
b/fs/notify/inotify/inotify_fsnotify.c
index 2cd900c2c737..efaeec3f2e26 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -166,7 +166,19 @@ static void inotify_free_group_priv(struct fsnotify_group 
*group)
idr_for_each(&group->inotify_data.idr, idr_callback, group);
idr_destroy(&group->inotify_data.idr);
if (group->inotify_data.user) {
-   atomic_dec(&group->inotify_data.user->inotify_devs);
+   struct user_struct *user = group->inotify_data.user;
+   void *key = group->inotify_data.userns_ptr;
+   struct inotify_state *state;
+
+   spin_lock(&user->inotify_lock);
+   state = __find_inotify_state(user, key);
+   if (--state->inotify_devs == 0)
+   hash_del(&state->node);
+   spin_unlock(&user->inotify_lock);
+
+   if (state->inotify_devs == 0)
+   kfree(state);
+
free_uid(group->inotify_data.user);
}
 }
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index ae7ec2414252..e7cc4eaa838f 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -94,7 +94,7 @@ static int inotify_init_state(struct user_struct *user,
int ret = 0;
 
spin_lock(&user->inotify_lock);
-   state =  __find_inotify_count(user, key);
+   state =  __find_inotify_state(user, key);
 
if (!state) {
spin_unlock(&user->inotify_lock);
@@ -536,7 +536,8 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark 
*fsn_mark,
/* remove this mark from the idr */
inotify_remove_from_idr(group, i_mark);
 
-   atomic_dec(&group->inotify_data.user->inotify_watches);
+   inotify_dec_watches(group->inotify_data.user,
+   group->inotify_data.userns_ptr);
 }
 
 /* ding dong the mark is dead */
@@ -609,6 +610,8 @@ static int inotify_new_watch(struct fsnotify_group *group,
int ret;
struct idr *idr = &group->inotify_data.idr;
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
+   struct user_struct *user = group->inotify_data.user;
+   void *key = group->inotify_data.userns_ptr;
 
mask = inotify_arg_to_mask(arg);
 
@@ -621,7 +624,7 @@ static int inotify_new_watch(struct fsnotify_group *group,
tmp_i_mark->wd = -1;
 
ret = -ENOSPC;
-   if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
+   if (inotify_read_watches(user, key) >= inotify_max_user_watches)
goto out_err;
 
ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
@@ -638,7 +641,7 @@ static int inotify_new_watch(struct fsnotify_group *group,
}
 
/* increment the number of watches the user has */
-   atomic_inc(&group->inotify_data.user->inotify_watches);
+   inotify_inc_watches(user, key);
 
/* return the watch descriptor for this new mark */
ret = tmp_i_mark->wd;
@@ -669,6 +672,9 @@ static struct fsnotify_group *inotify_new_group(unsigned 
int max_events)
 {
struct fsnotify_group *group;
struct inotify_event_info *oevent;
+   struct user_struct *user = get_current_user();
+   void *key = current_user_ns();
+   int ret;
 
group = fsnotify_alloc_group(&inotify_fsnotify_ops);
if (IS_ERR(group))
@@ -689,12 +695,13 @@ static struct fsnotify_group *inotify_new_group(unsigned 
int max_events)
 
spin_lock_init(&group->inotify_data.idr_lock);
idr_init(&group->inotify_data.idr);
-   group->inotify_data.user = get_current_user();
+   group->inotify_data.user = user;
+   group->inotify_data.userns_ptr = key;
 
-   if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
-   inotify_max_user_instances) {
+   ret = inotify_init_state(user, key);
+   if (ret < 0) {
fsnotify_destroy_group(group);
-   return ERR_PTR(-EMFILE);
+   return ERR_PTR(ret);
}
 
return group;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c55d951d0bb..8f589b32ed15 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -842,8 +842,6 @@ struct user_struct {
 #ifdef CONFIG_INOTIFY_USER
spinlock_t inotify_lock;
DECLARE_HASHTABLE(inotify_tbl, 6);
-   atomic_t inotify_watches; /* How many inotify watches does this user 
have? */
-   atomic_t inotify_devs;  /* How many inotify devs does this user have 
opened? */
 #endif
 #ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
-- 
2.5.0



[RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

2016-06-01 Thread Nikolay Borisov
Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. user_struct) the inotify limits are going to be
shared as well, which can lead to unpleasantries. This is a problem
since any user inside any of the containers can potentially exhaust
the instance/watches limit, which in turn might prevent certain
services in other containers from starting.

The solution I propose is rather simple: instead of accounting the
watches/instances per user_struct, start accounting them in a hashtable,
where the index used is the hashed pointer of the userns. This way
the administrator needn't set the inotify limits very high, and the
risk of one container breaching the limits and affecting every
other container is alleviated.
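
The accounting scheme can be sketched in userspace as follows. This is a
simplified, single-threaded illustration and not the code in the patches:
it uses a hand-rolled chained table instead of hashtable.h, no locking,
and all names (user_acct, ns_state, add_watch and friends) are
hypothetical. Only the 6-bit table size mirrors the series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL_BITS 6
#define TBL_SIZE (1u << TBL_BITS)

/* Per-(user, namespace) counters, chained per bucket. */
struct ns_state {
    struct ns_state *next;
    const void *key;            /* stands in for the userns pointer */
    unsigned int watches;
};

/* Stands in for the table embedded in user_struct. */
struct user_acct {
    struct ns_state *buckets[TBL_SIZE];
};

/* Toy pointer hash; the kernel helpers use a different hash function. */
static unsigned int bucket_of(const void *key)
{
    uintptr_t v = (uintptr_t)key;

    return (unsigned int)((v >> 4) ^ (v >> 12)) & (TBL_SIZE - 1);
}

static struct ns_state *find_state(struct user_acct *u, const void *key)
{
    struct ns_state *s;

    for (s = u->buckets[bucket_of(key)]; s; s = s->next)
        if (s->key == key)
            return s;
    return NULL;
}

/* Account one watch to namespace @ns; returns 0 on success, -1 on failure. */
static int add_watch(struct user_acct *u, const void *ns,
                     unsigned int max_watches)
{
    struct ns_state *s;
    unsigned int b;

    assert(u != NULL);
    s = find_state(u, ns);
    if (!s) {
        s = calloc(1, sizeof(*s));
        if (!s)
            return -1;
        s->key = ns;
        b = bucket_of(ns);
        s->next = u->buckets[b];
        u->buckets[b] = s;
    }
    if (s->watches >= max_watches)
        return -1;              /* limit reached for this namespace only */
    s->watches++;
    return 0;
}

/* Free every state entry of @u. */
static void acct_destroy(struct user_acct *u)
{
    unsigned int b;

    for (b = 0; b < TBL_SIZE; b++) {
        while (u->buckets[b]) {
            struct ns_state *s = u->buckets[b];

            u->buckets[b] = s->next;
            free(s);
        }
    }
}
```

Two namespaces mapping to the same real user then hit their limits
independently: exhausting the counter for one key leaves the other
untouched.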

I have performed functional testing to validate that limits in 
different namespaces are indeed separate, as well as running 
multiple inotify stressers from stress-ng to ensure I haven't 
introduced any race conditions. 

This series is based on 4.7-rc1 (and applies cleanly on 4.4.10) and
consists of the following 4 patches:

Patch 1: This introduces the necessary structure and code changes. Including
hashtable.h in sched.h causes some warnings in files which define their own
HASH_SIZE macro; patch 3 fixes this with a mechanical rename.

Patch 2: This patch flips the inotify code to use the new infrastructure.

Patch 3: This is a simple mechanical rename of definitions conflicting with
hashtable.h's HASH_SIZE macro. I'd welcome comments on how I should go
about this.

Patch 4: This is a rather self-contained patch and can go in irrespective of
whether the series is accepted. It's needed so that building the kernel
with !CONFIG_INOTIFY_USER doesn't fail (with patch 1 applied).
However, fdinfo.c doesn't really need inotify.h.

Nikolay Borisov (4):
  inotify: Add infrastructure to account inotify limits per-namespace
  inotify: Convert inotify limits to be accounted
per-realuser/per-namespace
  misc: Rename the HASH_SIZE macro
  inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER

 fs/logfs/dir.c   |  6 +--
 fs/notify/fdinfo.c   |  3 ++
 fs/notify/inotify/inotify.h  | 68 
 fs/notify/inotify/inotify_fsnotify.c | 14 ++-
 fs/notify/inotify/inotify_user.c | 57 ++
 include/linux/fsnotify_backend.h |  1 +
 include/linux/sched.h|  5 ++-
 kernel/user.c| 13 ++
 net/ipv6/ip6_gre.c   |  8 ++--
 net/ipv6/ip6_tunnel.c| 10 ++---
 net/ipv6/ip6_vti.c   | 10 ++---
 net/ipv6/sit.c   | 10 ++---
 security/keys/encrypted-keys/encrypted.c | 32 +++
 13 files changed, 189 insertions(+), 48 deletions(-)

-- 
2.5.0



[PATCH 3/4] misc: Rename the HASH_SIZE macro

2016-06-01 Thread Nikolay Borisov
This change is required since the inotify-per-namespace code added
hashtable.h to the include list of sched.h. This in turn causes
compiler warnings since HASH_SIZE is then defined in multiple
locations.
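
A minimal standalone sketch of the clash and of the rename, for
illustration only. The function-like HASH_SIZE below imitates the style
of the hashtable.h macro, and struct tunnel_table and tunnel_buckets()
are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Function-like macro in the style of hashtable.h: number of buckets. */
#define HASH_SIZE(name) (sizeof(name) / sizeof((name)[0]))

/*
 * A second, object-like "#define HASH_SIZE (1 << 5)" in the same
 * translation unit would be an incompatible redefinition, so the local
 * constant gets its own prefix instead.
 */
#define IP6_HASH_SIZE_SHIFT 5
#define IP6_HASH_SIZE (1 << IP6_HASH_SIZE_SHIFT)

struct tunnel_table {
    void *tnls_r_l[IP6_HASH_SIZE];
};

/* Both macros now coexist: one sizes the array, the other measures it. */
static size_t tunnel_buckets(const struct tunnel_table *t)
{
    assert(t != NULL);
    return HASH_SIZE(t->tnls_r_l);
}
```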

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/logfs/dir.c   |  6 +++---
 net/ipv6/ip6_gre.c   |  8 
 net/ipv6/ip6_tunnel.c| 10 +-
 net/ipv6/ip6_vti.c   | 10 +-
 net/ipv6/sit.c   | 10 +-
 security/keys/encrypted-keys/encrypted.c | 32 
 6 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 2d5336bd4efd..bcd754d216bd 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -95,7 +95,7 @@ static int beyond_eof(struct inode *inode, loff_t bix)
  * of each character and pick a prime nearby, preferably a bit-sparse
  * one.
  */
-static u32 hash_32(const char *s, int len, u32 seed)
+static u32 logfs_hash_32(const char *s, int len, u32 seed)
 {
u32 hash = seed;
int i;
@@ -159,7 +159,7 @@ static struct page *logfs_get_dd_page(struct inode *dir, 
struct dentry *dentry)
struct qstr *name = &dentry->d_name;
struct page *page;
struct logfs_disk_dentry *dd;
-   u32 hash = hash_32(name->name, name->len, 0);
+   u32 hash = logfs_hash_32(name->name, name->len, 0);
pgoff_t index;
int round;
 
@@ -370,7 +370,7 @@ static int logfs_write_dir(struct inode *dir, struct dentry 
*dentry,
 {
struct page *page;
struct logfs_disk_dentry *dd;
-   u32 hash = hash_32(dentry->d_name.name, dentry->d_name.len, 0);
+   u32 hash = logfs_hash_32(dentry->d_name.name, dentry->d_name.len, 0);
pgoff_t index;
int round, err;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index af503f518278..b73b4dc5c7ad 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -62,11 +62,11 @@ module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
 #define HASH_SIZE_SHIFT  5
-#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
+#define IP6G_HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
 static int ip6gre_net_id __read_mostly;
 struct ip6gre_net {
-   struct ip6_tnl __rcu *tunnels[4][HASH_SIZE];
+   struct ip6_tnl __rcu *tunnels[4][IP6G_HASH_SIZE];
 
struct net_device *fb_tunnel_dev;
 };
@@ -96,7 +96,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int 
set_mtu);
will match fallback tunnel.
  */
 
-#define HASH_KEY(key) (((__force u32)key^((__force u32)key>>4))&(HASH_SIZE - 
1))
+#define HASH_KEY(key) (((__force u32)key^((__force 
u32)key>>4))&(IP6G_HASH_SIZE - 1))
 static u32 HASH_ADDR(const struct in6_addr *addr)
 {
u32 hash = ipv6_addr_hash(addr);
@@ -1086,7 +1086,7 @@ static void ip6gre_destroy_tunnels(struct net *net, 
struct list_head *head)
 
for (prio = 0; prio < 4; prio++) {
int h;
-   for (h = 0; h < HASH_SIZE; h++) {
+   for (h = 0; h < IP6G_HASH_SIZE; h++) {
struct ip6_tnl *t;
 
t = rtnl_dereference(ign->tunnels[prio][h]);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 7b0481e3738f..50b57a435f05 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -64,8 +64,8 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS_RTNL_LINK("ip6tnl");
 MODULE_ALIAS_NETDEV("ip6tnl0");
 
-#define HASH_SIZE_SHIFT  5
-#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
+#define IP6_HASH_SIZE_SHIFT  5
+#define IP6_HASH_SIZE (1 << IP6_HASH_SIZE_SHIFT)
 
 static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
@@ -75,7 +75,7 @@ static u32 HASH(const struct in6_addr *addr1, const struct 
in6_addr *addr2)
 {
u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
 
-   return hash_32(hash, HASH_SIZE_SHIFT);
+   return hash_32(hash, IP6_HASH_SIZE_SHIFT);
 }
 
 static int ip6_tnl_dev_init(struct net_device *dev);
@@ -87,7 +87,7 @@ struct ip6_tnl_net {
/* the IPv6 tunnel fallback device */
struct net_device *fb_tnl_dev;
/* lists for storing tunnels in use */
-   struct ip6_tnl __rcu *tnls_r_l[HASH_SIZE];
+   struct ip6_tnl __rcu *tnls_r_l[IP6_HASH_SIZE];
struct ip6_tnl __rcu *tnls_wc[1];
struct ip6_tnl __rcu **tnls[2];
 };
@@ -2031,7 +2031,7 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct net 
*net)
if (dev->rtnl_link_ops == &ip6_link_ops)
unregister_netdevice_queue(dev, &list);
 
-   for (h = 0; h < HASH_SIZE; h++) {
+   for (h = 0; h < IP6_HASH_SIZE; h++) {
t = rtnl_dereference(ip6n->tnls_r_l[h]);
while (t) {
/* If dev is in the same netns, it has

[PATCH 4/4] inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER

2016-06-01 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/notify/fdinfo.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index fd98e5100cab..62068f89d144 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -13,7 +13,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_INOTIFY_USER
 #include "inotify/inotify.h"
+#endif
+
 #include "../fs/mount.h"
 
 #if defined(CONFIG_PROC_FS)
-- 
2.5.0



[PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace

2016-06-01 Thread Nikolay Borisov
This patch adds the necessary members to user_struct. The idea behind
the solution is really simple: use the userns pointers as keys into
a hash table which holds the inotify instances/watches counts. This
allows the limits to be accounted per userns rather than per real user,
which prevents scenarios such as a single mapped user in a container
depleting the inotify resources of all other users which map to the
exact same real user.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/notify/inotify/inotify.h  | 68 
 fs/notify/inotify/inotify_user.c | 36 +
 include/linux/fsnotify_backend.h |  1 +
 include/linux/sched.h|  3 ++
 kernel/user.c| 13 
 5 files changed, 121 insertions(+)

diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index ed855ef6f077..e069e1e4262a 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -1,6 +1,7 @@
 #include 
 #include 
 #include  /* struct kmem_cache */
+#include 
 
 struct inotify_event_info {
struct fsnotify_event fse;
@@ -15,6 +16,13 @@ struct inotify_inode_mark {
int wd;
 };
 
+struct inotify_state {
+   struct hlist_node node;
+   void *key; /* user_namespace ptr */
+   u32 inotify_watches; /* How many inotify watches does this user have? */
+   u32 inotify_devs;  /* How many inotify devs does this user have opened? 
*/
+};
+
 static inline struct inotify_event_info *INOTIFY_E(struct fsnotify_event *fse)
 {
return container_of(fse, struct inotify_event_info, fse);
@@ -30,3 +38,63 @@ extern int inotify_handle_event(struct fsnotify_group *group,
const unsigned char *file_name, u32 cookie);
 
 extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+/* Helpers for manipulating various inotify state, stored in user_struct */
+static inline struct inotify_state *__find_inotify_state(struct user_struct 
*user,
+ void *key)
+{
+   struct inotify_state *state;
+
+   hash_for_each_possible(user->inotify_tbl, state, node, (unsigned 
long)key)
+   if (state->key == key)
+   return state;
+
+   return NULL;
+}
+
+static inline void inotify_inc_watches(struct user_struct *user, void *key)
+{
+   struct inotify_state *state;
+
+   spin_lock(&user->inotify_lock);
+   state = __find_inotify_state(user, key);
+   state->inotify_watches++;
+   spin_unlock(&user->inotify_lock);
+}
+
+
+static inline void inotify_dec_watches(struct user_struct *user, void *key)
+{
+   struct inotify_state *state;
+
+   spin_lock(&user->inotify_lock);
+   state = __find_inotify_state(user, key);
+   state->inotify_watches--;
+   spin_unlock(&user->inotify_lock);
+}
+
+static inline int inotify_read_watches(struct user_struct *user, void *key)
+{
+   struct inotify_state *state;
+   int ret;
+
+   spin_lock(&user->inotify_lock);
+   state = __find_inotify_state(user, key);
+   ret = state->inotify_watches;
+   spin_unlock(&user->inotify_lock);
+   return ret;
+}
+
+static inline unsigned long inotify_dec_return_dev(struct user_struct *user,
+  void *key)
+{
+   struct inotify_state *state;
+   unsigned long ret;
+
+   spin_lock(&user->inotify_lock);
+   state = __find_inotify_state(user, key);
+   ret = --state->inotify_devs;
+   spin_unlock(&user->inotify_lock);
+
+   return ret;
+}
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b8d08d0d0a4d..ae7ec2414252 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -86,6 +86,42 @@ struct ctl_table inotify_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+
+static int inotify_init_state(struct user_struct *user,
+ void *key)
+{
+   struct inotify_state *state;
+   int ret = 0;
+
+   spin_lock(&user->inotify_lock);
+   state =  __find_inotify_count(user, key);
+
+   if (!state) {
+   spin_unlock(&user->inotify_lock);
+   state = kzalloc(sizeof(struct inotify_state), GFP_KERNEL);
+   if (!state)
+   return -ENOMEM;
+
+   state->key = current_user_ns();
+   state->inotify_watches = 0;
+   state->inotify_devs = 1;
+
+   spin_lock(&user->inotify_lock);
+   hash_add(user->inotify_tbl, &state->node, (unsigned long)key);
+
+   goto out;
+   } else {
+
+   if (++state->inotify_devs > inotify_max_user_instances) {
+   ret = -EMFILE;
+   goto out;
+   }
+   }
+out:
+   spin_unlock(&user->inotify_lock);
+   return ret;
+}
+
 static inline __u32 inotify_arg_to_mas

Re: ipv6 not bringing up due to qdisc_tx_is_noop failing

2016-03-19 Thread Nikolay Borisov
On Wed, Mar 16, 2016 at 7:07 PM, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> Hello,

Hi,

>
> On 16.03.2016 16:29, Nikolay Borisov wrote:
>>
>> I have stack traces which do show this sequence of events, so my
>> questions now are:
>>
>> 1. What's the difference between netdev_queue->qdisc and
>> netdev_queue->qdisc_sleeping? Git blame indicates those members have
>> existed since before the git history started.
>
> qdisc_sleeping is the qdisc you configure before the device is brought up.
> It should transition during carrier up to the normal qdisc.
>
>> 2. Shouldn't the netdev_queue->qdisc also be updated during
>> attach_one_default_qdisc?
>
> Yes, do you have carrier up on your card?

Actually no, the interface indeed shows no carrier, yet ibping to other
hosts on the infiniband network works and ibstats shows the link as
being in the up state. Do you have any ideas how to debug this further?
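
For anyone following along, the sequence can be modelled in a few lines
of userspace C. This is a simplified sketch, not kernel code: it assumes,
per the description above, that the default qdisc is parked in
qdisc_sleeping and only promoted to qdisc once carrier is up. All model_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct qdisc {
    const char *name;
};

static struct qdisc noop_qdisc = { "noop" };
static struct qdisc default_qdisc = { "default" };

/* Simplified stand-in for a single netdev tx queue. */
struct txq_model {
    struct qdisc *qdisc;            /* what the tx path actually uses */
    struct qdisc *qdisc_sleeping;   /* what was configured */
};

struct dev_model {
    struct txq_model txq;
    bool carrier_ok;
};

/* Registration: both pointers start out as the noop qdisc. */
static void model_register(struct dev_model *dev)
{
    dev->txq.qdisc = &noop_qdisc;
    dev->txq.qdisc_sleeping = &noop_qdisc;
    dev->carrier_ok = false;
}

/* Activation: attach the default qdisc, but promote it only with carrier. */
static void model_activate(struct dev_model *dev)
{
    assert(dev->txq.qdisc_sleeping != NULL);

    if (dev->txq.qdisc_sleeping == &noop_qdisc)
        dev->txq.qdisc_sleeping = &default_qdisc;

    if (!dev->carrier_ok)
        return;                     /* wait for the carrier-on event */

    dev->txq.qdisc = dev->txq.qdisc_sleeping;
}

static void model_carrier_on(struct dev_model *dev)
{
    dev->carrier_ok = true;
    model_activate(dev);
}

/* Mirrors the spirit of the addrconf check: is the tx qdisc still noop? */
static bool model_addrconf_qdisc_ok(const struct dev_model *dev)
{
    return dev->txq.qdisc != &noop_qdisc;
}
```

In this model, activating a device without carrier leaves qdisc pointing
at the noop qdisc, so the addrconf-style check keeps failing until a
carrier-on event arrives, which matches the symptom I am seeing.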

>
> Is this a regression, did this work for you and stopped working with a
> specific kernel version?

I don't think this is a regression.

>
> Thanks and bye,
> Hannes
>


ipv6 not bringing up due to qdisc_tx_is_noop failing

2016-03-19 Thread Nikolay Borisov
Hello Dave,

I've been chasing a rather strange problem and I saw you were the person
that authored most of the code involved, so I'm addressing you, but will
be happy to receive assistance from anyone feeling knowledgeable enough
on the issue.

Basically I have an infiniband card on which I want to run ipv6. To this
effect I load module ib_qib (the infiniband card is a qlogic QLE7342) and
then I load module ib_ipoib, and I get:
IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready

even though for example ibping and all that works. This happens because
the check if (!addrconf_qdisc_ok(dev)) in addrconf_notify fails, since
the dev's txq ->qdisc points to noop_qdisc.

Now, here is what happens :

1. When the ib_ipoib module is loaded
register_netdevice->dev_init_scheduler is called which sets the device's
qdisc to noop_qdisc

2. Then via a netlink message the device is activated, which calls
into dev_activate->attach_one_default_qdisc, which attaches the newly
created default qdisc to the netdev_queue's qdisc_sleeping member

3. Then addrconf_notify is invoked, which fails the check since the
netdev_queue's qdisc member was never updated (just qdisc_sleeping)
to anything different from the initial state (which is noop_qdisc).

I have stack traces which do show this sequence of events, so my
questions now are:

1. What's the difference between netdev_queue->qdisc and
netdev_queue->qdisc_sleeping? Git blame indicates those members have
existed since before the git history started.

2. Shouldn't the netdev_queue->qdisc also be updated during
attach_one_default_qdisc?


Regards,
Nikolay


Re: [PATCH 0/4] Namespacify inet_peer_* sysctl knobs

2016-02-18 Thread Nikolay Borisov


On 02/17/2016 09:15 PM, Eric W. Biederman wrote:
> Nikolay Borisov <ker...@kyup.com> writes:
> 
>> This series make the inet_peer ttl sysctls to be namespace aware. 
>>
>> Patch 1 adds a namespace association to the inet_peer_base struct, 
>> which in turn is used to make the sysctls namespace aware. The 
>> rest of the patches are straightforward.
> 
> At a quick skim I am not certain I am comfortable with this change.
> 
> The issue is that these are not packet parameters you are tuning but
> lifetimes for data structures.

Right, I thought the inet peer expiration might have repercussions on the
way the networking stack works. But apparently that's not the case.
> 
> Generally there are challenges making this kind of thing per namespace
> because resource control can lead to DOS attack from one namespace
> being able to arbitrarly control it's own resource consumption.
> 
> Is this something that is actually worth making per namespace?

I guess the series can be dropped if it's deemed unnecessary.


> 
> Eric
> 
>> Nikolay Borisov (4):
>>   inetpeer: Add net namespace assosication in inet_peer_base
>>   inetpeer: Namespacify inet_peer_maxttl sysctl knob
>>   inetpeer: Namespacify inet_peer_minttl sysctl knob
>>   inetpeer: Namespacify inet_peer_threshold sysctl knob
>>
>>  include/net/inetpeer.h |  1 +
>>  include/net/ip.h   |  5 -
>>  include/net/netns/ipv4.h   |  4 
>>  net/ipv4/inetpeer.c| 15 ++-
>>  net/ipv4/route.c   |  1 +
>>  net/ipv4/sysctl_net_ipv4.c | 47 
>> --
>>  6 files changed, 37 insertions(+), 36 deletions(-)


[PATCH 3/4] inetpeer: Namespacify inet_peer_minttl sysctl knob

2016-02-17 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/ip.h   |  1 -
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/inetpeer.c|  2 +-
 net/ipv4/sysctl_net_ipv4.c | 15 ---
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index c33d53176d3c..11a20a3e60c6 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -242,7 +242,6 @@ static inline int inet_is_local_reserved_port(struct net 
*net, int port)
 
 /* From inetpeer.c */
 extern int inet_peer_threshold;
-extern int inet_peer_minttl;
 
 void ipfrag_init(void);
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index b0623c4e2f0a..1bc51c22ef42 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -89,6 +89,7 @@ struct netns_ipv4 {
int sysctl_ip_early_demux;
 
int sysctl_inet_peer_maxttl;
+   int sysctl_inet_peer_minttl;
 
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index a9245ada56c2..97e834eae90c 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(inet_peer_base_init);
 /* Exported for sysctl_net_ipv4.  */
 int inet_peer_threshold __read_mostly = 65536 + 128;   /* start to throw 
entries more
 * aggressively at this stage */
-int inet_peer_minttl __read_mostly = 120 * HZ; /* TTL under high load: 120 sec 
*/
 
 static void inetpeer_gc_worker(struct work_struct *work)
 {
@@ -369,6 +368,7 @@ static int inet_peer_gc(struct inet_peer_base *base,
 {
struct inet_peer *p, *gchead = NULL;
int inet_peer_maxttl = base->net->ipv4.sysctl_inet_peer_maxttl;
+   int inet_peer_minttl = base->net->ipv4.sysctl_inet_peer_minttl;
__u32 delta, ttl;
int cnt = 0;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2aaa049cbf9d..9b55ca56b99f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -352,13 +352,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "inet_peer_minttl",
-   .data   = &inet_peer_minttl,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_jiffies,
-   },
-   {
.procname   = "tcp_fack",
.data   = &sysctl_tcp_fack,
.maxlen = sizeof(int),
@@ -737,6 +730,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec_jiffies,
},
{
+   .procname   = "inet_peer_minttl",
+   .data   = &init_net.ipv4.sysctl_inet_peer_minttl,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_jiffies,
+   },
+   {
.procname   = "ip_early_demux",
.data   = &init_net.ipv4.sysctl_ip_early_demux,
.maxlen = sizeof(int),
@@ -992,6 +992,7 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
net->ipv4.sysctl_ip_dynaddr = 0;
net->ipv4.sysctl_ip_early_demux = 1;
net->ipv4.sysctl_inet_peer_maxttl = 10 * 60 * HZ;  /* usual time to 
live: 10 min */
+   net->ipv4.sysctl_inet_peer_minttl = 120 * HZ;  /* TTL under high 
load: 120 sec */
 
return 0;
 
-- 
2.5.0



[PATCH 4/4] inetpeer: Namespacify inet_peer_threshold sysctl knob

2016-02-17 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/ip.h   |  3 ---
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/inetpeer.c| 11 ---
 net/ipv4/sysctl_net_ipv4.c | 17 +
 4 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 11a20a3e60c6..b9832da7e636 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -240,9 +240,6 @@ static inline int inet_is_local_reserved_port(struct net 
*net, int port)
 }
 #endif
 
-/* From inetpeer.c */
-extern int inet_peer_threshold;
-
 void ipfrag_init(void);
 
 void ip_static_sysctl_init(void);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 1bc51c22ef42..c0d85ba9e5f7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -90,6 +90,7 @@ struct netns_ipv4 {
 
int sysctl_inet_peer_maxttl;
int sysctl_inet_peer_minttl;
+   int sysctl_inet_peer_threshold;
 
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 97e834eae90c..6a51d3abd797 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -78,10 +78,6 @@ EXPORT_SYMBOL_GPL(inet_peer_base_init);
 
 #define PEER_MAXDEPTH 40 /* sufficient for about 2^27 nodes */
 
-/* Exported for sysctl_net_ipv4.  */
-int inet_peer_threshold __read_mostly = 65536 + 128;   /* start to throw 
entries more
-* aggressively at this stage */
-
 static void inetpeer_gc_worker(struct work_struct *work)
 {
struct inet_peer *p, *n, *c;
@@ -141,11 +137,11 @@ void __init inet_initpeers(void)
 * myself.  --SAW
 */
if (si.totalram <= (32768*1024)/PAGE_SIZE)
-   inet_peer_threshold >>= 1; /* max pool size about 1MB on IA32 */
+   init_net.ipv4.sysctl_inet_peer_threshold >>= 1; /* max pool 
size about 1MB on IA32 */
if (si.totalram <= (16384*1024)/PAGE_SIZE)
-   inet_peer_threshold >>= 1; /* about 512KB */
+   init_net.ipv4.sysctl_inet_peer_threshold >>= 1; /* about 512KB 
*/
if (si.totalram <= (8192*1024)/PAGE_SIZE)
-   inet_peer_threshold >>= 2; /* about 128KB */
+   init_net.ipv4.sysctl_inet_peer_threshold >>= 2; /* about 128KB 
*/
 
peer_cachep = kmem_cache_create("inet_peer_cache",
sizeof(struct inet_peer),
@@ -369,6 +365,7 @@ static int inet_peer_gc(struct inet_peer_base *base,
struct inet_peer *p, *gchead = NULL;
int inet_peer_maxttl = base->net->ipv4.sysctl_inet_peer_maxttl;
int inet_peer_minttl = base->net->ipv4.sysctl_inet_peer_minttl;
+   int inet_peer_threshold = base->net->ipv4.sysctl_inet_peer_threshold;
__u32 delta, ttl;
int cnt = 0;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9b55ca56b99f..36d206209879 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -345,13 +345,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "inet_peer_threshold",
-   .data   = &inet_peer_threshold,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fack",
.data   = &sysctl_tcp_fack,
.maxlen = sizeof(int),
@@ -737,6 +730,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec_jiffies,
},
{
+   .procname   = "inet_peer_threshold",
+   .data   = &init_net.ipv4.sysctl_inet_peer_threshold,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "ip_early_demux",
.data   = &init_net.ipv4.sysctl_ip_early_demux,
.maxlen = sizeof(int),
@@ -993,7 +993,8 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
net->ipv4.sysctl_ip_early_demux = 1;
net->ipv4.sysctl_inet_peer_maxttl = 10 * 60 * HZ;  /* usual time to 
live: 10 min */
net->ipv4.sysctl_inet_peer_minttl = 120 * HZ;  /* TTL under high 
load: 120 sec */
-
+   net->ipv4.sysctl_inet_peer_threshold = 65536 + 128;/* start to throw 
entries more
+   * aggressively at 
this stage */
return 0;
 
 err_ports:
-- 
2.5.0



[PATCH 0/4] Namespacify inet_peer_* sysctl knobs

2016-02-17 Thread Nikolay Borisov
This series makes the inet_peer_* sysctls namespace aware.

Patch 1 adds a namespace association to the inet_peer_base struct, 
which in turn is used to make the sysctls namespace aware. The 
rest of the patches are straightforward. 
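
As a rough userspace illustration of the idea (not the kernel code itself),
the sketch below models how a back-pointer from the peer base to its
namespace lets each namespace carry its own copy of a knob. The struct
layouts are heavily trimmed, and MODEL_HZ and all model_* helpers are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HZ 100 /* assumed tick rate, for illustration only */

/* Trimmed stand-ins for the kernel structures touched by this series. */
struct netns_ipv4 {
	int sysctl_inet_peer_maxttl;
	int sysctl_inet_peer_minttl;
	int sysctl_inet_peer_threshold;
};

struct net {
	struct netns_ipv4 ipv4;
};

struct inet_peer_base {
	int total;
	struct net *net; /* back-pointer added by patch 1 */
};

/* Models the per-namespace defaults assigned in ipv4_sysctl_init_net(). */
void model_sysctl_init_net(struct net *net)
{
	net->ipv4.sysctl_inet_peer_maxttl = 10 * 60 * MODEL_HZ;
	net->ipv4.sysctl_inet_peer_minttl = 120 * MODEL_HZ;
	net->ipv4.sysctl_inet_peer_threshold = 65536 + 128;
}

/* Models ipv4_inetpeer_init(): tie the peer base to its namespace. */
void model_inetpeer_init(struct net *net, struct inet_peer_base *bp)
{
	bp->total = 0;
	bp->net = net;
}

/* Hypothetical helper: read the knob through base->net, as inet_peer_gc() does. */
int model_peer_maxttl(const struct inet_peer_base *base)
{
	return base->net->ipv4.sysctl_inet_peer_maxttl;
}
```

With two modelled namespaces, changing the value in one leaves the other
untouched, which is the behaviour the series is after.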

Nikolay Borisov (4):
  inetpeer: Add net namespace assosication in inet_peer_base
  inetpeer: Namespacify inet_peer_maxttl sysctl knob
  inetpeer: Namespacify inet_peer_minttl sysctl knob
  inetpeer: Namespacify inet_peer_threshold sysctl knob

 include/net/inetpeer.h |  1 +
 include/net/ip.h   |  5 -
 include/net/netns/ipv4.h   |  4 
 net/ipv4/inetpeer.c| 15 ++-
 net/ipv4/route.c   |  1 +
 net/ipv4/sysctl_net_ipv4.c | 47 --
 6 files changed, 37 insertions(+), 36 deletions(-)

-- 
2.5.0



[PATCH 2/4] inetpeer: Namespacify inet_peer_maxttl sysctl knob

2016-02-17 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/ip.h   |  1 -
 include/net/netns/ipv4.h   |  2 ++
 net/ipv4/inetpeer.c|  2 +-
 net/ipv4/sysctl_net_ipv4.c | 15 ---
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index cbb134b2f0e4..c33d53176d3c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -243,7 +243,6 @@ static inline int inet_is_local_reserved_port(struct net 
*net, int port)
 /* From inetpeer.c */
 extern int inet_peer_threshold;
 extern int inet_peer_minttl;
-extern int inet_peer_maxttl;
 
 void ipfrag_init(void);
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a69cde3ce460..b0623c4e2f0a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -88,6 +88,8 @@ struct netns_ipv4 {
int sysctl_ip_dynaddr;
int sysctl_ip_early_demux;
 
+   int sysctl_inet_peer_maxttl;
+
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
 #ifdef CONFIG_NET_L3_MASTER_DEV
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 86fa45809540..a9245ada56c2 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -82,7 +82,6 @@ EXPORT_SYMBOL_GPL(inet_peer_base_init);
 int inet_peer_threshold __read_mostly = 65536 + 128;   /* start to throw 
entries more
 * aggressively at this stage */
 int inet_peer_minttl __read_mostly = 120 * HZ; /* TTL under high load: 120 sec 
*/
-int inet_peer_maxttl __read_mostly = 10 * 60 * HZ; /* usual time to live: 
10 min */
 
 static void inetpeer_gc_worker(struct work_struct *work)
 {
@@ -369,6 +368,7 @@ static int inet_peer_gc(struct inet_peer_base *base,
struct inet_peer __rcu ***stackptr)
 {
struct inet_peer *p, *gchead = NULL;
+   int inet_peer_maxttl = base->net->ipv4.sysctl_inet_peer_maxttl;
__u32 delta, ttl;
int cnt = 0;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1e1fe6086dd9..2aaa049cbf9d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -359,13 +359,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec_jiffies,
},
{
-   .procname   = "inet_peer_maxttl",
-   .data   = &inet_peer_maxttl,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_jiffies,
-   },
-   {
.procname   = "tcp_fack",
.data   = &sysctl_tcp_fack,
.maxlen = sizeof(int),
@@ -737,6 +730,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "inet_peer_maxttl",
+   .data   = &init_net.ipv4.sysctl_inet_peer_maxttl,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_jiffies,
+   },
+   {
.procname   = "ip_early_demux",
.data   = &init_net.ipv4.sysctl_ip_early_demux,
.maxlen = sizeof(int),
@@ -991,6 +991,7 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
net->ipv4.sysctl_ip_default_ttl = IPDEFTTL;
net->ipv4.sysctl_ip_dynaddr = 0;
net->ipv4.sysctl_ip_early_demux = 1;
+   net->ipv4.sysctl_inet_peer_maxttl = 10 * 60 * HZ;  /* usual time to 
live: 10 min */
 
return 0;
 
-- 
2.5.0



[PATCH 1/4] inetpeer: Add net namespace assosication in inet_peer_base

2016-02-17 Thread Nikolay Borisov
This is required so that the inet_peer_* sysctls can be
namespacified.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/inetpeer.h | 1 +
 net/ipv4/route.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 235c7811a86a..287bb54cda58 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -67,6 +67,7 @@ struct inet_peer_base {
struct inet_peer __rcu  *root;
seqlock_t   lock;
int total;
+   struct net  *net;
 };
 
 void inet_peer_base_init(struct inet_peer_base *);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 85f184e429c6..4bb45e52411c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2775,6 +2775,7 @@ static int __net_init ipv4_inetpeer_init(struct net *net)
return -ENOMEM;
inet_peer_base_init(bp);
net->ipv4.peers = bp;
+   bp->net = net;
return 0;
 }
 
-- 
2.5.0



Re: [PATCH] net: igmp: use IS_ENABLED(CONFIG_IP_MULTICAST) instead of ifdef

2016-02-16 Thread Nikolay Borisov
c->sources; psf; psf = psf->sf_next)
> + psf->sf_crcount = 0;
> + igmp_ifc_event(in_dev);
> + }
> + } else if (IS_ENABLED(CONFIG_IP_MULTICAST) && sf_setstate(pmc)) {
>   igmp_ifc_event(in_dev);
> -#endif
>   }
>   spin_unlock_bh(>lock);
>   return err;
> @@ -2711,13 +2691,10 @@ static int igmp_mc_seq_show(struct seq_file *seq, 
> void *v)
>   char   *querier;
>   long delta;
>  
> -#ifdef CONFIG_IP_MULTICAST
> - querier = IGMP_V1_SEEN(state->in_dev) ? "V1" :
> + querier = !IS_ENABLED(CONFIG_IP_MULTICAST) ? "NONE" :
> +   IGMP_V1_SEEN(state->in_dev) ? "V1" :
> IGMP_V2_SEEN(state->in_dev) ? "V2" :
> "V3";
> -#else
> - querier = "NONE";
> -#endif
>  
>   if (rcu_access_pointer(state->in_dev->mc_list) == im) {
>   seq_printf(seq, "%d\t%-10s: %5d %7s\n",
> 

Reviewed-by: Nikolay Borisov <ker...@kyup.com>


[PATCH 3/6] ipv4: Namespacify ip_dynaddr sysctl knob

2016-02-15 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/ip.h   |  3 ---
 include/net/netns/ipv4.h   |  2 ++
 net/ipv4/af_inet.c | 10 ++
 net/ipv4/sysctl_net_ipv4.c | 15 ---
 4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 1a98f1ca1638..e3fb25d76421 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -248,9 +248,6 @@ extern int inet_peer_maxttl;
 /* From ip_input.c */
 extern int sysctl_ip_early_demux;
 
-/* From ip_output.c */
-extern int sysctl_ip_dynaddr;
-
 void ipfrag_init(void);
 
 void ip_static_sysctl_init(void);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc8f7f94abcb..b7e3fb2587da 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -84,6 +84,8 @@ struct netns_ipv4 {
int sysctl_ip_no_pmtu_disc;
int sysctl_ip_fwd_use_pmtu;
int sysctl_ip_nonlocal_bind;
+   /* Shall we try to damage output packets if routing dev changes? */
+   int sysctl_ip_dynaddr;
 
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eade66db214e..209d1ed28954 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1095,12 +1095,6 @@ void inet_unregister_protosw(struct inet_protosw *p)
 }
 EXPORT_SYMBOL(inet_unregister_protosw);
 
-/*
- *  Shall we try to damage output packets if routing dev changes?
- */
-
-int sysctl_ip_dynaddr __read_mostly;
-
 static int inet_sk_reselect_saddr(struct sock *sk)
 {
struct inet_sock *inet = inet_sk(sk);
@@ -1131,7 +1125,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
if (new_saddr == old_saddr)
return 0;
 
-   if (sysctl_ip_dynaddr > 1) {
+   if (sock_net(sk)->ipv4.sysctl_ip_dynaddr > 1) {
pr_info("%s(): shifting inet->saddr from %pI4 to %pI4\n",
__func__, &old_saddr, &new_saddr);
}
@@ -1186,7 +1180,7 @@ int inet_sk_rebuild_header(struct sock *sk)
 * Other protocols have to map its equivalent state to 
TCP_SYN_SENT.
 * DCCP maps its DCCP_REQUESTING state to TCP_SYN_SENT. -acme
 */
-   if (!sysctl_ip_dynaddr ||
+   if (!sock_net(sk)->ipv4.sysctl_ip_dynaddr ||
sk->sk_state != TCP_SYN_SENT ||
(sk->sk_userlocks & SOCK_BINDADDR_LOCK) ||
(err = inet_sk_reselect_saddr(sk)) != 0)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a833a9f9e4cd..04ac5b763385 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -304,13 +304,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "ip_dynaddr",
-   .data   = &sysctl_ip_dynaddr,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fastopen",
.data   = &sysctl_tcp_fastopen,
.maxlen = sizeof(int),
@@ -744,6 +737,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "ip_dynaddr",
+   .data   = &init_net.ipv4.sysctl_ip_dynaddr,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "ip_default_ttl",
.data   = &init_net.ipv4.sysctl_ip_default_ttl,
.maxlen = sizeof(int),
@@ -989,6 +989,7 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
goto err_ports;
 
net->ipv4.sysctl_ip_default_ttl = IPDEFTTL;
+   net->ipv4.sysctl_ip_dynaddr = 0;
 
return 0;
 
-- 
2.5.0



[PATCH 1/6] ipv4: Namespaceify ip_default_ttl sysctl knob

2016-02-15 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h |  1 +
 include/net/route.h  |  5 ++---
 net/bridge/netfilter/nft_reject_bridge.c |  8 +---
 net/ipv4/ip_output.c |  3 ---
 net/ipv4/ip_sockglue.c   |  5 -
 net/ipv4/netfilter/ipt_SYNPROXY.c|  3 ++-
 net/ipv4/proc.c  |  2 +-
 net/ipv4/sysctl_net_ipv4.c   | 20 +++-
 8 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 848fe8056534..bc8f7f94abcb 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -80,6 +80,7 @@ struct netns_ipv4 {
int sysctl_tcp_ecn;
int sysctl_tcp_ecn_fallback;
 
+   int sysctl_ip_default_ttl;
int sysctl_ip_no_pmtu_disc;
int sysctl_ip_fwd_use_pmtu;
int sysctl_ip_nonlocal_bind;
diff --git a/include/net/route.h b/include/net/route.h
index a3b9ef74a389..9b0a523bb428 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -329,14 +329,13 @@ static inline int inet_iif(const struct sk_buff *skb)
return skb->skb_iif;
 }
 
-extern int sysctl_ip_default_ttl;
-
 static inline int ip4_dst_hoplimit(const struct dst_entry *dst)
 {
int hoplimit = dst_metric_raw(dst, RTAX_HOPLIMIT);
+   struct net *net = dev_net(dst->dev);
 
if (hoplimit == 0)
-   hoplimit = sysctl_ip_default_ttl;
+   hoplimit = net->ipv4.sysctl_ip_default_ttl;
return hoplimit;
 }
 
diff --git a/net/bridge/netfilter/nft_reject_bridge.c 
b/net/bridge/netfilter/nft_reject_bridge.c
index fdba3d9fbff3..adc8d7221dbb 100644
--- a/net/bridge/netfilter/nft_reject_bridge.c
+++ b/net/bridge/netfilter/nft_reject_bridge.c
@@ -48,6 +48,7 @@ static void nft_reject_br_send_v4_tcp_reset(struct sk_buff 
*oldskb,
struct iphdr *niph;
const struct tcphdr *oth;
struct tcphdr _oth;
+   struct net *net = sock_net(oldskb->sk);
 
if (!nft_bridge_iphdr_validate(oldskb))
return;
@@ -63,9 +64,9 @@ static void nft_reject_br_send_v4_tcp_reset(struct sk_buff 
*oldskb,
 
skb_reserve(nskb, LL_MAX_HEADER);
niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP,
-  sysctl_ip_default_ttl);
+  net->ipv4.sysctl_ip_default_ttl);
nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
-   niph->ttl   = sysctl_ip_default_ttl;
+   niph->ttl   = net->ipv4.sysctl_ip_default_ttl;
niph->tot_len   = htons(nskb->len);
ip_send_check(niph);
 
@@ -85,6 +86,7 @@ static void nft_reject_br_send_v4_unreach(struct sk_buff 
*oldskb,
void *payload;
__wsum csum;
u8 proto;
+   struct net *net = sock_net(oldskb->sk);
 
if (oldskb->csum_bad || !nft_bridge_iphdr_validate(oldskb))
return;
@@ -119,7 +121,7 @@ static void nft_reject_br_send_v4_unreach(struct sk_buff 
*oldskb,
 
skb_reserve(nskb, LL_MAX_HEADER);
niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_ICMP,
-  sysctl_ip_default_ttl);
+  net->ipv4.sysctl_ip_default_ttl);
 
skb_reset_transport_header(nskb);
icmph = (struct icmphdr *)skb_put(nskb, sizeof(struct icmphdr));
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 64878efa045c..f734c42acdaf 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -79,9 +79,6 @@
 #include 
 #include 
 
-int sysctl_ip_default_ttl __read_mostly = IPDEFTTL;
-EXPORT_SYMBOL(sysctl_ip_default_ttl);
-
 static int
 ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
unsigned int mtu,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 92808f147ef5..3f1befc4e17b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1341,10 +1341,13 @@ static int do_ip_getsockopt(struct sock *sk, int level, 
int optname,
val = inet->tos;
break;
case IP_TTL:
+   {
+   struct net *net = sock_net(sk);
val = (inet->uc_ttl == -1 ?
-  sysctl_ip_default_ttl :
+  net->ipv4.sysctl_ip_default_ttl :
   inet->uc_ttl);
break;
+   }
case IP_HDRINCL:
val = inet->hdrincl;
break;
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c 
b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 5fdc556514ba..7b8fbb352877 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -21,6 +21,7 @@ static struct iphdr *
 synproxy_build_ip(struct sk_buff *skb, __be32 saddr, __be32 daddr)
 {
struct iphdr *iph;
+   struct net *net = sock_net(skb->sk);
 
skb_reset_network_header(skb);
i

[PATCH 5/6] ipv4: namespacify ip fragment max dist sysctl knob

2016-02-15 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/inet_frag.h |  1 +
 net/ipv4/ip_fragment.c  | 25 +
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 12aac0fd6ee7..909972aa3acd 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -13,6 +13,7 @@ struct netns_frags {
int timeout;
int high_thresh;
int low_thresh;
+   int max_dist;
 };
 
 /**
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 187c6fcc3027..957161413335 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -54,8 +54,6 @@
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
  * as well. Or notify me, at least. --ANK
  */
-
-static int sysctl_ipfrag_max_dist __read_mostly = 64;
 static const char ip_frag_cache_name[] = "ip4-frags";
 
 struct ipfrag_skb_cb
@@ -150,7 +148,7 @@ static void ip4_frag_init(struct inet_frag_queue *q, const 
void *a)
qp->daddr = arg->iph->daddr;
qp->vif = arg->vif;
qp->user = arg->user;
-   qp->peer = sysctl_ipfrag_max_dist ?
+   qp->peer = q->net->max_dist ?
inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, arg->vif, 1) :
NULL;
 }
@@ -275,7 +273,7 @@ static struct ipq *ip_find(struct net *net, struct iphdr 
*iph,
 static int ip_frag_too_far(struct ipq *qp)
 {
struct inet_peer *peer = qp->peer;
-   unsigned int max = sysctl_ipfrag_max_dist;
+   unsigned int max = qp->q.net->max_dist;
unsigned int start, end;
 
int rc;
@@ -749,6 +747,14 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_jiffies,
},
+   {
+   .procname   = "ipfrag_max_dist",
+   .data   = &init_net.ipv4.frags.max_dist,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &zero
+   },
{ }
 };
 
@@ -762,14 +768,6 @@ static struct ctl_table ip4_frags_ctl_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_jiffies,
},
-   {
-   .procname   = "ipfrag_max_dist",
-   .data   = &sysctl_ipfrag_max_dist,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra1 = &zero
-   },
{ }
 };
 
@@ -790,6 +788,7 @@ static int __net_init ip4_frags_ns_ctl_register(struct net 
*net)
table[1].data = &net->ipv4.frags.low_thresh;
table[1].extra2 = &net->ipv4.frags.high_thresh;
table[2].data = &net->ipv4.frags.timeout;
+   table[3].data = &net->ipv4.frags.max_dist;
 
/* Don't export sysctls to unprivileged users */
if (net->user_ns != &init_user_ns)
@@ -865,6 +864,8 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 */
net->ipv4.frags.timeout = IP_FRAG_TIME;
 
+   net->ipv4.frags.max_dist = 64;
+
res = inet_frags_init_net(&net->ipv4.frags);
if (res)
return res;
-- 
2.5.0



[PATCH 2/6] igmp: net: Move igmp namespace init to correct file

2016-02-15 Thread Nikolay Borisov
When the igmp related sysctls were namespacified, their initialization
was erroneously put into the tcp socket namespace constructor. This
patch moves the relevant code into the igmp namespace constructor to
keep things consistent.

Also add #ifdef CONFIG_IP_MULTICAST guards to silence unused-variable
warnings.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 net/ipv4/igmp.c | 14 ++
 net/ipv4/tcp_ipv4.c |  6 --
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 7c95335bf85e..2aea9f1a2a31 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1224,7 +1224,9 @@ static void igmp_group_dropped(struct ip_mc_list *im)
 static void igmp_group_added(struct ip_mc_list *im)
 {
struct in_device *in_dev = im->interface;
+#ifdef CONFIG_IP_MULTICAST
struct net *net = dev_net(in_dev->dev);
+#endif
 
if (im->loaded == 0) {
im->loaded = 1;
@@ -1316,7 +1318,9 @@ static void ip_mc_hash_remove(struct in_device *in_dev,
 void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 {
struct ip_mc_list *im;
+#ifdef CONFIG_IP_MULTICAST
struct net *net = dev_net(in_dev->dev);
+#endif
 
ASSERT_RTNL();
 
@@ -1643,7 +1647,9 @@ void ip_mc_down(struct in_device *in_dev)
 
 void ip_mc_init_dev(struct in_device *in_dev)
 {
+#ifdef CONFIG_IP_MULTICAST
struct net *net = dev_net(in_dev->dev);
+#endif
ASSERT_RTNL();
 
 #ifdef CONFIG_IP_MULTICAST
@@ -1662,7 +1668,9 @@ void ip_mc_init_dev(struct in_device *in_dev)
 void ip_mc_up(struct in_device *in_dev)
 {
struct ip_mc_list *pmc;
+#ifdef CONFIG_IP_MULTICAST
struct net *net = dev_net(in_dev->dev);
+#endif
 
ASSERT_RTNL();
 
@@ -2923,6 +2931,12 @@ static int __net_init igmp_net_init(struct net *net)
goto out_sock;
}
 
+   /* Sysctl initialization */
+   net->ipv4.sysctl_igmp_max_memberships = 20;
+   net->ipv4.sysctl_igmp_max_msf = 10;
+   /* IGMP reports for link-local multicast groups are enabled by default 
*/
+   net->ipv4.sysctl_igmp_llm_reports = 1;
+   net->ipv4.sysctl_igmp_qrv = 2;
return 0;
 
 out_sock:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ba5d0146e3f0..3f872a6bc274 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2399,12 +2399,6 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
 
-   net->ipv4.sysctl_igmp_max_memberships = 20;
-   net->ipv4.sysctl_igmp_max_msf = 10;
-   /* IGMP reports for link-local multicast groups are enabled by default 
*/
-   net->ipv4.sysctl_igmp_llm_reports = 1;
-   net->ipv4.sysctl_igmp_qrv = 2;
-
return 0;
 fail:
tcp_sk_exit(net);
-- 
2.5.0



[PATCH 4/6] ipv4: namespacify ip_early_demux sysctl knob

2016-02-15 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/ip.h   |  3 ---
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/ip_input.c|  5 +
 net/ipv4/sysctl_net_ipv4.c | 15 ---
 net/ipv6/ip6_input.c   |  2 +-
 5 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index e3fb25d76421..cbb134b2f0e4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -245,9 +245,6 @@ extern int inet_peer_threshold;
 extern int inet_peer_minttl;
 extern int inet_peer_maxttl;
 
-/* From ip_input.c */
-extern int sysctl_ip_early_demux;
-
 void ipfrag_init(void);
 
 void ip_static_sysctl_init(void);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index b7e3fb2587da..a69cde3ce460 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -86,6 +86,7 @@ struct netns_ipv4 {
int sysctl_ip_nonlocal_bind;
/* Shall we try to damage output packets if routing dev changes? */
int sysctl_ip_dynaddr;
+   int sysctl_ip_early_demux;
 
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 852002f64c68..e3d782746d9d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -308,15 +308,12 @@ drop:
return true;
 }
 
-int sysctl_ip_early_demux __read_mostly = 1;
-EXPORT_SYMBOL(sysctl_ip_early_demux);
-
 static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
 
-   if (sysctl_ip_early_demux &&
+   if (net->ipv4.sysctl_ip_early_demux &&
!skb_dst(skb) &&
!skb->sk &&
!ip_is_fragment(iph)) {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 04ac5b763385..1e1fe6086dd9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -297,13 +297,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "ip_early_demux",
-   .data   = &sysctl_ip_early_demux,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fastopen",
.data   = &sysctl_tcp_fastopen,
.maxlen = sizeof(int),
@@ -744,6 +737,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "ip_early_demux",
+   .data   = &init_net.ipv4.sysctl_ip_early_demux,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "ip_default_ttl",
.data   = &init_net.ipv4.sysctl_ip_default_ttl,
.maxlen = sizeof(int),
@@ -990,6 +990,7 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
net->ipv4.sysctl_ip_default_ttl = IPDEFTTL;
net->ipv4.sysctl_ip_dynaddr = 0;
+   net->ipv4.sysctl_ip_early_demux = 1;
 
return 0;
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 31ac3c56da4b..c05c425c2389 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -49,7 +49,7 @@
 
 int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-   if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) {
+   if (net->ipv4.sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == 
NULL) {
const struct inet6_protocol *ipprot;
 
ipprot = rcu_dereference(inet6_protos[ipv6_hdr(skb)->nexthdr]);
-- 
2.5.0



[PATCH 6/6] net: Export ip fragment sysctl to unprivileged users

2016-02-15 Thread Nikolay Borisov
Now that all the ip fragmentation related sysctls are namespaceified
there is no reason to hide them anymore from "root" users inside
containers.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 net/ipv4/ip_fragment.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 957161413335..efbd47d1a531 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -789,10 +789,6 @@ static int __net_init ip4_frags_ns_ctl_register(struct net 
*net)
table[1].extra2 = &net->ipv4.frags.high_thresh;
table[2].data = &net->ipv4.frags.timeout;
table[3].data = &net->ipv4.frags.max_dist;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns)
-   table[0].procname = NULL;
}
 
hdr = register_net_sysctl(net, "net/ipv4", table);
-- 
2.5.0
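
For readers unfamiliar with the check removed in patch 6/6, here is a small
userspace sketch of why blanking table[0].procname hid every entry: the
table is walked until the first NULL procname, so a NULL in slot 0 makes it
look empty. struct model_ctl_table and both helpers are hypothetical
simplifications, not the kernel's struct ctl_table API.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a sysctl table entry. */
struct model_ctl_table {
	const char *procname; /* a NULL procname terminates the table */
	int *data;
};

/* Hypothetical: count the entries a registration walk would see. */
size_t model_visible_entries(const struct model_ctl_table *table)
{
	size_t n = 0;

	while (table[n].procname != NULL)
		n++;
	return n;
}

/*
 * Models the behaviour dropped by patch 6/6: for a namespace outside the
 * initial user namespace, the first entry is blanked, which hides the
 * whole table.
 */
void model_hide_from_unprivileged(struct model_ctl_table *table,
				  int in_init_user_ns)
{
	if (!in_init_user_ns)
		table[0].procname = NULL;
}
```

Once every knob in the table points at per-namespace storage, hiding the
table no longer protects any shared state, which is the argument the patch
description makes.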



[PATCH 0/6] Namespacify various ip sysctl knobs

2016-02-15 Thread Nikolay Borisov
[Resending since I forgot to cc linux-netdev]

Hello, 

This series continues namespacifying more net related knobs.
The focus here is on ip options. Patches 1, 3, 4 and 5 namespacify
the respective sysctl knobs. Patch 2 moves some igmp code to the 
correct file (and function) and also adds some #ifdef guards to 
silence compilation warnings. 

Finally, patch 6 exposes the ip fragmentation related sysctls to
unprivileged users, since all of those knobs are now namespaced.
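
To make the effect of patch 1 concrete, the following userspace sketch
models the hop-limit fallback in ip4_dst_hoplimit(): a zero route metric
falls back to the default TTL of the namespace rather than a global. The
trimmed struct net and model_dst_hoplimit() are hypothetical stand-ins,
and MODEL_IPDEFTTL assumes the usual default of 64.

```c
#include <assert.h>

#define MODEL_IPDEFTTL 64 /* assumed default; the patch uses IPDEFTTL */

/* Trimmed stand-in for the per-namespace state. */
struct net {
	int sysctl_ip_default_ttl;
};

/*
 * Modelled on ip4_dst_hoplimit(): if the route carries no explicit hop
 * limit (0), use the default TTL of the namespace owning the device.
 */
int model_dst_hoplimit(const struct net *net, int route_hoplimit)
{
	if (route_hoplimit == 0)
		return net->sysctl_ip_default_ttl;
	return route_hoplimit;
}
```

Two namespaces with different defaults then produce different TTLs for
routes that leave the hop limit unset, while an explicit route metric still
wins in both.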

Nikolay Borisov (6):
  ipv4: Namespaceify ip_default_ttl sysctl knob
  igmp: net: Move igmp namespace init to correct file
  ipv4: Namespacify ip_dynaddr sysctl knob
  ipv4: namespacify ip_early_demux sysctl knob
  ipv4: namespacify ip fragment max dist sysctl knob
  net: Export ip fragment sysctl to unprivileged users

 include/net/inet_frag.h  |  1 +
 include/net/ip.h |  6 
 include/net/netns/ipv4.h |  4 +++
 include/net/route.h  |  5 ++--
 net/bridge/netfilter/nft_reject_bridge.c |  8 +++--
 net/ipv4/af_inet.c   | 10 ++-
 net/ipv4/igmp.c  | 14 +
 net/ipv4/ip_fragment.c   | 29 +-
 net/ipv4/ip_input.c  |  5 +---
 net/ipv4/ip_output.c |  3 --
 net/ipv4/ip_sockglue.c   |  5 +++-
 net/ipv4/netfilter/ipt_SYNPROXY.c|  3 +-
 net/ipv4/proc.c  |  2 +-
 net/ipv4/sysctl_net_ipv4.c   | 50 +---
 net/ipv4/tcp_ipv4.c  |  6 
 net/ipv6/ip6_input.c |  2 +-
 16 files changed, 77 insertions(+), 76 deletions(-)

-- 
2.5.0



Re: linux-next: build warning after merge of the net-next tree

2016-02-15 Thread Nikolay Borisov


On 02/15/2016 04:09 AM, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) produced this warning:
> 
> net/ipv4/igmp.c: In function 'igmp_group_added':
> net/ipv4/igmp.c:1227:14: warning: unused variable 'net' [-Wunused-variable]
>   struct net *net = dev_net(in_dev->dev);
>   ^
> net/ipv4/igmp.c: In function 'ip_mc_inc_group':
> net/ipv4/igmp.c:1319:14: warning: unused variable 'net' [-Wunused-variable]
>   struct net *net = dev_net(in_dev->dev);
>   ^
> net/ipv4/igmp.c: In function 'ip_mc_init_dev':
> net/ipv4/igmp.c:1646:14: warning: unused variable 'net' [-Wunused-variable]
>   struct net *net = dev_net(in_dev->dev);
>   ^
> net/ipv4/igmp.c: In function 'ip_mc_up':
> net/ipv4/igmp.c:1665:14: warning: unused variable 'net' [-Wunused-variable]
>   struct net *net = dev_net(in_dev->dev);
>   ^
> 
> Introduced by commits
> 
>   87a8a2ae65b7 ("igmp: Namespaceify igmp_llm_reports sysctl knob")
>   165094afcee7 ("igmp: Namespacify igmp_qrv sysctl knob")
> 
> CONFIG_IP_MULTICAST is not set for this build.

Right, I forgot to add the ifdef guards for the respective variables.
I will include a patch for this in the next series namespacifying
various sysctls.

Thanks for testing!
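
For context, a minimal userspace sketch of the kind of guard meant here:
the local is declared only under the same condition as its single use, so
a build without the option does not warn. MODEL_IP_MULTICAST is a
hypothetical stand-in for CONFIG_IP_MULTICAST, and the structs are trimmed
placeholders rather than the real igmp types.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_IP_MULTICAST; left undefined here. */
/* #define MODEL_IP_MULTICAST 1 */

struct net {
	int sysctl_igmp_qrv;
};

struct model_in_device {
	struct net *net;
	int loaded;
	int crcount;
};

int model_group_added(struct model_in_device *in_dev)
{
#ifdef MODEL_IP_MULTICAST
	/* Declared only when used, which avoids -Wunused-variable. */
	struct net *net = in_dev->net;
#endif

	in_dev->loaded = 1;

#ifdef MODEL_IP_MULTICAST
	in_dev->crcount = net->sysctl_igmp_qrv;
#endif
	return in_dev->loaded;
}
```

Without the first guard, a compiler run with -Wunused-variable would flag
net whenever the option is disabled, matching the warnings quoted above.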



[PATCH 2/4] igmp: Namespaceify igmp_max_msf sysctl knob

2016-02-08 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/linux/igmp.h   |  1 -
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/igmp.c|  5 +
 net/ipv4/ip_sockglue.c |  5 +++--
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  1 +
 6 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 57d6d06ce0b3..a91ec9f575e7 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -38,7 +38,6 @@ static inline struct igmpv3_query *
 }
 
 extern int sysctl_igmp_llm_reports;
-extern int sysctl_igmp_max_msf;
 extern int sysctl_igmp_qrv;
 
 struct ip_sf_socklist {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 759cf624eec2..522a2cfe1ad9 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -109,6 +109,7 @@ struct netns_ipv4 {
unsigned int sysctl_tcp_notsent_lowat;
 
int sysctl_igmp_max_memberships;
+   int sysctl_igmp_max_msf;
 
struct ping_group_range ping_group_range;
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 5b86257c9d6b..6da2e467b63c 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -107,8 +107,6 @@
 #include 
 #endif
 
-#define IP_MAX_MSF 10
-
 /* IGMP reports for link-local multicast groups are enabled by default */
 int sysctl_igmp_llm_reports __read_mostly = 1;
 
@@ -1726,7 +1724,6 @@ static struct in_device *ip_mc_find_dev(struct net *net, 
struct ip_mreqn *imr)
 /*
  * Join a socket to a group
  */
-int sysctl_igmp_max_msf __read_mostly = IP_MAX_MSF;
 #ifdef CONFIG_IP_MULTICAST
 int sysctl_igmp_qrv __read_mostly = IGMP_QUERY_ROBUSTNESS_VARIABLE;
 #endif
@@ -2244,7 +2241,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, 
struct
}
/* else, add a new source to the filter */
 
-   if (psl && psl->sl_count >= sysctl_igmp_max_msf) {
+   if (psl && psl->sl_count >= net->ipv4.sysctl_igmp_max_msf) {
err = -ENOBUFS;
goto done;
}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 5f73a7c03e27..92808f147ef5 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -571,6 +571,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
int optname, char __user *optval, unsigned int 
optlen)
 {
struct inet_sock *inet = inet_sk(sk);
+   struct net *net = sock_net(sk);
int val = 0, err;
bool needs_rtnl = setsockopt_needs_rtnl(optname);
 
@@ -910,7 +911,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
}
/* numsrc >= (1G-4) overflow in 32 bits */
if (msf->imsf_numsrc >= 0x3ffffffcU ||
-   msf->imsf_numsrc > sysctl_igmp_max_msf) {
+   msf->imsf_numsrc > net->ipv4.sysctl_igmp_max_msf) {
kfree(msf);
err = -ENOBUFS;
break;
@@ -1065,7 +1066,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 
/* numsrc >= (4G-140)/128 overflow in 32 bits */
if (gsf->gf_numsrc >= 0x1ffffff ||
-   gsf->gf_numsrc > sysctl_igmp_max_msf) {
+   gsf->gf_numsrc > net->ipv4.sysctl_igmp_max_msf) {
err = -ENOBUFS;
goto mc_msf_out;
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6ea3dbb96db4..225659a02cf2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -367,13 +367,6 @@ static struct ctl_table ipv4_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
-   {
-   .procname   = "igmp_max_msf",
-   .data   = &sysctl_igmp_max_msf,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
 #ifdef CONFIG_IP_MULTICAST
{
.procname   = "igmp_qrv",
@@ -872,6 +865,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "igmp_max_msf",
+   .data   = &init_net.ipv4.sysctl_igmp_max_msf,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "tcp_keepalive_time",
.data   = &init_net.ipv4.sysctl_tcp_keepalive_time,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3e984d44295d..d0ac43b95378 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2398,6 +2398,7 @@ static int __net_init tcp_sk_init

[PATCH 3/4] igmp: Namespaceify igmp_llm_reports sysctl knob

2016-02-08 Thread Nikolay Borisov
This was initially introduced in commit df2cf4a78e48 ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array. However, it was never implemented to be
namespace aware. Fix this by changing the code accordingly.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/linux/igmp.h   |  1 -
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/igmp.c| 26 +++---
 net/ipv4/sysctl_net_ipv4.c |  2 +-
 net/ipv4/tcp_ipv4.c|  2 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index a91ec9f575e7..c683f4bf642b 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -37,7 +37,6 @@ static inline struct igmpv3_query *
return (struct igmpv3_query *)skb_transport_header(skb);
 }
 
-extern int sysctl_igmp_llm_reports;
 extern int sysctl_igmp_qrv;
 
 struct ip_sf_socklist {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 522a2cfe1ad9..cbbf8115e8a7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -110,6 +110,7 @@ struct netns_ipv4 {
 
int sysctl_igmp_max_memberships;
int sysctl_igmp_max_msf;
+   int sysctl_igmp_llm_reports;
 
struct ping_group_range ping_group_range;
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 6da2e467b63c..53ce9dd08b7a 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -107,9 +107,6 @@
 #include 
 #endif
 
-/* IGMP reports for link-local multicast groups are enabled by default */
-int sysctl_igmp_llm_reports __read_mostly = 1;
-
 #ifdef CONFIG_IP_MULTICAST
 /* Parameter names and values are taken from igmp-v2-06 draft */
 
@@ -430,6 +427,7 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct 
ip_mc_list *pmc,
int type, int gdeleted, int sdeleted)
 {
struct net_device *dev = pmc->interface->dev;
+   struct net *net = dev_net(dev);
struct igmpv3_report *pih;
struct igmpv3_grec *pgr = NULL;
struct ip_sf_list *psf, *psf_next, *psf_prev, **psf_list;
@@ -437,7 +435,7 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct 
ip_mc_list *pmc,
 
if (pmc->multiaddr == IGMP_ALL_HOSTS)
return skb;
-   if (ipv4_is_local_multicast(pmc->multiaddr) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(pmc->multiaddr) && !net->ipv4.sysctl_igmp_llm_reports)
return skb;
 
isquery = type == IGMPV3_MODE_IS_INCLUDE ||
@@ -540,6 +538,7 @@ empty_source:
 static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 {
struct sk_buff *skb = NULL;
+   struct net *net = dev_net(in_dev->dev);
int type;
 
if (!pmc) {
@@ -548,7 +547,7 @@ static int igmpv3_send_report(struct in_device *in_dev, 
struct ip_mc_list *pmc)
if (pmc->multiaddr == IGMP_ALL_HOSTS)
continue;
if (ipv4_is_local_multicast(pmc->multiaddr) &&
-!sysctl_igmp_llm_reports)
+!net->ipv4.sysctl_igmp_llm_reports)
continue;
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE])
@@ -684,7 +683,7 @@ static int igmp_send_report(struct in_device *in_dev, 
struct ip_mc_list *pmc,
if (type == IGMPV3_HOST_MEMBERSHIP_REPORT)
return igmpv3_send_report(in_dev, pmc);
 
-   if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(group) && !net->ipv4.sysctl_igmp_llm_reports)
return 0;
 
if (type == IGMP_HOST_LEAVE_MESSAGE)
@@ -855,12 +854,13 @@ static int igmp_marksources(struct ip_mc_list *pmc, int 
nsrcs, __be32 *srcs)
 static bool igmp_heard_report(struct in_device *in_dev, __be32 group)
 {
struct ip_mc_list *im;
+   struct net *net = dev_net(in_dev->dev);
 
/* Timers are only set for non-local groups */
 
if (group == IGMP_ALL_HOSTS)
return false;
-   if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(group) && !net->ipv4.sysctl_igmp_llm_reports)
return false;
 
rcu_read_lock();
@@ -884,6 +884,7 @@ static bool igmp_heard_query(struct in_device *in_dev, 
struct sk_buff *skb,
__be32  group = ih->group;
int max_delay;
int mark = 0;
+   struct net  *net = dev_net(in_dev->dev);
 
 
if (len == 8) {
@@ -969,7 +970,7 @@ static bool igmp_heard_query(struct in_device *in_dev, 
struct sk_buff *skb,
if (im->multiaddr == IGMP_ALL_HOSTS)
continue;
if (ipv4_is_local_mu

[PATCH 4/4] igmp: Namespacify igmp_qrv sysctl knob

2016-02-08 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/linux/igmp.h   |  2 --
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/igmp.c| 29 +
 net/ipv4/sysctl_net_ipv4.c | 20 ++--
 net/ipv4/tcp_ipv4.c|  1 +
 5 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index c683f4bf642b..12f6fba6d21a 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -37,8 +37,6 @@ static inline struct igmpv3_query *
return (struct igmpv3_query *)skb_transport_header(skb);
 }
 
-extern int sysctl_igmp_qrv;
-
 struct ip_sf_socklist {
unsigned int sl_max;
unsigned int sl_count;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index cbbf8115e8a7..848fe8056534 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -111,6 +111,7 @@ struct netns_ipv4 {
int sysctl_igmp_max_memberships;
int sysctl_igmp_max_msf;
int sysctl_igmp_llm_reports;
+   int sysctl_igmp_qrv;
 
struct ping_group_range ping_group_range;
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 53ce9dd08b7a..52d4a6d5cf81 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -762,9 +762,10 @@ static void igmp_ifc_timer_expire(unsigned long data)
 
 static void igmp_ifc_event(struct in_device *in_dev)
 {
+   struct net *net = dev_net(in_dev->dev);
if (IGMP_V1_SEEN(in_dev) || IGMP_V2_SEEN(in_dev))
return;
-   in_dev->mr_ifc_count = in_dev->mr_qrv ?: sysctl_igmp_qrv;
+   in_dev->mr_ifc_count = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
igmp_ifc_start_timer(in_dev, 1);
 }
 
@@ -1086,6 +1087,7 @@ static void ip_mc_filter_del(struct in_device *in_dev, 
__be32 addr)
 static void igmpv3_add_delrec(struct in_device *in_dev, struct ip_mc_list *im)
 {
struct ip_mc_list *pmc;
+   struct net *net = dev_net(in_dev->dev);
 
/* this is an "ip_mc_list" for convenience; only the fields below
 * are actually used. In particular, the refcnt and users are not
@@ -1100,7 +1102,7 @@ static void igmpv3_add_delrec(struct in_device *in_dev, 
struct ip_mc_list *im)
pmc->interface = im->interface;
in_dev_hold(in_dev);
pmc->multiaddr = im->multiaddr;
-   pmc->crcount = in_dev->mr_qrv ?: sysctl_igmp_qrv;
+   pmc->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
pmc->sfmode = im->sfmode;
if (pmc->sfmode == MCAST_INCLUDE) {
struct ip_sf_list *psf;
@@ -1245,7 +1247,7 @@ static void igmp_group_added(struct ip_mc_list *im)
}
/* else, v3 */
 
-   im->crcount = in_dev->mr_qrv ?: sysctl_igmp_qrv;
+   im->crcount = in_dev->mr_qrv ?: net->ipv4.sysctl_igmp_qrv;
igmp_ifc_event(in_dev);
 #endif
 }
@@ -1314,6 +1316,7 @@ static void ip_mc_hash_remove(struct in_device *in_dev,
 void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 {
struct ip_mc_list *im;
+   struct net *net = dev_net(in_dev->dev);
 
ASSERT_RTNL();
 
@@ -1340,7 +1343,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 
addr)
spin_lock_init(&im->lock);
 #ifdef CONFIG_IP_MULTICAST
setup_timer(&im->timer, igmp_timer_expire, (unsigned long)im);
-   im->unsolicit_count = sysctl_igmp_qrv;
+   im->unsolicit_count = net->ipv4.sysctl_igmp_qrv;
 #endif
 
im->next_rcu = in_dev->mc_list;
@@ -1640,6 +1643,7 @@ void ip_mc_down(struct in_device *in_dev)
 
 void ip_mc_init_dev(struct in_device *in_dev)
 {
+   struct net *net = dev_net(in_dev->dev);
ASSERT_RTNL();
 
 #ifdef CONFIG_IP_MULTICAST
@@ -1647,7 +1651,7 @@ void ip_mc_init_dev(struct in_device *in_dev)
(unsigned long)in_dev);
setup_timer(&in_dev->mr_ifc_timer, igmp_ifc_timer_expire,
(unsigned long)in_dev);
-   in_dev->mr_qrv = sysctl_igmp_qrv;
+   in_dev->mr_qrv = net->ipv4.sysctl_igmp_qrv;
 #endif
 
spin_lock_init(&in_dev->mc_tomb_lock);
@@ -1658,11 +1662,12 @@ void ip_mc_init_dev(struct in_device *in_dev)
 void ip_mc_up(struct in_device *in_dev)
 {
struct ip_mc_list *pmc;
+   struct net *net = dev_net(in_dev->dev);
 
ASSERT_RTNL();
 
 #ifdef CONFIG_IP_MULTICAST
-   in_dev->mr_qrv = sysctl_igmp_qrv;
+   in_dev->mr_qrv = net->ipv4.sysctl_igmp_qrv;
 #endif
ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
 
@@ -1728,9 +1733,6 @@ static struct in_device *ip_mc_find_dev(struct net *net, 
struct ip_mreqn *imr)
 /*
  * Join a socket to a group
  */
-#ifdef CONFIG_IP_MULTICAST
-int sysctl_igmp_qrv __read_mostly = IGMP_QUERY_ROBUSTNESS_VARIABLE;
-#endif
 
 static int ip_mc_del1_src(struct ip_mc_list *pmc, int sfmode,
__be32 *psfsrc)
@@ -1755,6 +1757,7 @@ static int ip_mc_del1_src(

[PATCH 1/4] igmp: Namespaceify igmp_max_memberships sysctl knob

2016-02-08 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/linux/igmp.h   |  1 -
 include/net/netns/ipv4.h   |  2 ++
 net/ipv4/igmp.c|  4 +---
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  2 ++
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 9c9de11549a7..57d6d06ce0b3 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -38,7 +38,6 @@ static inline struct igmpv3_query *
 }
 
 extern int sysctl_igmp_llm_reports;
-extern int sysctl_igmp_max_memberships;
 extern int sysctl_igmp_max_msf;
 extern int sysctl_igmp_qrv;
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 4d6ec3f6fafe..759cf624eec2 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -108,6 +108,8 @@ struct netns_ipv4 {
int sysctl_tcp_fin_timeout;
unsigned int sysctl_tcp_notsent_lowat;
 
+   int sysctl_igmp_max_memberships;
+
struct ping_group_range ping_group_range;
 
atomic_t dev_addr_genid;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 05e4cba14162..5b86257c9d6b 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -107,7 +107,6 @@
 #include 
 #endif
 
-#define IP_MAX_MEMBERSHIPS 20
 #define IP_MAX_MSF 10
 
 /* IGMP reports for link-local multicast groups are enabled by default */
@@ -1727,7 +1726,6 @@ static struct in_device *ip_mc_find_dev(struct net *net, 
struct ip_mreqn *imr)
 /*
  * Join a socket to a group
  */
-int sysctl_igmp_max_memberships __read_mostly = IP_MAX_MEMBERSHIPS;
 int sysctl_igmp_max_msf __read_mostly = IP_MAX_MSF;
 #ifdef CONFIG_IP_MULTICAST
 int sysctl_igmp_qrv __read_mostly = IGMP_QUERY_ROBUSTNESS_VARIABLE;
@@ -2074,7 +2072,7 @@ int ip_mc_join_group(struct sock *sk, struct ip_mreqn 
*imr)
count++;
}
err = -ENOBUFS;
-   if (count >= sysctl_igmp_max_memberships)
+   if (count >= net->ipv4.sysctl_igmp_max_memberships)
goto done;
iml = sock_kmalloc(sk, sizeof(*iml), GFP_KERNEL);
if (!iml)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 44bb59824267..6ea3dbb96db4 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -368,13 +368,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "igmp_max_memberships",
-   .data   = &sysctl_igmp_max_memberships,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "igmp_max_msf",
.data   = &sysctl_igmp_max_msf,
.maxlen = sizeof(int),
@@ -872,6 +865,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "igmp_max_memberships",
+   .data   = &init_net.ipv4.sysctl_igmp_max_memberships,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
.procname   = "tcp_keepalive_time",
.data   = &init_net.ipv4.sysctl_tcp_keepalive_time,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0d381fa164f8..3e984d44295d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2397,6 +2397,8 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
 
+   net->ipv4.sysctl_igmp_max_memberships = 20;
+
return 0;
 fail:
tcp_sk_exit(net);
-- 
2.5.0



[PATCH 0/4] Make igmp sysctl knobs namespace aware

2016-02-08 Thread Nikolay Borisov
This series continues making more of the net-related sysctls
namespace aware. The first 2 patches and the last one are
straightforward and convert sysctls which weren't defined to be
namespace aware. The only thing of note in them is that each removes
a define which is used in only one place (to initialise
the respective sysctl), so I don't think this is a huge loss.

The third patch, however, converts igmp_llm_reports, which was
already defined in the ipv4_net_table but wasn't using any of
the net namespace infrastructure.
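
As a rough illustration of the conversion described above, the sketch below
replaces a file-scope variable initialised from a define with a field in a
per-namespace structure that is set in an init hook. The struct and function
names (netns_ipv4_sketch, net_sketch, net_sketch_init, may_join_group) are
hypothetical simplifications rather than the kernel's definitions; only the
default values 20 and 10 are taken from the defines the patches remove.

```c
/* Hypothetical, heavily simplified stand-in for struct netns_ipv4. */
struct netns_ipv4_sketch {
	int sysctl_igmp_max_memberships;
	int sysctl_igmp_max_msf;
};

/* Hypothetical stand-in for struct net. */
struct net_sketch {
	struct netns_ipv4_sketch ipv4;
};

/*
 * Per-namespace init hook. The defaults that used to come from
 * IP_MAX_MEMBERSHIPS and IP_MAX_MSF are assigned here, once for
 * every namespace that gets created.
 */
int net_sketch_init(struct net_sketch *net)
{
	net->ipv4.sysctl_igmp_max_memberships = 20;
	net->ipv4.sysctl_igmp_max_msf = 10;
	return 0;
}

/* A limit check now reads the value from the caller's namespace. */
int may_join_group(const struct net_sketch *net, int count)
{
	return count < net->ipv4.sysctl_igmp_max_memberships;
}
```

With this layout, lowering the limit in one namespace leaves every other
namespace at its own value.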

Nikolay Borisov (4):
  igmp: Namespaceify igmp_max_memberships sysctl knob
  igmp: Namespaceify igmp_max_msf sysctl knob
  igmp: Namespaceify igmp_llm_reports sysctl knob
  igmp: Namespacify igmp_qrv sysctl knob

 include/linux/igmp.h   |  5 
 include/net/netns/ipv4.h   |  5 
 net/ipv4/igmp.c| 64 --
 net/ipv4/ip_sockglue.c |  5 ++--
 net/ipv4/sysctl_net_ipv4.c | 50 ++--
 net/ipv4/tcp_ipv4.c|  6 +
 6 files changed, 73 insertions(+), 62 deletions(-)

-- 
2.5.0



[PATCH v2 3/4] igmp: Namespaceify igmp_llm_reports sysctl knob

2016-02-08 Thread Nikolay Borisov
From: Nikolay Borisov <n.bori...@siteground.com>

This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array. However, it was never implemented to be
namespace aware. Fix this by changing the code accordingly.
---

v2:
 Move definition of a local struct net var 
 inside ifdef to silence build warning

 include/linux/igmp.h   |  1 -
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/igmp.c| 26 +++---
 net/ipv4/sysctl_net_ipv4.c |  2 +-
 net/ipv4/tcp_ipv4.c|  2 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index a91ec9f575e7..c683f4bf642b 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -37,7 +37,6 @@ static inline struct igmpv3_query *
return (struct igmpv3_query *)skb_transport_header(skb);
 }
 
-extern int sysctl_igmp_llm_reports;
 extern int sysctl_igmp_qrv;
 
 struct ip_sf_socklist {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 522a2cfe1ad9..cbbf8115e8a7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -110,6 +110,7 @@ struct netns_ipv4 {
 
int sysctl_igmp_max_memberships;
int sysctl_igmp_max_msf;
+   int sysctl_igmp_llm_reports;
 
struct ping_group_range ping_group_range;
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 6da2e467b63c..2e22ee0efc98 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -107,9 +107,6 @@
 #include 
 #endif
 
-/* IGMP reports for link-local multicast groups are enabled by default */
-int sysctl_igmp_llm_reports __read_mostly = 1;
-
 #ifdef CONFIG_IP_MULTICAST
 /* Parameter names and values are taken from igmp-v2-06 draft */
 
@@ -430,6 +427,7 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct 
ip_mc_list *pmc,
int type, int gdeleted, int sdeleted)
 {
struct net_device *dev = pmc->interface->dev;
+   struct net *net = dev_net(dev);
struct igmpv3_report *pih;
struct igmpv3_grec *pgr = NULL;
struct ip_sf_list *psf, *psf_next, *psf_prev, **psf_list;
@@ -437,7 +435,7 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct 
ip_mc_list *pmc,
 
if (pmc->multiaddr == IGMP_ALL_HOSTS)
return skb;
-   if (ipv4_is_local_multicast(pmc->multiaddr) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(pmc->multiaddr) && !net->ipv4.sysctl_igmp_llm_reports)
return skb;
 
isquery = type == IGMPV3_MODE_IS_INCLUDE ||
@@ -540,6 +538,7 @@ empty_source:
 static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 {
struct sk_buff *skb = NULL;
+   struct net *net = dev_net(in_dev->dev);
int type;
 
if (!pmc) {
@@ -548,7 +547,7 @@ static int igmpv3_send_report(struct in_device *in_dev, 
struct ip_mc_list *pmc)
if (pmc->multiaddr == IGMP_ALL_HOSTS)
continue;
if (ipv4_is_local_multicast(pmc->multiaddr) &&
-!sysctl_igmp_llm_reports)
+!net->ipv4.sysctl_igmp_llm_reports)
continue;
spin_lock_bh(&pmc->lock);
if (pmc->sfcount[MCAST_EXCLUDE])
@@ -684,7 +683,7 @@ static int igmp_send_report(struct in_device *in_dev, 
struct ip_mc_list *pmc,
if (type == IGMPV3_HOST_MEMBERSHIP_REPORT)
return igmpv3_send_report(in_dev, pmc);
 
-   if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(group) && !net->ipv4.sysctl_igmp_llm_reports)
return 0;
 
if (type == IGMP_HOST_LEAVE_MESSAGE)
@@ -855,12 +854,13 @@ static int igmp_marksources(struct ip_mc_list *pmc, int 
nsrcs, __be32 *srcs)
 static bool igmp_heard_report(struct in_device *in_dev, __be32 group)
 {
struct ip_mc_list *im;
+   struct net *net = dev_net(in_dev->dev);
 
/* Timers are only set for non-local groups */
 
if (group == IGMP_ALL_HOSTS)
return false;
-   if (ipv4_is_local_multicast(group) && !sysctl_igmp_llm_reports)
+   if (ipv4_is_local_multicast(group) && !net->ipv4.sysctl_igmp_llm_reports)
return false;
 
rcu_read_lock();
@@ -884,6 +884,7 @@ static bool igmp_heard_query(struct in_device *in_dev, 
struct sk_buff *skb,
__be32  group = ih->group;
int max_delay;
int mark = 0;
+   struct net  *net = dev_net(in_dev->dev);
 
 
if (len == 8) {
@@ -969,7 +970,7 @@ static bool igmp_heard_query(struct in_device *in_dev, 
struct sk_buff *skb,
if (i

[RESEND PATCH 1/9] ipv4: Namespaceify tcp syn retries sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/tcp.h  |  1 -
 net/ipv4/sysctl_net_ipv4.c | 18 +-
 net/ipv4/tcp.c |  3 ++-
 net/ipv4/tcp_ipv4.c|  2 ++
 net/ipv4/tcp_timer.c   |  4 ++--
 6 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ffa2777b6475..59c6155e4896 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -95,6 +95,8 @@ struct netns_ipv4 {
int sysctl_tcp_keepalive_probes;
int sysctl_tcp_keepalive_intvl;
 
+   int sysctl_tcp_syn_retries;
+
struct ping_group_range ping_group_range;
 
atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3ed10fc89c7d..a7f6f25297d7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_syn_retries;
 extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fccf8e92bf81..db95287d2b94 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -293,15 +293,6 @@ static struct ctl_table ipv4_table[] = {
.extra2 = &ip_ttl_max,
},
{
-   .procname   = "tcp_syn_retries",
-   .data   = &sysctl_tcp_syn_retries,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra1 = &tcp_syn_retries_min,
-   .extra2 = &tcp_syn_retries_max
-   },
-   {
.procname   = "tcp_synack_retries",
.data   = &sysctl_tcp_synack_retries,
.maxlen = sizeof(int),
@@ -950,6 +941,15 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_jiffies,
},
+   {
+   .procname   = "tcp_syn_retries",
+   .data   = &init_net.ipv4.sysctl_tcp_syn_retries,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &tcp_syn_retries_min,
+   .extra2 = &tcp_syn_retries_max
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c82cca18c90f..bb36a39b5685 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2722,6 +2722,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
int val, len;
 
if (get_user(len, optlen))
@@ -2756,7 +2757,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
val = keepalive_probes(tp);
break;
case TCP_SYNCNT:
-   val = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+   val = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_syn_retries;
break;
case TCP_LINGER2:
val = tp->linger2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9db9bdb14449..c9944e0c48d3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2382,6 +2382,8 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;
 
+   net->ipv4.sysctl_tcp_syn_retries = TCP_SYN_RETRIES;
+
return 0;
 fail:
tcp_sk_exit(net);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index a4730a28b220..c5d51f530c65 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 
-int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
 int sysctl_tcp_synack_retries __read_mostly = TCP_SYNACK_RETRIES;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
@@ -157,6 +156,7 @@ static int tcp_write_timeout(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
int retry_until;
bool do_reset, syn_set = false;
 
@@ -169,7 +169,7 @@ static int tcp_write_timeout(struct sock *sk)
NET_INC_STATS_BH(sock_net(sk),
 
LINUX_MIB_TCPFASTOPENACTIVEFAIL);
}
-   retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+   retry_until = 

[RESEND PATCH 2/9] ipv4: Namespaceify tcp synack retries sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h|  1 +
 include/net/tcp.h   |  1 -
 net/ipv4/inet_connection_sock.c |  7 ++-
 net/ipv4/sysctl_net_ipv4.c  | 14 +++---
 net/ipv4/tcp_ipv4.c |  1 +
 net/ipv4/tcp_timer.c|  3 +--
 6 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 59c6155e4896..bca049102441 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -96,6 +96,7 @@ struct netns_ipv4 {
int sysctl_tcp_keepalive_intvl;
 
int sysctl_tcp_syn_retries;
+   int sysctl_tcp_synack_retries;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a7f6f25297d7..5a162875e80c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 46b9c887bede..9b17c1792dce 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -482,10 +482,6 @@ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
 #define AF_INET_FAMILY(fam) true
 #endif
 
-/* Only thing we need from tcp.h */
-extern int sysctl_tcp_synack_retries;
-
-
 /* Decide when to expire the request and when to resend SYN-ACK */
 static inline void syn_ack_recalc(struct request_sock *req, const int thresh,
  const int max_retries,
@@ -557,6 +553,7 @@ static void reqsk_timer_handler(unsigned long data)
 {
struct request_sock *req = (struct request_sock *)data;
struct sock *sk_listener = req->rsk_listener;
+   struct net *net = sock_net(sk_listener);
struct inet_connection_sock *icsk = inet_csk(sk_listener);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
int qlen, expire = 0, resend = 0;
@@ -566,7 +563,7 @@ static void reqsk_timer_handler(unsigned long data)
if (sk_state_load(sk_listener) != TCP_LISTEN)
goto drop;
 
-   max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+   max_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
thresh = max_retries;
/* Normally all the openreqs are young and become mature
 * (i.e. converted to established socket) for first timeout.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index db95287d2b94..5dd89de5bf8d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -293,13 +293,6 @@ static struct ctl_table ipv4_table[] = {
.extra2 = &ip_ttl_max,
},
{
-   .procname   = "tcp_synack_retries",
-   .data   = &sysctl_tcp_synack_retries,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_max_orphans",
.data   = &sysctl_tcp_max_orphans,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = &tcp_syn_retries_min,
.extra2 = &tcp_syn_retries_max
},
+   {
+   .procname   = "tcp_synack_retries",
+   .data   = &init_net.ipv4.sysctl_tcp_synack_retries,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c9944e0c48d3..a5268576021c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2383,6 +2383,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;
 
net->ipv4.sysctl_tcp_syn_retries = TCP_SYN_RETRIES;
+   net->ipv4.sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES;
 
return 0;
 fail:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c5d51f530c65..ca25fdf0c525 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 
-int sysctl_tcp_synack_retries __read_mostly = TCP_SYNACK_RETRIES;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
 int sysctl_tcp_orphan_retries __read_mostly;
@@ -332,7 +331,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
int max_retries = icsk->icsk_syn_retries ? :
-   sy

[RESEND PATCH 0/9] Namespaceify more of the tcp sysctl knobs

2016-02-02 Thread Nikolay Borisov
This patch series continues making more of the tcp-related
sysctl knobs per net-namespace. Most of these apply per
socket and have global defaults, so the conversion should be
safe and I don't expect any breakage.

Having those per net-namespace is useful when multiple
containers are hosted and the tcp settings need to be tuned
for each independently of the host node.
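
The lookup pattern the patches in this series rely on (a per-socket value
that falls back to the namespace default when unset) can be sketched as
follows. The types ns_sketch and sock_sketch and the helper
effective_syn_retries() are hypothetical stand-ins, not kernel APIs; the
real code reaches the namespace through sock_net(sk) and uses the GNU C
`?:` shorthand, which is spelled out in portable C here.

```c
/* Hypothetical stand-in for struct net. */
struct ns_sketch {
	int sysctl_tcp_syn_retries;	/* per-namespace default */
};

/* Hypothetical stand-in for a TCP socket. */
struct sock_sketch {
	const struct ns_sketch *ns;	/* namespace the socket lives in */
	int syn_retries;		/* per-socket override, 0 means unset */
};

/*
 * Portable spelling of 'sk->syn_retries ?: default': use the socket's
 * own setting if it is non-zero, otherwise the default of the namespace
 * the socket belongs to.
 */
int effective_syn_retries(const struct sock_sketch *sk)
{
	return sk->syn_retries ? sk->syn_retries
			       : sk->ns->sysctl_tcp_syn_retries;
}
```

Two sockets with no override set therefore see different values when they
live in namespaces that were tuned differently, which is the per-container
behaviour described above.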

I've split the patches to be per-sysctl but after
the review if the outcome is positive I'm happy
to either send it in one big blob or just.  

Nikolay Borisov (9):
  ipv4: Namespaceify tcp syn retries sysctl knob
  ipv4: Namespaceify tcp synack retries sysctl knob
  ipv4: Namespaceify tcp syncookies sysctl knob
  ipv4: Namespaceify tcp reordering sysctl knob
  ipv4: Namespaceify tcp_retries1 sysctl knob
  ipv4: Namespaceify tcp_retries2 sysctl knob
  ipv4: Namespaceify tcp_orphan_retries sysctl knob
  ipv4: Namespaceify tcp_fin_timeout sysctl knob
  ipv4: Namespaceify tcp_notsent_lowat sysctl knob

 include/net/netns/ipv4.h|  10 +++
 include/net/tcp.h   |  17 ++---
 net/ipv4/inet_connection_sock.c |   7 +--
 net/ipv4/syncookies.c   |   4 +-
 net/ipv4/sysctl_net_ipv4.c  | 136 
 net/ipv4/tcp.c  |  12 ++--
 net/ipv4/tcp_input.c|  22 ---
 net/ipv4/tcp_ipv4.c |  11 +++-
 net/ipv4/tcp_metrics.c  |   3 +-
 net/ipv4/tcp_minisocks.c|   3 -
 net/ipv4/tcp_output.c   |   6 +-
 net/ipv4/tcp_timer.c|  23 +++
 net/ipv6/syncookies.c   |   2 +-
 13 files changed, 130 insertions(+), 126 deletions(-)

-- 
2.5.0



[RESEND PATCH 9/9] ipv4: Namespaceify tcp_notsent_lowat sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  4 ++--
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  1 +
 net/ipv4/tcp_output.c  |  3 ---
 5 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a1caddadecc2..df265ea5bc72 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -103,6 +103,7 @@ struct netns_ipv4 {
int sysctl_tcp_retries2;
int sysctl_tcp_orphan_retries;
int sysctl_tcp_fin_timeout;
+   unsigned int sysctl_tcp_notsent_lowat;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f8c3f75e6c99..83de2b9f970e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,7 +267,6 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
-extern unsigned int sysctl_tcp_notsent_lowat;
 extern int sysctl_tcp_min_tso_segs;
 extern int sysctl_tcp_min_rtt_wlen;
 extern int sysctl_tcp_autocorking;
@@ -1665,7 +1664,8 @@ void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr, __be32 daddr);
 
 static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
 {
-   return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
+   struct net *net = sock_net((struct sock *)tp);
+   return tp->notsent_lowat ?: net->ipv4.sysctl_tcp_notsent_lowat;
 }
 
 static inline bool tcp_stream_memory_free(const struct sock *sk)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 20e086f88438..23afa08641c2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -457,13 +457,6 @@ static struct ctl_table ipv4_table[] = {
.extra1 = ,
},
{
-   .procname   = "tcp_notsent_lowat",
-   .data   = &sysctl_tcp_notsent_lowat,
-   .maxlen = sizeof(sysctl_tcp_notsent_lowat),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   },
-   {
.procname   = "tcp_rmem",
.data   = &sysctl_tcp_rmem,
.maxlen = sizeof(sysctl_tcp_rmem),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_jiffies,
},
+   {
+   .procname   = "tcp_notsent_lowat",
+   .data   = &init_net.ipv4.sysctl_tcp_notsent_lowat,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3c263c00f5ea..2871acf8f4b9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2389,6 +2389,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
net->ipv4.sysctl_tcp_orphan_retries = 0;
net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
+   net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
 
return 0;
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e997488b2f8f..54455739e851 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -62,9 +62,6 @@ int sysctl_tcp_tso_win_divisor __read_mostly = 3;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
-unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
-EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
-
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
   int push_one, gfp_t gfp);
 
-- 
2.5.0



[RESEND PATCH 6/9] ipv4: Namespaceify tcp_retries2 sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  1 -
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  1 +
 net/ipv4/tcp_output.c  |  3 ++-
 net/ipv4/tcp_timer.c   |  5 ++---
 6 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 250bd940eb94..3cb2073c55f5 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -100,6 +100,7 @@ struct netns_ipv4 {
int sysctl_tcp_syncookies;
int sysctl_tcp_reordering;
int sysctl_tcp_retries1;
+   int sysctl_tcp_retries2;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 60ee244772c9..9b3aabbac85e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 12216ec333b4..39c302fda534 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -321,13 +321,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_retries2",
-   .data   = &sysctl_tcp_retries2,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fin_timeout",
.data   = &sysctl_tcp_fin_timeout,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec_minmax,
.extra2 = &tcp_retr1_max
},
+   {
+   .procname   = "tcp_retries2",
+   .data   = &init_net.ipv4.sysctl_tcp_retries2,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea5ed84f4fb1..3a2db4a7d651 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2386,6 +2386,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_syncookies = 0;
net->ipv4.sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
+   net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
 
return 0;
 fail:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9bfc39ff2285..e997488b2f8f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3471,6 +3471,7 @@ void tcp_send_probe0(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
unsigned long probe_max;
int err;
 
@@ -3484,7 +3485,7 @@ void tcp_send_probe0(struct sock *sk)
}
 
if (err <= 0) {
-   if (icsk->icsk_backoff < sysctl_tcp_retries2)
+   if (icsk->icsk_backoff < net->ipv4.sysctl_tcp_retries2)
icsk->icsk_backoff++;
icsk->icsk_probes_out++;
probe_max = TCP_RTO_MAX;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6694e33149b9..09f4e0297e56 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 
-int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
 int sysctl_tcp_orphan_retries __read_mostly;
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
 
@@ -189,7 +188,7 @@ static int tcp_write_timeout(struct sock *sk)
dst_negative_advice(sk);
}
 
-   retry_until = sysctl_tcp_retries2;
+   retry_until = net->ipv4.sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
const bool alive = icsk->icsk_rto < TCP_RTO_MAX;
 
@@ -303,7 +302,7 @@ static void tcp_probe_timer(struct sock *sk)
 (s32)(tcp_time_stamp - start_ts) > icsk->icsk_user_timeout)
goto abort;
 
-   max_probes = sysctl_tcp_retries2;
+   max_probes = sock_net(sk)->ipv4.sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
const bool alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;
 
-- 
2.5.0



[RESEND PATCH 3/9] ipv4: Namespaceify tcp syncookies sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/tcp.h  |  1 -
 net/ipv4/syncookies.c  |  4 +---
 net/ipv4/sysctl_net_ipv4.c | 18 +-
 net/ipv4/tcp_input.c   | 10 ++
 net/ipv4/tcp_ipv4.c|  3 ++-
 net/ipv4/tcp_minisocks.c   |  3 ---
 net/ipv6/syncookies.c  |  2 +-
 8 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bca049102441..80da0d095eaf 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -98,6 +98,8 @@ struct netns_ipv4 {
int sysctl_tcp_syn_retries;
int sysctl_tcp_synack_retries;
 
+   int sysctl_tcp_syncookies;
+
struct ping_group_range ping_group_range;
 
atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a162875e80c..5497cc809601 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -243,7 +243,6 @@ extern int sysctl_tcp_fin_timeout;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
-extern int sysctl_tcp_syncookies;
 extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 4cbe9f0a4281..1c2bfda72c07 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -19,8 +19,6 @@
 #include 
 #include 
 
-extern int sysctl_tcp_syncookies;
-
 static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
 
 #define COOKIEBITS 24  /* Upper bits store count */
@@ -307,7 +305,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
__u8 rcv_wscale;
struct flowi4 fl4;
 
-   if (!sysctl_tcp_syncookies || !th->ack || th->rst)
+   if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
goto out;
 
if (tcp_synq_no_recent_overflow(sk))
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5dd89de5bf8d..007b9f8f7a2a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -342,15 +342,6 @@ static struct ctl_table ipv4_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_jiffies,
},
-#ifdef CONFIG_SYN_COOKIES
-   {
-   .procname   = "tcp_syncookies",
-   .data   = &sysctl_tcp_syncookies,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-#endif
{
.procname   = "tcp_fastopen",
.data   = &sysctl_tcp_fastopen,
@@ -950,6 +941,15 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+#ifdef CONFIG_SYN_COOKIES
+   {
+   .procname   = "tcp_syncookies",
+   .data   = &init_net.ipv4.sysctl_tcp_syncookies,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+#endif
{ }
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d656eef7f8e..dc8fe6c8a2e0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6114,9 +6114,10 @@ static bool tcp_syn_flood_action(const struct sock *sk,
struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
const char *msg = "Dropping request";
bool want_cookie = false;
+   struct net *net = sock_net(sk);
 
 #ifdef CONFIG_SYN_COOKIES
-   if (sysctl_tcp_syncookies) {
+   if (net->ipv4.sysctl_tcp_syncookies) {
msg = "Sending cookies";
want_cookie = true;
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPREQQFULLDOCOOKIES);
@@ -6125,7 +6126,7 @@ static bool tcp_syn_flood_action(const struct sock *sk,
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPREQQFULLDROP);
 
if (!queue->synflood_warned &&
-   sysctl_tcp_syncookies != 2 &&
+   net->ipv4.sysctl_tcp_syncookies != 2 &&
xchg(&queue->synflood_warned, 1) == 0)
pr_info("%s: Possible SYN flooding on port %d. %s.  Check SNMP counters.\n",
proto, ntohs(tcp_hdr(skb)->dest), msg);
@@ -6158,6 +6159,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
__u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
struct tcp_options_received tmp_opt;
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
struct sock *fastopen_sk = NULL;
struct dst_entry *dst = NULL;
struct request_sock *req;
@@ -6168,7 +6170,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 * limit

[RESEND PATCH 5/9] ipv4: Namespaceify tcp_retries1 sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  1 -
 net/ipv4/sysctl_net_ipv4.c | 16 
 net/ipv4/tcp_ipv4.c|  1 +
 net/ipv4/tcp_timer.c   |  8 
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index dff8879e02fe..250bd940eb94 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -99,6 +99,7 @@ struct netns_ipv4 {
int sysctl_tcp_synack_retries;
int sysctl_tcp_syncookies;
int sysctl_tcp_reordering;
+   int sysctl_tcp_retries1;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 64d01d289441..60ee244772c9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_fastopen;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 12d752e6380b..12216ec333b4 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -321,14 +321,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_retries1",
-   .data   = &sysctl_tcp_retries1,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra2 = &tcp_retr1_max
-   },
-   {
.procname   = "tcp_retries2",
.data   = &sysctl_tcp_retries2,
.maxlen = sizeof(int),
@@ -950,6 +942,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_retries1",
+   .data   = &init_net.ipv4.sysctl_tcp_retries1,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra2 = &tcp_retr1_max
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 785bbebd6768..ea5ed84f4fb1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2385,6 +2385,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES;
net->ipv4.sysctl_tcp_syncookies = 0;
net->ipv4.sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
+   net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
 
return 0;
 fail:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ca25fdf0c525..6694e33149b9 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 
-int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
 int sysctl_tcp_orphan_retries __read_mostly;
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
@@ -171,7 +170,7 @@ static int tcp_write_timeout(struct sock *sk)
retry_until = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_syn_retries;
syn_set = true;
} else {
-   if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+   if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1, 0, 0)) {
/* Some middle-boxes may black-hole Fast Open _after_
 * the handshake. Therefore we conservatively disable
 * Fast Open on this path on recurring timeouts with
@@ -180,7 +179,7 @@ static int tcp_write_timeout(struct sock *sk)
if (tp->syn_data_acked &&
tp->bytes_acked <= tp->rx_opt.mss_clamp) {
tcp_fastopen_cache_set(sk, 0, NULL, true, 0);
-   if (icsk->icsk_retransmits == sysctl_tcp_retries1)
+   if (icsk->icsk_retransmits == net->ipv4.sysctl_tcp_retries1)
NET_INC_STATS_BH(sock_net(sk),
 
LINUX_MIB_TCPFASTOPENACTIVEFAIL);
}
@@ -359,6 +358,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
 void tcp_retransmit_timer(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
 
if (tp->fastopen_rsk) {
@@ -489,7 +489,7 @@ out_reset_timer:
icsk->icsk_rto = min(icsk->icsk_rto <&l

[RESEND PATCH 4/9] ipv4: Namespaceify tcp reordering sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  2 +-
 include/net/tcp.h  |  4 +++-
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp.c |  2 +-
 net/ipv4/tcp_input.c   | 12 ++--
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_metrics.c |  3 ++-
 7 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 80da0d095eaf..dff8879e02fe 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -97,8 +97,8 @@ struct netns_ipv4 {
 
int sysctl_tcp_syn_retries;
int sysctl_tcp_synack_retries;
-
int sysctl_tcp_syncookies;
+   int sysctl_tcp_reordering;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5497cc809601..64d01d289441 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -959,9 +959,11 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
  */
 static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
 {
+   struct net *net = sock_net((struct sock *)tp);
+
tp->do_early_retrans = sysctl_tcp_early_retrans &&
sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
-   sysctl_tcp_reordering == 3;
+   net->ipv4.sysctl_tcp_reordering == 3;
 }
 
 static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 007b9f8f7a2a..12d752e6380b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -457,13 +457,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec,
},
{
-   .procname   = "tcp_reordering",
-   .data   = &sysctl_tcp_reordering,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_max_reordering",
.data   = &sysctl_tcp_max_reordering,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec
},
 #endif
+   {
+   .procname   = "tcp_reordering",
+   .data   = &init_net.ipv4.sysctl_tcp_reordering,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bb36a39b5685..d0547395d81d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -405,7 +405,7 @@ void tcp_init_sock(struct sock *sk)
tp->mss_cache = TCP_MSS_DEFAULT;
u64_stats_init(&tp->syncp);
 
-   tp->reordering = sysctl_tcp_reordering;
+   tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
tcp_enable_early_retrans(tp);
tcp_assign_congestion_control(sk);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dc8fe6c8a2e0..3f08bba46147 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -80,9 +80,7 @@ int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
 int sysctl_tcp_sack __read_mostly = 1;
 int sysctl_tcp_fack __read_mostly = 1;
-int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
 int sysctl_tcp_max_reordering __read_mostly = 300;
-EXPORT_SYMBOL(sysctl_tcp_reordering);
 int sysctl_tcp_dsack __read_mostly = 1;
 int sysctl_tcp_app_win __read_mostly = 31;
 int sysctl_tcp_adv_win_scale __read_mostly = 1;
@@ -1873,6 +1871,7 @@ void tcp_enter_loss(struct sock *sk)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+   struct net *net = sock_net(sk);
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
@@ -1923,9 +1922,9 @@ void tcp_enter_loss(struct sock *sk)
 * suggests that the degree of reordering is over-estimated.
 */
if (icsk->icsk_ca_state <= TCP_CA_Disorder &&
-   tp->sacked_out >= sysctl_tcp_reordering)
+   tp->sacked_out >= net->ipv4.sysctl_tcp_reordering)
tp->reordering = min_t(unsigned int, tp->reordering,
-  sysctl_tcp_reordering);
+  net->ipv4.sysctl_tcp_reordering);
tcp_set_ca_state(sk, TCP_CA_Loss);
tp->high_seq = tp->snd_nxt;
tcp_ecn_queue_cwr(tp);
@@ -2109,6 +2108,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
__u32 packets_out;
+   int tcp_re

[RESEND PATCH 7/9] ipv4: Namespaceify tcp_orphan_retries sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  1 -
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  1 +
 net/ipv4/tcp_timer.c   |  3 +--
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3cb2073c55f5..6903335fbe3a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -101,6 +101,7 @@ struct netns_ipv4 {
int sysctl_tcp_reordering;
int sysctl_tcp_retries1;
int sysctl_tcp_retries2;
+   int sysctl_tcp_orphan_retries;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9b3aabbac85e..606a0a1a6d15 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 39c302fda534..e866e9fe6d84 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -421,13 +421,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec_jiffies,
},
{
-   .procname   = "tcp_orphan_retries",
-   .data   = &sysctl_tcp_orphan_retries,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fack",
.data   = &sysctl_tcp_fack,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_orphan_retries",
+   .data   = &init_net.ipv4.sysctl_tcp_orphan_retries,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3a2db4a7d651..fc4d4ee38012 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2387,6 +2387,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
+   net->ipv4.sysctl_tcp_orphan_retries = 0;
 
return 0;
 fail:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 09f4e0297e56..49bc474f8e35 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 
-int sysctl_tcp_orphan_retries __read_mostly;
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
 
 static void tcp_write_err(struct sock *sk)
@@ -78,7 +77,7 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 /* Calculate maximal number or retries on an orphaned socket. */
 static int tcp_orphan_retries(struct sock *sk, bool alive)
 {
-   int retries = sysctl_tcp_orphan_retries; /* May be zero. */
+   int retries = sock_net(sk)->ipv4.sysctl_tcp_orphan_retries; /* May be zero. */
 
/* We know from an ICMP that something is wrong. */
if (sk->sk_err_soft && !alive)
-- 
2.5.0



[RESEND PATCH 8/9] ipv4: Namespaceify tcp_fin_timeout sysctl knob

2016-02-02 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  3 +--
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp.c |  7 +++
 net/ipv4/tcp_ipv4.c|  1 +
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 6903335fbe3a..a1caddadecc2 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -102,6 +102,7 @@ struct netns_ipv4 {
int sysctl_tcp_retries1;
int sysctl_tcp_retries2;
int sysctl_tcp_orphan_retries;
+   int sysctl_tcp_fin_timeout;
 
struct ping_group_range ping_group_range;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 606a0a1a6d15..f8c3f75e6c99 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -239,7 +239,6 @@ extern struct inet_timewait_death_row tcp_death_row;
 extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
-extern int sysctl_tcp_fin_timeout;
 extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
@@ -1245,7 +1244,7 @@ static inline u32 keepalive_time_elapsed(const struct tcp_sock *tp)
 
 static inline int tcp_fin_time(const struct sock *sk)
 {
-   int fin_timeout = tcp_sk(sk)->linger2 ? : sysctl_tcp_fin_timeout;
+   int fin_timeout = tcp_sk(sk)->linger2 ? : sock_net(sk)->ipv4.sysctl_tcp_fin_timeout;
const int rto = inet_csk(sk)->icsk_rto;
 
if (fin_timeout < (rto << 2) - (rto >> 1))
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e866e9fe6d84..20e086f88438 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -321,13 +321,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_fin_timeout",
-   .data   = &sysctl_tcp_fin_timeout,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_jiffies,
-   },
-   {
.procname   = "tcp_fastopen",
.data   = &sysctl_tcp_fastopen,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_fin_timeout",
+   .data   = &init_net.ipv4.sysctl_tcp_fin_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_jiffies,
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d0547395d81d..ad903790c0a4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -281,8 +281,6 @@
 #include 
 #include 
 
-int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
-
 int sysctl_tcp_min_tso_segs __read_mostly = 2;
 
 int sysctl_tcp_autocorking __read_mostly = 1;
@@ -2324,6 +2322,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 {
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+   struct net *net = sock_net(sk);
int val;
int err = 0;
 
@@ -2520,7 +2519,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_LINGER2:
if (val < 0)
tp->linger2 = -1;
-   else if (val > sysctl_tcp_fin_timeout / HZ)
+   else if (val > net->ipv4.sysctl_tcp_fin_timeout / HZ)
tp->linger2 = 0;
else
tp->linger2 = val * HZ;
@@ -2762,7 +2761,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_LINGER2:
val = tp->linger2;
if (val >= 0)
-   val = (val ? : sysctl_tcp_fin_timeout) / HZ;
+   val = (val ? : net->ipv4.sysctl_tcp_fin_timeout) / HZ;
break;
case TCP_DEFER_ACCEPT:
val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fc4d4ee38012..3c263c00f5ea 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2388,6 +2388,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
net->ipv4.sysctl_tcp_orphan_retries = 0;
+   net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
 
return 0;
 fail:
-- 
2.5.0



[PATCH 1/3] ipv4: Namespaceify tcp_keepalive_time sysctl knob

2016-01-07 Thread Nikolay Borisov
Different net namespaces might have different requirements as to
the keepalive time of TCP sockets. This might be required in cases
where different firewall rules are in place which require TCP socket
timeouts to be increased/decreased independently of the host.
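
The lookup this patch changes can be illustrated with a minimal userspace
sketch. The types and names below (demo_netns, demo_sock,
demo_keepalive_time_when) are hypothetical stand-ins, not kernel
structures; the sketch only assumes the rule visible in
keepalive_time_when(): a per-socket value wins when it is set, otherwise
the default of the namespace owning the socket is used.

```c
/* Hypothetical stand-in for the per-namespace part of struct net. */
struct demo_netns {
	int sysctl_tcp_keepalive_time;	/* namespace-wide default */
};

/* Hypothetical stand-in for a TCP socket. */
struct demo_sock {
	const struct demo_netns *net;	/* namespace that owns the socket */
	int keepalive_time;		/* per-socket override, 0 means unset */
};

/*
 * Same shape as keepalive_time_when() after the patch: prefer the
 * per-socket value, fall back to the owning namespace's default.
 */
int demo_keepalive_time_when(const struct demo_sock *sk)
{
	return sk->keepalive_time ? sk->keepalive_time
				  : sk->net->sysctl_tcp_keepalive_time;
}
```

With two namespaces holding different defaults, two sockets without an
override resolve to different keepalive times, which is the behaviour the
commit message asks for.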

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/tcp.h  |  5 +++--
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp_ipv4.c|  2 ++
 net/ipv4/tcp_timer.c   |  1 -
 5 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c68926b4899c..d7ee5120e3ec 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -91,6 +91,8 @@ struct netns_ipv4 {
int sysctl_tcp_probe_threshold;
u32 sysctl_tcp_probe_interval;
 
+   int sysctl_tcp_keepalive_time;
+
struct ping_group_range ping_group_range;
 
atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f80e74c5ad18..1145f890f55c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -240,7 +240,6 @@ extern int sysctl_tcp_timestamps;
 extern int sysctl_tcp_window_scaling;
 extern int sysctl_tcp_sack;
 extern int sysctl_tcp_fin_timeout;
-extern int sysctl_tcp_keepalive_time;
 extern int sysctl_tcp_keepalive_probes;
 extern int sysctl_tcp_keepalive_intvl;
 extern int sysctl_tcp_syn_retries;
@@ -1228,7 +1227,9 @@ static inline int keepalive_intvl_when(const struct tcp_sock *tp)
 
 static inline int keepalive_time_when(const struct tcp_sock *tp)
 {
-   return tp->keepalive_time ? : sysctl_tcp_keepalive_time;
+   struct net *net = sock_net((struct sock *)tp);
+
+   return tp->keepalive_time ? : net->ipv4.sysctl_tcp_keepalive_time;
 }
 
 static inline int keepalive_probes(const struct tcp_sock *tp)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a0bd7a55193e..8755825b92a5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -337,13 +337,6 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_keepalive_time",
-   .data   = &sysctl_tcp_keepalive_time,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_jiffies,
-   },
-   {
.procname   = "tcp_keepalive_probes",
.data   = &sysctl_tcp_keepalive_probes,
.maxlen = sizeof(int),
@@ -950,6 +943,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_keepalive_time",
+   .data   = &init_net.ipv4.sysctl_tcp_keepalive_time,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_jiffies,
+   },
{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d8841a2f1569..ca8d98de7846 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2378,6 +2378,8 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
 
+   net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
+
return 0;
 fail:
tcp_sk_exit(net);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 193ba1fa8a9a..166f27b43cc0 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -24,7 +24,6 @@
 
 int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
 int sysctl_tcp_synack_retries __read_mostly = TCP_SYNACK_RETRIES;
-int sysctl_tcp_keepalive_time __read_mostly = TCP_KEEPALIVE_TIME;
 int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
 int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] netfilter: nfnetlink_queue: Unregister pernet subsys in case of init failure

2015-12-07 Thread Nikolay Borisov
Commit 3bfe049807c2403 ('netfilter: nfnetlink_{log,queue}:
Register pernet in first place') reorganised the initialisation
order of the pernet_subsys to avoid "use-before-initialised"
condition. However, in doing so the cleanup logic in nfnetlink_queue
got botched in that the pernet_subsys wasn't cleaned in case
nfnetlink_subsys_register failed. This patch adds the necessary
cleanup routine call.
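
The unwinding order described above can be sketched in plain C. This is
an illustration only: demo_state and demo_init are hypothetical names,
and the boolean flags merely stand in for the pernet subsystem, the
netlink notifier and the nfnetlink subsystem registrations.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for the three registration steps. */
struct demo_state {
	bool pernet_registered;
	bool notifier_registered;
	bool subsys_registered;
};

/*
 * Registers in order (pernet, notifier, subsystem) and, when the last
 * step fails, undoes the earlier steps in reverse order.
 */
int demo_init(struct demo_state *st, bool subsys_fails)
{
	int status = 0;

	st->pernet_registered = true;		/* step 1 */
	st->notifier_registered = true;		/* step 2 */

	if (subsys_fails) {			/* step 3 fails */
		status = -1;
		goto cleanup_notifier;
	}
	st->subsys_registered = true;
	return status;

cleanup_notifier:
	st->notifier_registered = false;	/* undo step 2 */
	st->pernet_registered = false;		/* undo step 1, the missing call */
	return status;
}
```

Without the last cleanup line, a failed init would leave step 1
registered, which corresponds to the leak this patch fixes.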

Fixes: 3bfe049807c2403 ('netfilter: nfnetlink_{log,queue}: Register
pernet in first place')

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 net/netfilter/nfnetlink_queue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 7d81d280cb4f..2e94603c2dec 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -1417,6 +1417,7 @@ static int __init nfnetlink_queue_init(void)
 
 cleanup_netlink_notifier:
netlink_unregister_notifier(&nfqnl_rtnl_notifier);
+   unregister_pernet_subsys(&nfnl_queue_net_ops);
 out:
return status;
 }
-- 
2.5.0



Re: [PATCH] netfilter: nfnetlink_queue: Unregister pernet subsys in case of init failure

2015-12-07 Thread Nikolay Borisov


On 12/07/2015 02:29 PM, Sergei Shtylyov wrote:
> Hello.
> 
> On 12/07/2015 01:13 PM, Nikolay Borisov wrote:
> 
>> Commit 3bfe049807c2403 ('netfilter: nfnetlink_{log,queue}:
> 
>Double quotes please, that's what scripts/checkpatch.pl enforces now.
> 
>> Register pernet in first place') reorganised the initialisation
>> order of the pernet_subsys to avoid "use-before-initialised"
>> condition. However, in doing so the cleanup logic in nfnetlink_queue
>> got botched in that the pernet_subsys wasn't cleaned in case
>> nfnetlink_subsys_register failed. This patch adds the necessary
>> cleanup routine call.
>>
>> Fixes: 3bfe049807c2403 ('netfilter: nfnetlink_{log,queue}: Register
>> pernet in first place')
> 
>Likewise.

I will resend it with proper quotes (even though I think this is a minor
issue) but I'd like to first gather some review feedback.

Also I dunno if this should be marked for stable.

> 
>>
>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
> 
> [...]
> 
> MBR, Sergei
> 


[BUG] Erroneous behavior in try_to_coalesce

2015-10-28 Thread Nikolay Borisov
Hello,

Recently I observed 2 crashes on one of my servers with the following backtraces:

[22751.889645] ------------[ cut here ]------------
[22751.889660] WARNING: CPU: 38 PID: 12807 at net/core/skbuff.c:3498
skb_try_coalesce+0x34b/0x360()
[22751.889661] Modules linked in: tcp_diag inet_diag xt_LOG xt_limit
xt_addrtype xt_multiport xt_pkttype xt_conntrack netconsole act_police
cls_basic sch_ingress veth ipv6 openvswitch gre vxlan ip_tunnel
xt_owner xt_state iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ext2
dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror
dm_region_hash dm_log ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit
ioapic ses enclosure ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler
aacraid
[22751.889704] CPU: 38 PID: 12807 Comm: handler22 Not tainted
3.12.49-clouder2 #2
[22751.889706] Hardware name: Supermicro
PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[22751.889708]  0daa 883fff4839e8 81643c91
0daa
[22751.889716]   883fff483a28 81089acc
883fff483b68
[22751.889721]  8832bd282b00 882e6b0190e8 883fff483aa4
05b4
[22751.889726] Call Trace:
[22751.889728][] dump_stack+0x58/0x7f
[22751.889739]  [] warn_slowpath_common+0x8c/0xc0
[22751.889742]  [] warn_slowpath_null+0x1a/0x20
[22751.889745]  [] skb_try_coalesce+0x34b/0x360
[22751.889752]  [] tcp_try_coalesce+0x69/0xc0
[22751.889755]  [] tcp_queue_rcv+0x53/0x130
[22751.889758]  [] tcp_data_queue+0x1d3/0xd40
[22751.889761]  [] tcp_rcv_established+0x319/0x5e0
[22751.889767]  [] ? nf_nat_ipv4_fn+0x1e1/0x270 [iptable_nat]
[22751.889771]  [] tcp_v4_do_rcv+0x152/0x3d0
[22751.889777]  [] ? security_sock_rcv_skb+0x16/0x20
[22751.889781]  [] ? sk_filter+0x37/0xf0
[22751.889784]  [] tcp_v4_rcv+0x6b7/0x730
[22751.889787]  [] ? ip_rcv+0x3a0/0x3a0
[22751.889791]  [] ? nf_hook_slow+0x85/0x130
[22751.889794]  [] ? ip_rcv+0x3a0/0x3a0
[22751.889796]  [] ip_local_deliver_finish+0xc2/0x250
[22751.889799]  [] ip_local_deliver+0x88/0x90
[22751.889802]  [] ip_rcv_finish+0x119/0x380
[22751.889804]  [] ip_rcv+0x2c5/0x3a0
[22751.889809]  [] ? netdev_frame_hook+0xb5/0x130
[openvswitch]
[22751.889815]  [] __netif_receive_skb_core+0x626/0x7e0
[22751.889818]  [] __netif_receive_skb+0x27/0x70
[22751.889820]  [] process_backlog+0xd9/0x1e0
[22751.889823]  [] net_rx_action+0x12c/0x280
[22751.889828]  [] __do_softirq+0x137/0x2e0
[22751.889832]  [] call_softirq+0x1c/0x30
[22751.889833][] do_softirq+0x8d/0xc0
[22751.889843]  [] ?
ovs_packet_cmd_execute+0x217/0x250 [openvswitch]
[22751.889846]  [] local_bh_enable+0xdb/0xf0
[22751.889849]  []
ovs_packet_cmd_execute+0x217/0x250 [openvswitch]
[22751.889853]  [] genl_family_rcv_msg+0x221/0x390
[22751.889856]  [] ? genl_family_rcv_msg+0x390/0x390
[22751.889858]  [] genl_rcv_msg+0x63/0xb0
[22751.889861]  [] netlink_rcv_skb+0xa9/0xd0
[22751.889864]  [] genl_rcv+0x2c/0x40
[22751.889867]  [] netlink_unicast+0x10f/0x190
[22751.889869]  [] netlink_sendmsg+0x2bb/0x650
[22751.889874]  [] ? __pollwait+0xf0/0xf0
[22751.889881]  [] sock_sendmsg+0x90/0xc0
[22751.889883]  [] ? __pollwait+0xf0/0xf0
[22751.889887]  [] ? local_bh_enable_ip+0x87/0xf0
[22751.889890]  [] ? _raw_spin_unlock_bh+0x24/0x30
[22751.889894]  [] ? verify_iovec+0x8d/0x110
[22751.889898]  [] ___sys_sendmsg+0x417/0x440
[22751.889904]  [] ? ep_poll+0x144/0x370

And then later the actual crash occurred:

[44923.628546] BUG: unable to handle kernel paging request at 00820299
[44923.629139] IP: [] kfree_skb_list+0x18/0x30
[44923.629463] PGD 35cc3b5067 PUD 0
[44923.629823] Oops:  [#1] SMP
[44923.630182] Modules linked in: tcp_diag inet_diag xt_LOG xt_limit
xt_addrtype xt_multiport xt_pkttype xt_conntrack netconsole act_police
cls_basic sch_ingress veth ipv6 openvswitch gre vxlan ip_tunnel
xt_owner xt_state iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ext2
dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror
dm_region_hash dm_log ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit
ioapic ses enclosure ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler
aacraid
[44923.634368] CPU: 10 PID: 39391 Comm: kworker/u80:0 Tainted: G
 W3.12.49-clouder2 #2
[44923.634851] Hardware name: Supermicro
PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[44923.635340] Workqueue: dm-thin do_worker [dm_thin_pool]
[44923.635653] task: 881918cb0810 ti: 880d5a4ea000 task.ti:
880d5a4ea000
[44923.635926] RIP: 0010:[]  []
kfree_skb_list+0x18/0x30
[44923.636251] RSP: 0018:883fff003cd0  EFLAGS: 00010206
[44923.636521] RAX:  RBX: 882e5622be00 RCX: 883fd12b9800
[44923.636791] RDX: 0100 RSI: 0040 RDI: 00820299
[44923.637064] RBP: 883fff003ce0 R08: 00dc R09: 0003
[44923.637336] R10: 0003 R11: 

[PATCH v3] netfilter: ipset: Fix sleeping memory allocation in atomic context

2015-10-16 Thread Nikolay Borisov
Commit 00590fdd5be0 introduced RCU locking in list type and in
doing so introduced a memory allocation in list_set_uadd, which
is done in an atomic context, because ipset RCU list
modifications are serialised with a spin lock. The reason
why we can't use a mutex is that in addition to being modified
by ipset commands, the list is also modified when a particular
ipset rule timeout expires, i.e. during garbage collection.
This gc is triggered from set_cleanup_entries, which in turn
is invoked from a timer, thus requiring the lock to be bh-safe.

Concretely, the following call chain can lead to a "sleeping function
called in atomic context" splat:
call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
GFP_KERNEL allows initiating direct reclaim, so the allocation
path can potentially sleep.

To fix the issue, change the allocation type to GFP_ATOMIC, to
correctly reflect that it is occurring in an atomic context.

Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type")

Acked-by: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---

Changes since v2:
 * Massaged the changelog to reflect discussion 
 on the mailing list

Changes since v1: 
 * Added acked-by 
 * Fixed patch header

 

 net/netfilter/ipset/ip_set_list_set.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_list_set.c 
b/net/netfilter/ipset/ip_set_list_set.c
index a1fe537..5a30ce6 100644
--- a/net/netfilter/ipset/ip_set_list_set.c
+++ b/net/netfilter/ipset/ip_set_list_set.c
@@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct 
ip_set_ext *ext,
  ip_set_timeout_expired(ext_timeout(n, set
n =  NULL;
 
-   e = kzalloc(set->dsize, GFP_KERNEL);
+   e = kzalloc(set->dsize, GFP_ATOMIC);
if (!e)
return -ENOMEM;
e->id = d->id;
-- 
2.5.0
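
The constraint described in the changelog above (an allocation made while a spin lock is held must not sleep) can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: mock_spin_lock_bh, mock_zalloc, mock_list_add and the alloc_mode enum are all hypothetical names, and the "splat" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocation modes, loosely modelled on GFP_KERNEL/GFP_ATOMIC. */
enum alloc_mode { ALLOC_MAY_SLEEP, ALLOC_NO_SLEEP };

static int atomic_depth;     /* > 0 while a (mock) spin lock is held */
static int sleep_violations; /* counts "sleeping in atomic context" events */

static void mock_spin_lock_bh(void)   { atomic_depth++; }
static void mock_spin_unlock_bh(void) { atomic_depth--; }

/* Mock allocator: records a violation instead of printing a splat. */
static void *mock_zalloc(size_t size, enum alloc_mode mode)
{
	if (mode == ALLOC_MAY_SLEEP && atomic_depth > 0)
		sleep_violations++;
	return calloc(1, size);
}

/* Models an "add element" step that runs with the set lock held. */
static int mock_list_add(size_t dsize, enum alloc_mode mode)
{
	void *e;
	int ret = 0;

	mock_spin_lock_bh();
	e = mock_zalloc(dsize, mode);
	if (!e)
		ret = -1;
	mock_spin_unlock_bh();

	free(e); /* the sketch does not keep the element */
	return ret;
}
```

Calling mock_list_add() with ALLOC_MAY_SLEEP bumps sleep_violations, while ALLOC_NO_SLEEP does not, which mirrors the GFP_KERNEL to GFP_ATOMIC change in the diff.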



[PATCH v2] netfilter: ipset: Fix sleeping memory allocation in atomic context

2015-10-15 Thread Nikolay Borisov
Commit 00590fdd5be0 introduced RCU locking in list type and in
doing so introduced a memory allocation in list_set_uadd, which
results in the following splat:

BUG: sleeping function called from invalid context at mm/page_alloc.c:2759
in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset
CPU: 18 PID: 9664 Comm: ipset Tainted: G   O 3.12.47-clouder3 #1
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
 0002 881fd14273c8 8163d891 881fcb4264b0
 881fcb4260c0 881fd14273e8 810ba5bf 881fd1427558
  881fd1427568 81142b33 881f
Call Trace:
 [] dump_stack+0x58/0x7f
 [] __might_sleep+0xdf/0x110
 [] __alloc_pages_nodemask+0x243/0xc20
 [] alloc_pages_current+0xbe/0x170
 [] new_slab+0x295/0x340
 [] __slab_alloc+0x2c0/0x5a0
 [] ? __schedule+0x2dc/0x760
 [] __kmalloc+0x11b/0x230
 [] ? ip_set_get_byname+0xec/0x100 [ip_set]
 [] list_set_uadd+0x16b/0x314 [ip_set_list_set]
 [] ? _raw_write_unlock_bh+0x28/0x30
 [] list_set_uadt+0x21c/0x320 [ip_set_list_set]
 [] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set]
 [] call_ad+0x82/0x200 [ip_set]
 [] ? find_set_type+0x51/0xa0 [ip_set]
 [] ? nla_parse+0xf5/0x130
 [] ip_set_uadd+0x20e/0x2d0 [ip_set]
 [] ? ip_set_create+0x2a3/0x450 [ip_set]
 [] ? ip_set_udel+0x2e0/0x2e0 [ip_set]
 [] nfnetlink_rcv_msg+0x31e/0x330
 [] ? nfnetlink_rcv_msg+0x41/0x330
 [] ? nfnl_lock+0x30/0x30
 [] netlink_rcv_skb+0xa9/0xd0
 [] nfnetlink_rcv+0x15/0x20
 [] netlink_unicast+0x10f/0x190
 [] netlink_sendmsg+0x2c0/0x660
 [] sock_sendmsg+0x90/0xc0
 [] ? move_addr_to_user+0xa3/0xc0
 [] ? ___sys_recvmsg+0x182/0x300
 [] SYSC_sendto+0x134/0x180
 [] ? mntput+0x21/0x30
 [] ? __kfree_skb+0x3f/0xa0
 [] SyS_sendto+0xe/0x10
 [] system_call_fastpath+0x16/0x1b

The call chain leading to this is as follows:
call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
Since GFP_KERNEL allows initiating direct reclaim, and thus
potentially sleeping in the allocation path, this leads to the
aforementioned splat.

To fix it, change the allocation type to GFP_ATOMIC, to
correctly reflect that it is occurring in an atomic context.

Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type")

Acked-by: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---

Changes since V1: 
 * Added acked-by 
 * Fixed patch header 

 net/netfilter/ipset/ip_set_list_set.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_list_set.c 
b/net/netfilter/ipset/ip_set_list_set.c
index a1fe537..5a30ce6 100644
--- a/net/netfilter/ipset/ip_set_list_set.c
+++ b/net/netfilter/ipset/ip_set_list_set.c
@@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct 
ip_set_ext *ext,
  ip_set_timeout_expired(ext_timeout(n, set
n =  NULL;
 
-   e = kzalloc(set->dsize, GFP_KERNEL);
+   e = kzalloc(set->dsize, GFP_ATOMIC);
if (!e)
return -ENOMEM;
e->id = d->id;
-- 
2.5.0



[PATCH] Fix sleeping memory allocation in atomic context

2015-10-15 Thread Nikolay Borisov
Ipset 6.26 produces the following splat:

BUG: sleeping function called from invalid context at mm/page_alloc.c:2759
in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset
CPU: 18 PID: 9664 Comm: ipset Tainted: G   O 3.12.47-clouder3 #1
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
 0002 881fd14273c8 8163d891 881fcb4264b0
 881fcb4260c0 881fd14273e8 810ba5bf 881fd1427558
  881fd1427568 81142b33 881f
Call Trace:
 [] dump_stack+0x58/0x7f
 [] __might_sleep+0xdf/0x110
 [] __alloc_pages_nodemask+0x243/0xc20
 [] alloc_pages_current+0xbe/0x170
 [] new_slab+0x295/0x340
 [] __slab_alloc+0x2c0/0x5a0
 [] ? __schedule+0x2dc/0x760
 [] __kmalloc+0x11b/0x230
 [] ? ip_set_get_byname+0xec/0x100 [ip_set]
 [] list_set_uadd+0x16b/0x314 [ip_set_list_set]
 [] ? _raw_write_unlock_bh+0x28/0x30
 [] list_set_uadt+0x21c/0x320 [ip_set_list_set]
 [] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set]
 [] call_ad+0x82/0x200 [ip_set]
 [] ? find_set_type+0x51/0xa0 [ip_set]
 [] ? nla_parse+0xf5/0x130
 [] ip_set_uadd+0x20e/0x2d0 [ip_set]
 [] ? ip_set_create+0x2a3/0x450 [ip_set]
 [] ? ip_set_udel+0x2e0/0x2e0 [ip_set]
 [] nfnetlink_rcv_msg+0x31e/0x330
 [] ? nfnetlink_rcv_msg+0x41/0x330
 [] ? nfnl_lock+0x30/0x30
 [] netlink_rcv_skb+0xa9/0xd0
 [] nfnetlink_rcv+0x15/0x20
 [] netlink_unicast+0x10f/0x190
 [] netlink_sendmsg+0x2c0/0x660
 [] sock_sendmsg+0x90/0xc0
 [] ? move_addr_to_user+0xa3/0xc0
 [] ? ___sys_recvmsg+0x182/0x300
 [] SYSC_sendto+0x134/0x180
 [] ? mntput+0x21/0x30
 [] ? __kfree_skb+0x3f/0xa0
 [] SyS_sendto+0xe/0x10
 [] system_call_fastpath+0x16/0x1b

The call chain leading to this is as follows:
call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
Since GFP_KERNEL allows initiating direct reclaim, and thus
potentially sleeping in the allocation path, this leads to the
aforementioned splat.

To fix it, change that particular allocation type to GFP_ATOMIC, to
correctly reflect that it is happening in an atomic context.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---

Even though this patch has been generated against the stand-alone
ipset sources I just checked the 4.3-rc4 sources and the problem
exists there as well.

 kernel/net/netfilter/ipset/ip_set_list_set.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/net/netfilter/ipset/ip_set_list_set.c 
b/kernel/net/netfilter/ipset/ip_set_list_set.c
index b11ba96..0f9195f 100644
--- a/kernel/net/netfilter/ipset/ip_set_list_set.c
+++ b/kernel/net/netfilter/ipset/ip_set_list_set.c
@@ -298,7 +298,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct 
ip_set_ext *ext,
  ip_set_timeout_expired(ext_timeout(n, set
n =  NULL;
 
-   e = kzalloc(set->dsize, GFP_KERNEL);
+   e = kzalloc(set->dsize, GFP_ATOMIC);
if (!e)
return -ENOMEM;
e->id = d->id;
-- 
2.5.0



Re: [PATCH v2] netfilter: ipset: Fix sleeping memory allocation in atomic context

2015-10-15 Thread Nikolay Borisov


On 10/15/2015 04:32 PM, Eric Dumazet wrote:
> On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote:
>> Commit 00590fdd5be0 introduced RCU locking in list type and in
>> doing so introduced a memory allocation in list_set_add, which
>> results in the following splat:
>>
>> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759
>> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset
>> CPU: 18 PID: 9664 Comm: ipset Tainted: G   O 3.12.47-clouder3 #1
>> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
>>  0002 881fd14273c8 8163d891 881fcb4264b0
>>  881fcb4260c0 881fd14273e8 810ba5bf 881fd1427558
>>   881fd1427568 81142b33 881f
>> Call Trace:
>>  [] dump_stack+0x58/0x7f
>>  [] __might_sleep+0xdf/0x110
>>  [] __alloc_pages_nodemask+0x243/0xc20
>>  [] alloc_pages_current+0xbe/0x170
>>  [] new_slab+0x295/0x340
>>  [] __slab_alloc+0x2c0/0x5a0
>>  [] ? __schedule+0x2dc/0x760
>>  [] __kmalloc+0x11b/0x230
>>  [] ? ip_set_get_byname+0xec/0x100 [ip_set]
>>  [] list_set_uadd+0x16b/0x314 [ip_set_list_set]
>>  [] ? _raw_write_unlock_bh+0x28/0x30
>>  [] list_set_uadt+0x21c/0x320 [ip_set_list_set]
>>  [] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set]
>>  [] call_ad+0x82/0x200 [ip_set]
>>  [] ? find_set_type+0x51/0xa0 [ip_set]
>>  [] ? nla_parse+0xf5/0x130
>>  [] ip_set_uadd+0x20e/0x2d0 [ip_set]
>>  [] ? ip_set_create+0x2a3/0x450 [ip_set]
>>  [] ? ip_set_udel+0x2e0/0x2e0 [ip_set]
>>  [] nfnetlink_rcv_msg+0x31e/0x330
>>  [] ? nfnetlink_rcv_msg+0x41/0x330
>>  [] ? nfnl_lock+0x30/0x30
>>  [] netlink_rcv_skb+0xa9/0xd0
>>  [] nfnetlink_rcv+0x15/0x20
>>  [] netlink_unicast+0x10f/0x190
>>  [] netlink_sendmsg+0x2c0/0x660
>>  [] sock_sendmsg+0x90/0xc0
>>  [] ? move_addr_to_user+0xa3/0xc0
>>  [] ? ___sys_recvmsg+0x182/0x300
>>  [] SYSC_sendto+0x134/0x180
>>  [] ? mntput+0x21/0x30
>>  [] ? __kfree_skb+0x3f/0xa0
>>  [] SyS_sendto+0xe/0x10
>>  [] system_call_fastpath+0x16/0x1b
>>
>> The call chain leading to this is as follows:
>> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
>> Since GFP_KERNEL allows initiating direct reclaim, and thus
>> potentially sleeping in the allocation path, this leads to the
>> aforementioned splat.
>>
>> To fix it, change the allocation type to GFP_ATOMIC, to
>> correctly reflect that it is occurring in an atomic context.
>>
>> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type")
>>
>> Acked-by: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
>> ---
>>
>> Changes since V1: 
>>  * Added acked-by 
>>  * Fixed patch header 
>>
>>  net/netfilter/ipset/ip_set_list_set.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/ipset/ip_set_list_set.c 
>> b/net/netfilter/ipset/ip_set_list_set.c
>> index a1fe537..5a30ce6 100644
>> --- a/net/netfilter/ipset/ip_set_list_set.c
>> +++ b/net/netfilter/ipset/ip_set_list_set.c
>> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const 
>> struct ip_set_ext *ext,
>>ip_set_timeout_expired(ext_timeout(n, set
>>  n =  NULL;
>>  
>> -e = kzalloc(set->dsize, GFP_KERNEL);
>> +e = kzalloc(set->dsize, GFP_ATOMIC);
>>  if (!e)
>>  return -ENOMEM;
>>  e->id = d->id;
> 
> This patch looks very bogus to me.
> 
> Could we fix the root cause please ?
> 
> Root cause is that somewhere in this controlling path, an erroneous
> rcu_read_lock() is used, while it is very probably not needed, as
> controlling path should be protected by a mutex, which definitely is
> sane, because it allows us to perform GFP_KERNEL allocations and being
> preempted.
> 
> Why are we using rcu_read_lock() in list_set_list() ?
> 
> This looks as yet another bit of 'let us throw
> rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it
> feels so good.

I did check the call paths and there isn't an rcu_read_lock called in
list_set_uadt/list_set_uadd. On the contrary, this "write" operation to
the list is being serialised in call_ad() via set->lock spin_lock.

What am I missing here?


> 
> 
> 
> 


Re: [PATCH v2] netfilter: ipset: Fix sleeping memory allocation in atomic context

2015-10-15 Thread Nikolay Borisov


On 10/15/2015 05:32 PM, Eric Dumazet wrote:
> On Thu, 2015-10-15 at 16:41 +0300, Nikolay Borisov wrote:
>>
>> On 10/15/2015 04:32 PM, Eric Dumazet wrote:
>>> On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote:
>>>> Commit 00590fdd5be0 introduced RCU locking in list type and in
>>>> doing so introduced a memory allocation in list_set_add, which
>>>> results in the following splat:
>>>>
>>>> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759
>>>> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset
>>>> CPU: 18 PID: 9664 Comm: ipset Tainted: G   O 3.12.47-clouder3 #1
>>>> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
>>>>  0002 881fd14273c8 8163d891 881fcb4264b0
>>>>  881fcb4260c0 881fd14273e8 810ba5bf 881fd1427558
>>>>   881fd1427568 81142b33 881f
>>>> Call Trace:
>>>>  [] dump_stack+0x58/0x7f
>>>>  [] __might_sleep+0xdf/0x110
>>>>  [] __alloc_pages_nodemask+0x243/0xc20
>>>>  [] alloc_pages_current+0xbe/0x170
>>>>  [] new_slab+0x295/0x340
>>>>  [] __slab_alloc+0x2c0/0x5a0
>>>>  [] ? __schedule+0x2dc/0x760
>>>>  [] __kmalloc+0x11b/0x230
>>>>  [] ? ip_set_get_byname+0xec/0x100 [ip_set]
>>>>  [] list_set_uadd+0x16b/0x314 [ip_set_list_set]
>>>>  [] ? _raw_write_unlock_bh+0x28/0x30
>>>>  [] list_set_uadt+0x21c/0x320 [ip_set_list_set]
>>>>  [] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set]
>>>>  [] call_ad+0x82/0x200 [ip_set]
>>>>  [] ? find_set_type+0x51/0xa0 [ip_set]
>>>>  [] ? nla_parse+0xf5/0x130
>>>>  [] ip_set_uadd+0x20e/0x2d0 [ip_set]
>>>>  [] ? ip_set_create+0x2a3/0x450 [ip_set]
>>>>  [] ? ip_set_udel+0x2e0/0x2e0 [ip_set]
>>>>  [] nfnetlink_rcv_msg+0x31e/0x330
>>>>  [] ? nfnetlink_rcv_msg+0x41/0x330
>>>>  [] ? nfnl_lock+0x30/0x30
>>>>  [] netlink_rcv_skb+0xa9/0xd0
>>>>  [] nfnetlink_rcv+0x15/0x20
>>>>  [] netlink_unicast+0x10f/0x190
>>>>  [] netlink_sendmsg+0x2c0/0x660
>>>>  [] sock_sendmsg+0x90/0xc0
>>>>  [] ? move_addr_to_user+0xa3/0xc0
>>>>  [] ? ___sys_recvmsg+0x182/0x300
>>>>  [] SYSC_sendto+0x134/0x180
>>>>  [] ? mntput+0x21/0x30
>>>>  [] ? __kfree_skb+0x3f/0xa0
>>>>  [] SyS_sendto+0xe/0x10
>>>>  [] system_call_fastpath+0x16/0x1b
>>>>
>>>> The call chain leading to this is as follows:
>>>> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
>>>> Since GFP_KERNEL allows initiating direct reclaim, and thus
>>>> potentially sleeping in the allocation path, this leads to the
>>>> aforementioned splat.
>>>>
>>>> To fix it, change the allocation type to GFP_ATOMIC, to
>>>> correctly reflect that it is occurring in an atomic context.
>>>>
>>>> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list 
>>>> type")
>>>>
>>>> Acked-by: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
>>>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
>>>> ---
>>>>
>>>> Changes since V1: 
>>>>  * Added acked-by 
>>>>  * Fixed patch header 
>>>>
>>>>  net/netfilter/ipset/ip_set_list_set.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/netfilter/ipset/ip_set_list_set.c 
>>>> b/net/netfilter/ipset/ip_set_list_set.c
>>>> index a1fe537..5a30ce6 100644
>>>> --- a/net/netfilter/ipset/ip_set_list_set.c
>>>> +++ b/net/netfilter/ipset/ip_set_list_set.c
>>>> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const 
>>>> struct ip_set_ext *ext,
>>>>  ip_set_timeout_expired(ext_timeout(n, set
>>>>n =  NULL;
>>>>  
>>>> -  e = kzalloc(set->dsize, GFP_KERNEL);
>>>> +  e = kzalloc(set->dsize, GFP_ATOMIC);
>>>>if (!e)
>>>>return -ENOMEM;
>>>>e->id = d->id;
>>>
>>> This patch looks very bogus to me.
>>>
>>> Could we fix the root cause please ?
>>>
>>> Root cause is that somewhere in this controlling path, an erroneous
>>> rcu_read_lock() is used, 

Re: [PATCH v2] netfilter: ipset: Fix sleeping memory allocation in atomic context

2015-10-15 Thread Nikolay Borisov
On Thu, Oct 15, 2015 at 9:46 PM, Eric Dumazet  wrote:
> On Thu, 2015-10-15 at 20:25 +0200, Jozsef Kadlecsik wrote:
>
>> Nikolay answered this pretty well: we wouldn't need the spinlock at all,
>> because all commands are serialized anyway with the netlink mutex. But the
>> garbage collector is called by a timer and therefore spinlock is used.
>>
>
> Good, please Nikolay, send a v2 of the patch with all these details
> explained in the changelog, so that we can all agree.

While GFP_ATOMIC does indeed look like the correct solution for this particular
case, I was wondering whether something like (GFP_KERNEL & ~__GFP_WAIT)
would also make the cut without causing sleeping. I guess this is exactly
the sort of situation that Mel Gorman's patch can address
(marc.info/?l=linux-kernel=144283282101953) ?

In any case I will send v2 tomorrow.

>
> If properly explained, no need to add the stack trace which does not
> really tell us the story.
>
> Thanks !
>
>
>
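
The flag question raised in the reply above can be worked through with plain bit arithmetic. The sketch below is illustrative only: the X_-prefixed macros are hypothetical user-space stand-ins, and the bit values and compositions are assumptions modelled on the include/linux/gfp.h of that era (GFP_KERNEL as wait, I/O and FS bits; GFP_ATOMIC as the "high priority" bit), so the header of the kernel at hand remains the reference. It shows that masking out the wait bit removes the ability to sleep but, unlike GFP_ATOMIC, does not grant access to emergency reserves.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values (assumption); the real ones live in gfp.h. */
#define X_GFP_WAIT 0x10u /* allocator may sleep / enter direct reclaim */
#define X_GFP_HIGH 0x20u /* may dip into emergency reserves */
#define X_GFP_IO   0x40u /* may start physical I/O */
#define X_GFP_FS   0x80u /* may call into the filesystem */

/* Assumed compositions, mirroring the pre-4.4 definitions. */
#define X_GFP_KERNEL (X_GFP_WAIT | X_GFP_IO | X_GFP_FS)
#define X_GFP_ATOMIC (X_GFP_HIGH)

/* True if an allocation with these flags is allowed to sleep. */
static bool may_sleep(unsigned int flags)
{
	return (flags & X_GFP_WAIT) != 0;
}

/* True if an allocation with these flags may use emergency reserves. */
static bool may_use_reserves(unsigned int flags)
{
	return (flags & X_GFP_HIGH) != 0;
}
```

Under these assumptions, may_sleep(X_GFP_KERNEL & ~X_GFP_WAIT) is false, so the masked variant would not sleep, but may_use_reserves() is also false for it, which is one practical difference from X_GFP_ATOMIC.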