Re: [patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive
On Tue, 2018-01-30 at 11:30 -0800, a...@linux-foundation.org wrote: > From: Michal Hocko > Subject: net/netfilter/x_tables.c: make allocation less aggressive > > syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. > This is an admin only interface but an admin in a namespace is sufficient > as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in > xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc > fallback into kvmalloc. It has dropped __GFP_NORETRY on the way because > vmalloc has simply never fully supported __GFP_NORETRY semantic. This is > still the case because e.g. page tables backing the vmalloc area are > hardcoded GFP_KERNEL. > > Revert back to __GFP_NORETRY as a poors man defence against excessively > large allocation request here. We will not rule out the OOM killer > completely but __GFP_NORETRY should at least stop the large request in > most cases. > > [a...@linux-foundation.org: coding-style fixes] > Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in > xt_alloc_tableLink: > http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz > Signed-off-by: Michal Hocko > Acked-by: Florian Westphal > Reviewed-by: Andrew Morton > Cc: David S. Miller > Signed-off-by: Andrew Morton > --- > > net/netfilter/x_tables.c |7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff -puN > net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive > net/netfilter/x_tables.c > --- > a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive > +++ a/net/netfilter/x_tables.c > @@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf > if ((size >> PAGE_SHIFT) + 2 > totalram_pages) > return NULL; > > - info = kvmalloc(sz, GFP_KERNEL); > + /* __GFP_NORETRY is not fully supported by kvmalloc but it should > + * work reasonably well if sz is too large and bail out rather > + * than shoot all processes down before realizing there is nothing > + * more to reclaim. > + */ > + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > if (!info) > return NULL; How is __GFP_NORETRY working exactly ? Surely, if some firewall tools attempt to load a new iptables rules, we do not want to abort them if the request can be satisfied after few pages moved on swap or written back to disk. We want to avoid huge allocations, but leave reasonable ones succeed. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive
From: Michal Hocko Subject: net/netfilter/x_tables.c: make allocation less aggressive syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. This is an admin only interface but an admin in a namespace is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on the way because vmalloc has simply never fully supported __GFP_NORETRY semantic. This is still the case because e.g. page tables backing the vmalloc area are hardcoded GFP_KERNEL. Revert back to __GFP_NORETRY as a poors man defence against excessively large allocation request here. We will not rule out the OOM killer completely but __GFP_NORETRY should at least stop the large request in most cases. [a...@linux-foundation.org: coding-style fixes] Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_tableLink: http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz Signed-off-by: Michal Hocko Acked-by: Florian Westphal Reviewed-by: Andrew Morton Cc: David S. Miller Signed-off-by: Andrew Morton --- net/netfilter/x_tables.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive net/netfilter/x_tables.c --- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive +++ a/net/netfilter/x_tables.c @@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf if ((size >> PAGE_SHIFT) + 2 > totalram_pages) return NULL; - info = kvmalloc(sz, GFP_KERNEL); + /* __GFP_NORETRY is not fully supported by kvmalloc but it should +* work reasonably well if sz is too large and bail out rather +* than shoot all processes down before realizing there is nothing +* more to reclaim. +*/ + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); if (!info) return NULL; _ -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko wrote: > > Well, this is not about syzkaller, it merely pointed out a potential > > DoS... And that has to be addressed somehow. > > So how about this? > --- argh ;) > >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 30 Jan 2018 14:51:07 +0100 > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive > > syzbot has noticed that xt_alloc_table_info can allocate a lot of > memory. This is an admin only interface but an admin in a namespace > is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use > kvmalloc() in xt_alloc_table_info()") has changed the opencoded > kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on > the way because vmalloc has simply never fully supported __GFP_NORETRY > semantic. This is still the case because e.g. page tables backing the > vmalloc area are hardcoded GFP_KERNEL. > > Revert back to __GFP_NORETRY as a poors man defence against excessively > large allocation request here. We will not rule out the OOM killer > completely but __GFP_NORETRY should at least stop the large request > in most cases. > > Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in > xt_alloc_table_info()") > Signed-off-by: Michal Hocko > --- > net/netfilter/x_tables.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index d8571f414208..a5f5c29bcbdc 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int > size) > if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages) > return NULL; offtopic: preceding comment here is "prevent them from hitting BUG() in vmalloc.c". I suspect this is ancient code and vmalloc sure as heck shouldn't go BUG with this input. And it should be using `sz' ;) So I suspect and hope that this code can be removed. If not, let's fix vmalloc! > - info = kvmalloc(sz, GFP_KERNEL); > + /* > + * __GFP_NORETRY is not fully supported by kvmalloc but it should > + * work reasonably well if sz is too large and bail out rather > + * than shoot all processes down before realizing there is nothing > + * more to reclaim. > + */ > + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > if (!info) > return NULL; checkpatch sayeth networking block comments don't use an empty /* line, use /* Comment... So I'll do that and shall scoot the patch Davewards. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] netfilter: on sockopt() acquire sock lock only in the required scope
Syzbot reported several deadlocks in the netfilter area caused by rtnl lock and socket lock being acquired with a different order on different code paths, leading to backtraces like the following one: == WARNING: possible circular locking dependency detected 4.15.0-rc9+ #212 Not tainted -- syzkaller041579/3682 is trying to acquire lock: (sk_lock-AF_INET6){+.+.}, at: [<8775e4dd>] lock_sock include/net/sock.h:1463 [inline] (sk_lock-AF_INET6){+.+.}, at: [<8775e4dd>] do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167 but task is already holding lock: (rtnl_mutex){+.+.}, at: [<4342eaa9>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (rtnl_mutex){+.+.}: __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607 tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106 xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845 check_target net/ipv6/netfilter/ip6_tables.c:538 [inline] find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:580 translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749 do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline] do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 entry_SYSCALL_64_fastpath+0x29/0xa0 -> #0 (sk_lock-AF_INET6){+.+.}: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914 lock_sock_nested+0xc2/0x110 net/core/sock.c:2780 lock_sock include/net/sock.h:1463 [inline] do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167 ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 entry_SYSCALL_64_fastpath+0x29/0xa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0CPU1 lock(rtnl_mutex); lock(sk_lock-AF_INET6); lock(rtnl_mutex); lock(sk_lock-AF_INET6); *** DEADLOCK *** 1 lock held by syzkaller041579/3682: #0: (rtnl_mutex){+.+.}, at: [<4342eaa9>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 The problem, as Florian noted, is that nf_setsockopt() is always called with the socket held, even if the lock itself is required only for very tight scopes and only for some operation. This patch addresses the issues moving the lock_sock() call only where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt() does not need anymore to acquire both locks. Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers") Reported-by: syzbot+a4c2dc980ac1af699...@syzkaller.appspotmail.com Suggested-by: Florian Westphal Signed-off-by: Paolo Abeni --- net/ipv4/ip_sockglue.c | 14 -- net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 6 +- net/ipv6/ipv6_sockglue.c | 17 + net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 18 -- 4 files changed, 26 insertions(+), 29 deletions(-) diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index 60fb1eb7d7d8..c7df4969f80a 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -1251,11 +1251,8 @@ int ip_setsockopt(struct sock *sk, int level, if (err == -ENOPROTOOPT && optname != IP_HDRINCL && optname != IP_IPSEC_POLICY && optname != IP_XFRM_POLICY && - !ip_mroute_opt(optname)) { - lock_sock(sk); + !ip_mroute_opt(optname)) err = nf_setsockopt(sk, PF_INET, optname, optval, optlen); - release_sock(sk); - } #endif return err; } @@ -1280,12 +1277,9 @@ int compat_ip_setsockopt(struct sock *sk, int level, int optname, if (err == -ENOPROTOOPT && optname != IP_HDRINCL && optname != IP_
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue 30-01-18 15:01:11, Florian Westphal wrote: > > From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Tue, 30 Jan 2018 14:51:07 +0100 > > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive > > Acked-by: Florian Westphal Thanks! How should we route this change? Andrew, David? -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] netfilter: fix out-of-bounds accesses in clusterip_tg_check()
Commit 136e92bbec0a switched local_nodes from an array to a bitmask but did not add proper bounds checks. As the result clusterip_config_init_nodelist() can both over-read ipt_clusterip_tgt_info.local_nodes and over-write clusterip_config.local_nodes. Add bounds checks for both. Signed-off-by: Dmitry Vyukov Fixes: 136e92bbec0a ("[NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data") Reported-by: syzbot Cc: netfilter-devel@vger.kernel.org Cc: coret...@netfilter.org Cc: net...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: syzkal...@googlegroups.com --- net/ipv4/netfilter/ipt_CLUSTERIP.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c index 69060e3abe85..1e4a7209a3d2 100644 --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c @@ -431,7 +431,7 @@ static int clusterip_tg_check(const struct xt_tgchk_param *par) struct ipt_clusterip_tgt_info *cipinfo = par->targinfo; const struct ipt_entry *e = par->entryinfo; struct clusterip_config *config; - int ret; + int ret, i; if (par->nft_compat) { pr_err("cannot use CLUSTERIP target from nftables compat\n"); @@ -450,8 +450,18 @@ static int clusterip_tg_check(const struct xt_tgchk_param *par) pr_info("Please specify destination IP\n"); return -EINVAL; } - - /* FIXME: further sanity checks */ + if (cipinfo->num_local_nodes > ARRAY_SIZE(cipinfo->local_nodes)) { + pr_info("bad num_local_nodes %u\n", cipinfo->num_local_nodes); + return -EINVAL; + } + for (i = 0; i < cipinfo->num_local_nodes; i++) { + if (cipinfo->local_nodes[i] - 1 >= + sizeof(config->local_nodes) * 8) { + pr_info("bad local_nodes[%d] %u\n", + i, cipinfo->local_nodes[i]); + return -EINVAL; + } + } config = clusterip_config_find_get(par->net, e->ip.dst.s_addr, 1); if (!config) { -- 2.16.0.rc1.238.g530d649a79-goog -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
> From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 30 Jan 2018 14:51:07 +0100 > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive Acked-by: Florian Westphal -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue 30-01-18 10:57:39, Michal Hocko wrote: > On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote: > > On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov > > wrote: > > > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote: > > >> Michal Hocko wrote: > > >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > > >> > > Kirill A. Shutemov wrote: > > >> > [...] > > >> > > > I hate what I'm saying, but I guess we need some tunable here. > > >> > > > Not sure what exactly. > > >> > > > > >> > > Would memcg help? > > >> > > > >> > That really depends. I would have to check whether vmalloc path obeys > > >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > > >> > that shouldn't be a big deal). But then the other potential problem is > > >> > the life time of the xt_table_info (or other potentially large) data > > >> > structures. Are they bound to any process life time. > > >> > > >> No. > > > > > > Well, IIUC they bound to net namespace life time, so killing all > > > proccesses in the namespace would help to get memory back. :) > > > > ... unless the namespace is mounted into file system. > > > > Let's start with NOWARN as that's what kernel generally uses for > > allocations with user-controllable size. ENOMEM is roughly as > > informative as the WARNING message in this case. > > You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc > right now. More specifically kvmalloc doesn't guanratee that the request > will not trigger the OOM killer (like regular __GFP_NORETRY). This is > because of internal vmalloc restrictions. If you are however OK to > simply bail out in most cases then __GFP_NORETRY should work reasonably > fine. > > > I think we also need to consider setting up memory cgroup for > > syzkaller test processes (we do RLIMIT_AS, but that's weak). > > Well, this is not about syzkaller, it merely pointed out a potential > DoS... And that has to be addressed somehow. So how about this? --- >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 30 Jan 2018 14:51:07 +0100 Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. This is an admin only interface but an admin in a namespace is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on the way because vmalloc has simply never fully supported __GFP_NORETRY semantic. This is still the case because e.g. page tables backing the vmalloc area are hardcoded GFP_KERNEL. Revert back to __GFP_NORETRY as a poors man defence against excessively large allocation request here. We will not rule out the OOM killer completely but __GFP_NORETRY should at least stop the large request in most cases. Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()") Signed-off-by: Michal Hocko --- net/netfilter/x_tables.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index d8571f414208..a5f5c29bcbdc 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size) if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages) return NULL; - info = kvmalloc(sz, GFP_KERNEL); + /* +* __GFP_NORETRY is not fully supported by kvmalloc but it should +* work reasonably well if sz is too large and bail out rather +* than shoot all processes down before realizing there is nothing +* more to reclaim. +*/ + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); if (!info) return NULL; -- 2.15.1 -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] netfilter : add NAT support for shifted portmap ranges
This is a patch proposal to support shifted ranges in portmaps. (i.e. tcp/udp incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100) Currently DNAT only works for single port or identical port ranges. (i.e. ports 5000-5100 on WAN interface redirected to a LAN host while original destination port is not altered) When different port ranges are configured, either 'random' mode should be used, or else all incoming connections are mapped onto the first port in the redirect range. (in described example WAN:5000-5100 will all be mapped to 192.168.1.5:2000) This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET which uses a base port value to calculate an offset with the destination port present in the incoming stream. That offset is then applied as index within the redirect port range (index modulo rangewidth to handle range overflow). In described example the base port would be 5000. An incoming stream with destination port 5004 would result in an offset value 4 which means that the NAT'ed stream will be using destination port 2004. Other possibilities include deterministic mapping of larger or multiple ranges to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port 51xx) This patch does not change any current behavior. It just adds new NAT proto range functionality which must be selected via the specific flag when intended to use. A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed which makes this functionality immediately available. Signed-off-by: Thierry Du Tre --- Changes in v4: - renamed nf_nat_range1 to nf_nat_range_v1 Changes in v3: - use nf_nat_range as name for updated struct, renamed existing nf_nat_range to nf_nat_range1 - reverted all nf_nat_range2 occurences Changes in v2: - added new revision for SNAT and DNAT targets to support the new base port variable in struct nf_nat_range2 - replaced all occurences of struct nf_nat_range with struct nf_nat_range2 include/uapi/linux/netfilter/nf_nat.h | 12 ++- net/netfilter/nf_nat_core.c | 9 +++--- net/netfilter/nf_nat_proto_common.c | 5 ++- net/netfilter/xt_nat.c| 60 +-- 4 files changed, 78 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/netfilter/nf_nat.h b/include/uapi/linux/netfilter/nf_nat.h index a33000d..5d919c7 100644 --- a/include/uapi/linux/netfilter/nf_nat.h +++ b/include/uapi/linux/netfilter/nf_nat.h @@ -10,6 +10,7 @@ #define NF_NAT_RANGE_PROTO_RANDOM (1 << 2) #define NF_NAT_RANGE_PERSISTENT(1 << 3) #define NF_NAT_RANGE_PROTO_RANDOM_FULLY(1 << 4) +#define NF_NAT_RANGE_PROTO_OFFSET (1 << 5) #define NF_NAT_RANGE_PROTO_RANDOM_ALL \ (NF_NAT_RANGE_PROTO_RANDOM | NF_NAT_RANGE_PROTO_RANDOM_FULLY) @@ -17,7 +18,7 @@ #define NF_NAT_RANGE_MASK \ (NF_NAT_RANGE_MAP_IPS | NF_NAT_RANGE_PROTO_SPECIFIED | \ NF_NAT_RANGE_PROTO_RANDOM | NF_NAT_RANGE_PERSISTENT | \ -NF_NAT_RANGE_PROTO_RANDOM_FULLY) +NF_NAT_RANGE_PROTO_RANDOM_FULLY | NF_NAT_RANGE_PROTO_OFFSET) struct nf_nat_ipv4_range { unsigned intflags; @@ -32,12 +33,21 @@ struct nf_nat_ipv4_multi_range_compat { struct nf_nat_ipv4_rangerange[1]; }; +struct nf_nat_range_v1 { + unsigned intflags; + union nf_inet_addr min_addr; + union nf_inet_addr max_addr; + union nf_conntrack_man_protomin_proto; + union nf_conntrack_man_protomax_proto; +}; + struct nf_nat_range { unsigned intflags; union nf_inet_addr min_addr; union nf_inet_addr max_addr; union nf_conntrack_man_protomin_proto; union nf_conntrack_man_protomax_proto; + union nf_conntrack_man_protobase_proto; }; #endif /* _NETFILTER_NF_NAT_H */ diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index af8345f..de5c327 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -347,9 +347,10 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple, /* Only bother mapping if it's not already in range and unique */ if (!(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) { if (range->flags & NF_NAT_RANGE_PROTO_SPECIFIED) { - if (l4proto->in_range(tuple, maniptype, - &range->min_proto, - &range->max_proto) && + if (!(range->flags & NF_NAT_RANGE_PROTO_OFFSET) && + l4proto->in_range(tuple, maniptype, + &range->min_proto, + &range->max_proto) && (range->m
[PATCH v2] extensions: libipt_DNAT: support shifted portmap ranges
This is a proposal patch for iptables DNAT extension to support shifted portmap ranges. It is related to the kernel patch proposed in earlier message '[PATCH v4] netfilter : add NAT support for shifted portmap ranges'. The definition for struct nf_nat_range was extended but backwards compatibility is achieved via the use of a new revision (2) for the DNAT target. The extensions using struct nf_nat_range were adapted for the renamed struct nf_nat_range_v1. Current DNAT revisions for Ipv4 (rev 0) and IPv6 (rev 1) are kept so functionality with older kernels is not impacted. The syntax for shifted portmaps uses an extra value in '--to-destination' for setting the base port which determines the offset in the redirect port range for incoming connections. i.e. : iptables -t nat -A zone_wan_prerouting -p tcp -m tcp --dport 5000:5100 -j DNAT --to-destination '192.168.1.2:2000-2100;5000' The base port value is totally optional, so current behavior is not impacted in any way. The use of semicolon as separator is an arbitrary choice, all other suggestions are valid of course. Another approach using an additional option seems also possible (i.e. '--base-port 5000'). However, that would mean more parsing logic with extra lines of code and thus increased risk for regression. Signed-off-by: Thierry Du Tre --- Changes in v2: - renaming old struct nf_nat_range to nf_nat_range_v1 - introduction of new DNAT revision for both IPv4 and IPv6 extensions/libip6t_DNAT.c| 166 ++- extensions/libip6t_MASQUERADE.c | 14 +- extensions/libip6t_NETMAP.c | 8 +- extensions/libip6t_REDIRECT.c| 14 +- extensions/libip6t_SNAT.c| 20 +-- extensions/libipt_DNAT.c | 275 +-- include/linux/netfilter/nf_nat.h | 15 +++ 7 files changed, 435 insertions(+), 77 deletions(-) diff --git a/extensions/libip6t_DNAT.c b/extensions/libip6t_DNAT.c index 8e478b2..aeccc36 100644 --- a/extensions/libip6t_DNAT.c +++ b/extensions/libip6t_DNAT.c @@ -34,6 +34,15 @@ static void DNAT_help(void) "[--random] [--persistent]\n"); } +static void DNAT_help_v2(void) +{ + printf( +"DNAT target options:\n" +" --to-destination [[-]][:port[-port[;port]]]\n" +" Address to map destination to.\n" +"[--random] [--persistent]\n"); +} + static const struct xt_option_entry DNAT_opts[] = { {.name = "to-destination", .id = O_TO_DEST, .type = XTTYPE_STRING, .flags = XTOPT_MAND | XTOPT_MULTI}, @@ -44,7 +53,7 @@ static const struct xt_option_entry DNAT_opts[] = { /* Ranges expected in network order. */ static void -parse_to(const char *orig_arg, int portok, struct nf_nat_range *range) +parse_to(const char *orig_arg, int portok, struct nf_nat_range *range, int rev) { char *arg, *start, *end = NULL, *colon = NULL, *dash, *error; const struct in6_addr *ip; @@ -144,10 +153,10 @@ parse_to(const char *orig_arg, int portok, struct nf_nat_range *range) return; } -static void DNAT_parse(struct xt_option_call *cb) +static void _DNAT_parse(struct xt_option_call *cb, + struct nf_nat_range *range, int rev) { const struct ip6t_entry *entry = cb->xt_entry; - struct nf_nat_range *range = cb->data; int portok; if (entry->ipv6.proto == IPPROTO_TCP || @@ -175,16 +184,40 @@ static void DNAT_parse(struct xt_option_call *cb) } } -static void DNAT_fcheck(struct xt_fcheck_call *cb) +static void DNAT_parse(struct xt_option_call *cb) +{ + struct nf_nat_range_v1 *range_v1 = (void *)cb->data; + struct nf_nat_range range = {}; + + memcpy(&range, range_v1, sizeof(*range_v1)); + _DNAT_parse(cb, &range, 1); + memcpy(range_v1, &range, sizeof(*range_v1)); +} + +static void DNAT_parse_v2(struct xt_option_call *cb) +{ + _DNAT_parse(cb, (struct nf_nat_range *)cb->data, 2); +} + +static void _DNAT_fcheck(struct xt_fcheck_call *cb, unsigned int *flags) { static const unsigned int f = F_TO_DEST | F_RANDOM; - struct nf_nat_range *mr = cb->data; if ((cb->xflags & f) == f) - mr->flags |= NF_NAT_RANGE_PROTO_RANDOM; + *flags |= NF_NAT_RANGE_PROTO_RANDOM; +} + +static void DNAT_fcheck(struct xt_fcheck_call *cb) +{ + _DNAT_fcheck(cb, &((struct nf_nat_range_v1 *)cb->data)->flags); } -static void print_range(const struct nf_nat_range *range) +static void DNAT_fcheck_v2(struct xt_fcheck_call *cb) +{ + _DNAT_fcheck(cb, &((struct nf_nat_range *)cb->data)->flags); +} + +static void print_range(const struct nf_nat_range *range, int rev) { if (range->flags & NF_NAT_RANGE_MAP_IPS) { if (range->flags & NF_NAT_RANGE_PROTO_SPECIFIED) @@ -201,36 +234,63 @@ static void print_range(const struct nf_nat_range *range) printf("%hu", ntohs(range->min_proto.tcp.port)); if (range->max_proto.tcp.port != range->min_proto.tc
Re: question about UNDEFINE/REDEFINE
Hello Arturo, Dne pátek 26. ledna 2018 19:43:18 CET, Arturo Borrero Gonzalez napsal(a): > My suggestion is to simply create one variable per value: > > define INET_IFACES_VLAN43 = { bond0.x, bond3.y} > define INET_IFACES_VLAN3 = { bond3.x, bond3.y} > define XXX_VLAN43 = xxx > define XXX_VLAN3 = xxx > > you could generate such a file, something like 'defines.nft' and > include it once in your main ruleset file. that is exactly the boilerplate that we are trying to avoid. By using consistent (and non-unique) variable names we are able to freely move the rules from one customer to another without rewriting every use of a variable every time. We also do not want to build a code-generating harness in bash (or any other language) since that would sort of defeat the purpose of scripting in nftables in my eyes. the redefine keyword was just my first idea to solve the problem of a single flat variable scope. There may be a better approach but I think that if nftables wants to have scripting capabilities, some kind of variable scoping (even in flat notation) and more ubiquitous variable use within rules is necessary. I event went so far and made some experimental patches that allowed me to use string variables and string concatenation in places like interface names and rule targets. With that I was able to create very generic rules and I tied them to a customer/VLAN just by changing one or two constants in the header of a file (e.g. the VLAN number). Of course, I had to use redefine in the header. -- S pozdravem, David Fabian Cluster Design, s.r.o. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: question about UNDEFINE/REDEFINE
Hello Pablo, Dne pátek 26. ledna 2018 14:45:49 CET, Pablo Neira Ayuso napsal(a): > 2) Probably even cleaner is to look at 'local' scopes like in bash. > > define local IP1 = 1.1.1.1 > > so the symbol is bound to this file - consider the content of a file > determines a given scope. This can be also useful to the nested > notation. > > 3) You rework your ruleset to use the notation with nesting :-). But I > think 2) can be useful for both the flat and nested notation. > > I'm not asking you to do 2), but I would like to see how a patch that > adds explicit scoping for the flat ruleset representation looks like. I know about scopes in the code but unfortunately as you said, the flat notation only has a single scope. Since we are talking about analogy to bash, bash allows me to redefine a variable in the same scope. Variables in nftables feel more like constants which is not necessarily bad as it can prevent some typos but is hard to work with in scripting as it's not that flexible. From those options you listed I would strongly prefer to have an implicit scope for each file included in the flat notation. That way, defining a variable in one file would not collide with the same variable in a sibling include. Variables from outer scopes would still be available in inner scopes. For people that would want to have their "global" definitions in a separate include, I would recommend creating a new keyword like global or export that would tie a variable to the top-level scope and thus make it available to everyone. I don't think that would be that hard to implement and I may try to if we agree on it. Anyway there should definitely be a way to de-register (undefine) a variable from a scope to prevent a misuse due to typos. By the way, can we restructure the FW using nesting and still be able to retain all per-customer rules in a single file? Wouldn't that require us to split prerouting, postrouting, forward and other rules to separate scopes/ table definitions? That would be highly inconvenient. -- S pozdravem, David Fabian Cluster Design, s.r.o. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote: > On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov > wrote: > > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote: > >> Michal Hocko wrote: > >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > >> > > Kirill A. Shutemov wrote: > >> > [...] > >> > > > I hate what I'm saying, but I guess we need some tunable here. > >> > > > Not sure what exactly. > >> > > > >> > > Would memcg help? > >> > > >> > That really depends. I would have to check whether vmalloc path obeys > >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > >> > that shouldn't be a big deal). But then the other potential problem is > >> > the life time of the xt_table_info (or other potentially large) data > >> > structures. Are they bound to any process life time. > >> > >> No. > > > > Well, IIUC they bound to net namespace life time, so killing all > > proccesses in the namespace would help to get memory back. :) > > ... unless the namespace is mounted into file system. > > Let's start with NOWARN as that's what kernel generally uses for > allocations with user-controllable size. ENOMEM is roughly as > informative as the WARNING message in this case. You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc right now. More specifically kvmalloc doesn't guanratee that the request will not trigger the OOM killer (like regular __GFP_NORETRY). This is because of internal vmalloc restrictions. If you are however OK to simply bail out in most cases then __GFP_NORETRY should work reasonably fine. > I think we also need to consider setting up memory cgroup for > syzkaller test processes (we do RLIMIT_AS, but that's weak). Well, this is not about syzkaller, it merely pointed out a potential DoS... And that has to be addressed somehow. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue 30-01-18 09:11:27, Florian Westphal wrote: > Michal Hocko wrote: > > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > > > Kirill A. Shutemov wrote: > > [...] > > > > I hate what I'm saying, but I guess we need some tunable here. > > > > Not sure what exactly. > > > > > > Would memcg help? > > > > That really depends. I would have to check whether vmalloc path obeys > > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > > that shouldn't be a big deal). But then the other potential problem is > > the life time of the xt_table_info (or other potentially large) data > > structures. Are they bound to any process life time. > > No. > > > Because if they are > > not then the OOM killer will not help. The OOM panic earlier in this > > thread suggests it doesn't because the test case managed to eat all the > > available memory and killed all the eligible tasks which didn't help. > > Yes, which is why we do not want any OOM killer invocation in first > place... The problem is that as soon as you eat that memory and ask for more until you fail with ENOMEM then the OOM is simply unavoidable. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: possible deadlock in xt_find_revision
#syz dup: possible deadlock in do_ip_getsockopt -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: possible deadlock in xt_find_table_lock
#syz dup: possible deadlock in do_ip_getsockopt -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov wrote: > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote: >> Michal Hocko wrote: >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote: >> > > Kirill A. Shutemov wrote: >> > [...] >> > > > I hate what I'm saying, but I guess we need some tunable here. >> > > > Not sure what exactly. >> > > >> > > Would memcg help? >> > >> > That really depends. I would have to check whether vmalloc path obeys >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but >> > that shouldn't be a big deal). But then the other potential problem is >> > the life time of the xt_table_info (or other potentially large) data >> > structures. Are they bound to any process life time. >> >> No. > > Well, IIUC they bound to net namespace life time, so killing all > proccesses in the namespace would help to get memory back. :) ... unless the namespace is mounted into file system. Let's start with NOWARN as that's what kernel generally uses for allocations with user-controllable size. ENOMEM is roughly as informative as the WARNING message in this case. I think we also need to consider setting up memory cgroup for syzkaller test processes (we do RLIMIT_AS, but that's weak). -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote: > Michal Hocko wrote: > > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > > > Kirill A. Shutemov wrote: > > [...] > > > > I hate what I'm saying, but I guess we need some tunable here. > > > > Not sure what exactly. > > > > > > Would memcg help? > > > > That really depends. I would have to check whether vmalloc path obeys > > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > > that shouldn't be a big deal). But then the other potential problem is > > the life time of the xt_table_info (or other potentially large) data > > structures. Are they bound to any process life time. > > No. Well, IIUC they bound to net namespace life time, so killing all proccesses in the namespace would help to get memory back. :) -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
Michal Hocko wrote: > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > > Kirill A. Shutemov wrote: > [...] > > > I hate what I'm saying, but I guess we need some tunable here. > > > Not sure what exactly. > > > > Would memcg help? > > That really depends. I would have to check whether vmalloc path obeys > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > that shouldn't be a big deal). But then the other potential problem is > the life time of the xt_table_info (or other potentially large) data > structures. Are they bound to any process life time. No. > Because if they are > not then the OOM killer will not help. The OOM panic earlier in this > thread suggests it doesn't because the test case managed to eat all the > available memory and killed all the eligible tasks which didn't help. Yes, which is why we do not want any OOM killer invocation in first place... > So in some sense the memcg would help to stop the excessive allocation, > but it wouldn't resolve it other than kill all tasks in the affected > memcg/container. Whether this is sufficient or not, I dunno. It sounds > quite suboptimal to me. But it is true this would be less tricky then > adding a global knob... Global knob doesn't really help at all, I can add multiple large iptables rulesets (so we would have to account), and we have same issue in virtually all of networking, so we need limits for interface count, tunnel count, ipsec policies/SAs, nftables, tc, etc etc. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html